moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

INTERNAL Error: Invalid unicode detected in segment statistics update! #664

Closed MarianaBazely closed 1 year ago

MarianaBazely commented 2 years ago

Hi,

This is the error I get when I run `clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)`:


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [51], in <cell line: 1>()
----> 1 clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:1211, in Linker.cluster_pairwise_predictions_at_threshold(self, df_predict, threshold_match_probability)
   1203 self._initialise_df_concat_with_tf(df_predict)
   1205 edges_table = _cc_create_unique_id_cols(
   1206     self,
   1207     df_predict,
   1208     threshold_match_probability,
   1209 )
-> 1211 cc = solve_connected_components(self, edges_table)
   1213 return cc

File ~\AppData\Roaming\Python\Python310\site-packages\splink\connected_components.py:411, in solve_connected_components(linker, edges_table, _generated_graph)
    409 sql = _cc_update_representatives_first_iter()
    410 # Execute if we have no batching, otherwise add it to our batched process
--> 411 representatives = linker._enqueue_and_execute_sql_pipeline(
    412     sql, "__splink__df_representatives"
    413 )
    414 prev_representatives_table = representatives
    416 # Loop while our representative table still has unsettled nodes

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:364, in Linker._enqueue_and_execute_sql_pipeline(self, sql, output_table_name, materialise_as_hash, use_cache, transpile)
    361 """Wrapper method to enqueue and execute a sql pipeline in a single call."""
    363 self._enqueue_sql(sql, output_table_name)
--> 364 return self._execute_sql_pipeline([], materialise_as_hash, use_cache, transpile)

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:323, in Linker._execute_sql_pipeline(self, input_dataframes, materialise_as_hash, use_cache, transpile)
    319     sql_gen = self._pipeline._generate_pipeline(input_dataframes)
    321     output_tablename_templated = self._pipeline.queue[-1].output_table_name
--> 323     dataframe = self._sql_to_splink_dataframe(
    324         sql_gen,
    325         output_tablename_templated,
    326         materialise_as_hash,
    327         use_cache,
    328         transpile,
    329     )
    330     return dataframe
    331 else:
    332     # In debug mode, we do not pipeline the sql and print the
    333     # results of each part of the pipeline

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:405, in Linker._sql_to_splink_dataframe(self, sql, output_tablename_templated, materialise_as_hash, use_cache, transpile)
    402     print(sql)
    404 if materialise_as_hash:
--> 405     splink_dataframe = self._execute_sql(
    406         sql, output_tablename_templated, table_name_hash, transpile=transpile
    407     )
    408 else:
    409     splink_dataframe = self._execute_sql(
    410         sql,
    411         output_tablename_templated,
    412         output_tablename_templated,
    413         transpile=transpile,
    414     )

File ~\AppData\Roaming\Python\Python310\site-packages\splink\duckdb\duckdb_linker.py:227, in DuckDBLinker._execute_sql(self, sql, templated_name, physical_name, transpile)
    220 logger.log(5, log_sql(sql))
    222 sql = f"""
    223 CREATE TABLE {physical_name}
    224 AS
    225 ({sql})
    226 """
--> 227 self._con.execute(sql).fetch_df()
    229 return DuckDBLinkerDataFrame(templated_name, physical_name, self)

RuntimeError: INTERNAL Error: INTERNAL Error: Invalid unicode detected in segment statistics update!

Some extra info: Splink version 3.0.1

Name                Version  Build           Channel
anaconda            2022.05  py39_0
anaconda-client     1.9.0    py39haa95532_0
anaconda-navigator  2.1.4    py39haa95532_0
anaconda-project    0.10.2   pyhd3eb1b0_0

Windows specifications: Edition Windows 10 Enterprise, Version 20H2

Thank you very much

mamonu commented 2 years ago

related to #553

RobinL commented 2 years ago

@Th368MoJ I wonder if this could be related to the pandas null issue. I think it's an outside chance, but it may be worth Mariana trying first because it's simple.

@MarianaBazely Could you try ensuring all null values are None rather than np.nan or pd.NA and rerunning.

Sample code to clean up zero-length strings and any instances of numpy/pandas nulls:

import numpy as np
import pandas as pd

data = [
    {'a': '', 'b': pd.NA, 'c': np.nan}
]

df = pd.DataFrame(data)

# deal with col a: empty / whitespace-only strings become None
df = df.replace(r'^\s*$', None, regex=True)

# deal with cols b and c: pd.NA / np.nan become plain Python None
df2 = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])

MarianaBazely commented 2 years ago

When I changed my blocking rules I was able to generate the image...

RobinL commented 2 years ago

Did you manage to identify which blocking rule was causing the problem? It would be useful to know, as it may help us track down this issue.

Did clearing up the nulls make any difference?

MarianaBazely commented 2 years ago

All nulls have been replaced with np.nan. I can remove all rows with nulls instead.
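For reference, a minimal pandas sketch of both routes mentioned here - converting np.nan/pd.NA values to plain Python None (as suggested above) or dropping rows with nulls - using an illustrative dataframe, not the real data:

import numpy as np
import pandas as pd

# Illustrative data only; the real dataset is not shown in this thread.
df = pd.DataFrame([{"name": "ann", "dob": np.nan}, {"name": pd.NA, "dob": "1990-01-01"}])

# Option 1: convert numpy/pandas null sentinels to plain Python None
df_none = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])

# Option 2: drop every row that contains any null
df_dropped = df.dropna()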

These were the blocking rules that "caused" (I'm not sure if they are guilty) the unicode error:

    "blocking_rules_to_generate_predictions": [
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bday = r.bday",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.byear = r.byear",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla"

Anything else, for example this huge one below, is fine:

    "blocking_rules_to_generate_predictions": [
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bday = r.bday",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.byear = r.byear",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla and l.ethnicity = r.ethnicity",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla and l.ethnicity = r.ethnicity and l.sex = r.sex",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and r.ethnicity = l.ethnicity and l.byear = r.byear and l.utla = r.utla",
    "l.sinitial = r.sinitial and l.finitial = r.finitial and l.bday = r.bday and l.byear = r.byear and l.bmonth = r.bmonth \
    and r.ethnicity = l.ethnicity and l.utla = r.utla",
    "l.sinitial = r.sinitial and l.finitial = r.finitial and l.bday = r.bday and l.byear = r.byear and l.bmonth = r.bmonth \
    and r.ethnicity = l.ethnicity and l.utla = r.utla and l.sex = r.sex",
RobinL commented 2 years ago

Weird! Please could you try the broken blocking rules with the code I posted above that replaces np.nan and pd.NA with pure Python None?

leewilson-kmd commented 1 year ago

I am experiencing the same error:

Traceback (most recent call last):
  File "c:/path/test_splink.py", line 92, in <module>
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 999, in estimate_parameters_using_expectation_maximisation
    em_training_session._train()
  File "c:/path/to/env/lib\site-packages\splink\em_training_session.py", line 165, in _train
    cvv = self._comparison_vectors()
  File "c:/path/to/env/lib\site-packages\splink\em_training_session.py", line 161, in _comparison_vectors
    return self._training_linker._execute_sql_pipeline(input_dataframes)
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 346, in _execute_sql_pipeline
    dataframe = self._sql_to_splink_dataframe_checking_cache(
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 481, in _sql_to_splink_dataframe_checking_cache
    splink_dataframe = self._execute_sql_against_backend(
  File "c:/path/to/env/lib\site-packages\splink\duckdb\duckdb_linker.py", line 190, in _execute_sql_against_backend
    self._con.execute(sql).fetch_df()
duckdb.InternalException: INTERNAL Error: Invalid unicode detected in segment statistics update!

I am working on Windows 10 with an input dataset of 2 million rows, 23 columns.

I tried changing blanks/na/nan to None with @RobinL's suggested code but it did not help:

df = df.replace(r'^\s*$', None, regex=True)
df = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])

I also tried discarding all non-ASCII characters, which also did not help:

# strip non-ASCII characters from every (string) column except the first
for col in df.columns[1:]:
    df[col] = df[col].str.encode('ascii', 'ignore').str.decode('ascii', 'ignore')

These are the blocking rules:

'blocking_rules_to_generate_predictions': [
  'l.first_name = r.first_name and l.last_name = r.last_name',
  'l.birth_date = r.birth_date',
  'l.properties_city = r.properties_city and l.country = r.country'
]

With these blocking rules, the error goes away:

'blocking_rules_to_generate_predictions': [
  'l.first_name = r.first_name and l.last_name = r.last_name',
  'l.birth_date = r.birth_date'
]

I also tried playing around with the size of the input data set (under the apparently incorrect assumption that it was bad character encoding causing the error). If I narrow the input to the first 14,000 rows the error still occurs. If I narrow it down to the first 13,000 rows the error does not occur.
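Not from the thread, but a sketch of how that kind of narrowing could be automated: a prefix bisection that finds the smallest number of rows on which the linkage step still fails. Here `run_linkage` is a hypothetical callable that wraps the splink steps (build the linker, train, predict) for whatever dataframe it is given, and is assumed to raise an exception on failure.

def smallest_failing_prefix(df, run_linkage):
    """Return the smallest n such that run_linkage(df.head(n)) raises.

    Assumes run_linkage fails on the full dataframe and succeeds on a
    sufficiently small prefix; run_linkage is a hypothetical helper that
    wraps the splink training/prediction calls for a given dataframe.
    """
    lo, hi = 0, len(df)  # invariant: head(lo) succeeds, head(hi) fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            run_linkage(df.head(mid))
            lo = mid  # still succeeds with mid rows
        except Exception:
            hi = mid  # failure reproduces with only mid rows
    return hi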

Stongtong commented 1 year ago

Hi, I converted my JSON data to a Parquet file (using Java), then used a "create table ... as select * from read_parquet()" SQL statement to import the data into DuckDB, and I hit the same exception. My data has 3 fields; one of them is a numeric type and contains null values. Looking forward to your reply.
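For anyone trying to reproduce this path, a minimal sketch of the import described above, done via DuckDB's Python API rather than Java; 'data.parquet' and my_table are placeholder names, not taken from the thread:

import duckdb

con = duckdb.connect()  # in-memory database
# 'data.parquet' stands in for the commenter's file; adjust the path as needed.
con.execute(
    "CREATE TABLE my_table AS SELECT * FROM read_parquet('data.parquet')"
)
print(con.execute("SELECT COUNT(*) FROM my_table").fetchone())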

RobinL commented 1 year ago

Thanks for the info. Are you on Windows? Is the source of the data Microsoft SQL Server? One workaround is to use the Spark linker, which doesn't suffer from this problem.
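A rough sketch of that Spark workaround, assuming the SparkLinker(df, settings) constructor form from the splink 3 docs; the import path and exact signature may differ between splink versions, and `settings` stands in for the settings dictionary already in use (not reproduced here):

from pyspark.sql import SparkSession

# Assumed import path for splink 3.x at the time of this thread; it may differ
# in other versions.
from splink.spark.spark_linker import SparkLinker

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(df)  # df: the same (cleaned) pandas dataframe

# settings is assumed to be the same settings dictionary used with the DuckDB
# linker earlier in the thread.
linker = SparkLinker(spark_df, settings)
df_predict = linker.predict()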

leewilson-kmd commented 1 year ago

I am on Windows. The original source of the data is MSSQL, but I am just reading it from a CSV.

RobinL commented 1 year ago

Thanks - that seems to be the pattern - every time we've seen this error, the data is from MSSQL Server on Windows. We'll continue to look into it. It's been challenging because I haven't yet been able to come up with a small, simple reproducible example.

Stongtong commented 1 year ago

Yes, I'm on Windows 10. Using Spark, it worked.

RobinL commented 1 year ago

@leewilson-kmd I don't suppose it would be possible to share the data that causes the error (the CSV file)? If so, it would be best to share it directly with mark@duckdblabs.com (see https://github.com/duckdb/duckdb/issues/1650)

RobinL commented 1 year ago

It looks like this may finally be solved on the latest master of DuckDB: https://github.com/duckdb/duckdb/issues/1650#issuecomment-1312534236

It would be useful if people who have seen this issue could check whether it's gone away on the latest DuckDB, so we can bump our dependencies and close this issue.
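A quick way to check which DuckDB version you're on before and after upgrading (e.g. with `pip install --upgrade duckdb`), if you want to test whether the fix has landed:

import duckdb

# Print the installed DuckDB version; after upgrading, re-run the failing
# splink pipeline to see whether the error has gone away.
print(duckdb.__version__)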

RobinL commented 1 year ago

Closing this as no one has reported it for a long time and the DuckDB team think it's fixed.