Closed: MarianaBazely closed this issue 1 year ago.
Related to #553.
@Th368MoJ I wonder if this could be related to the pandas null issue. I think it's an outside chance, but it may be worth Mariana trying first because it's simple.
@MarianaBazely Could you try ensuring all null values are None rather than np.nan or pd.NA, and then rerunning?
Sample code to clear up zero-length strings and any instances of numpy/pandas nulls:

```python
import numpy as np
import pandas as pd

data = [
    {'a': '', 'b': pd.NA, 'c': np.nan}
]
df = pd.DataFrame(data)

# deal with col a: empty/whitespace-only strings become None
df = df.replace(r'^\s*$', None, regex=True)

# deal with cols b and c: normalise both null flavours to None
df2 = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])
```
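If it helps, here's a self-contained sketch of the same clean-up using a common pandas idiom (`where` over `notna`), with a check that every cell really ends up as pure-Python None. This is illustrative, not Splink-specific:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'a': '', 'b': pd.NA, 'c': np.nan}])

# empty / whitespace-only strings become proper missing values first
df = df.replace(r'^\s*$', np.nan, regex=True)

# then cast to object and swap every missing value for pure-Python None
df = df.astype(object).where(df.notna(), None)

print(all(value is None for value in df.iloc[0]))  # prints True
```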
When I changed my blocking rules I was able to generate the image...
Did you manage to identify which blocking rule was causing the problem? It would be useful to know, as it may help us track this down.
Did clearing up the nulls make any difference?
All nulls have been replaced with np.nan. I can remove all rows with nulls instead.
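Removing the rows with nulls, as mentioned above, could be sketched like this (made-up data, purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':  ['ann', None, 'bob'],
    'byear': [1990.0, 1985.0, np.nan],
})

# keep only rows with no nulls in any column
df_clean = df.dropna()

print(len(df_clean))  # 1
```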
These were the blocking rules that "caused" (I'm not sure they're actually the guilty party) the unicode error:
"blocking_rules_to_generate_predictions": [
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.bday = r.bday",
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.byear = r.byear",
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla"
Anything else is fine, for example this huge one below:
"blocking_rules_to_generate_predictions": [
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.bday = r.bday",
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.byear = r.byear",
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla",
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla and l.ethnicity = r.ethnicity",
"l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla and l.ethnicity = r.ethnicity and l.sex = r.sex",
"l.finitial = r.finitial and l.sinitial = r.sinitial and r.ethnicity = l.ethnicity and l.byear = r.byear and l.utla = r.utla",
"l.sinitial = r.sinitial and l.finitial = r.finitial and l.bday = r.bday and l.byear = r.byear and l.bmonth = r.bmonth \
and r.ethnicity = l.ethnicity and l.utla = r.utla",
"l.sinitial = r.sinitial and l.finitial = r.finitial and l.bday = r.bday and l.byear = r.byear and l.bmonth = r.bmonth \
and r.ethnicity = l.ethnicity and l.utla = r.utla and l.sex = r.sex",
Weird! Please could you try the broken blocking rules together with the code I posted above, which replaces np.nan and pd.NA with pure Python None?
I am experiencing the same error
```
Traceback (most recent call last):
  File "c:/path/test_splink.py", line 92, in <module>
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 999, in estimate_parameters_using_expectation_maximisation
    em_training_session._train()
  File "c:/path/to/env/lib\site-packages\splink\em_training_session.py", line 165, in _train
    cvv = self._comparison_vectors()
  File "c:/path/to/env/lib\site-packages\splink\em_training_session.py", line 161, in _comparison_vectors
    return self._training_linker._execute_sql_pipeline(input_dataframes)
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 346, in _execute_sql_pipeline
    dataframe = self._sql_to_splink_dataframe_checking_cache(
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 481, in _sql_to_splink_dataframe_checking_cache
    splink_dataframe = self._execute_sql_against_backend(
  File "c:/path/to/env/lib\site-packages\splink\duckdb\duckdb_linker.py", line 190, in _execute_sql_against_backend
    self._con.execute(sql).fetch_df()
duckdb.InternalException: INTERNAL Error: Invalid unicode detected in segment statistics update!
```
I am working on Windows 10 with an input dataset of 2 million rows, 23 columns.
I tried changing blanks/na/nan to None with @RobinL's suggested code, but it did not help:

```python
df = df.replace(r'^\s*$', None, regex=True)
df = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])
```
I also tried discarding all non-ASCII characters, which also did not help:

```python
for col in df.columns[1:]:
    df[col] = df[col].str.encode('ascii', 'ignore').str.decode('ascii', 'ignore')
```
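To help narrow down which rows might be the culprits, here's a small diagnostic sketch (not part of Splink; the column name is made up) that flags rows containing non-ASCII characters rather than stripping them:

```python
import pandas as pd

def non_ascii_rows(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return the rows whose value in `col` contains any non-ASCII character."""
    # fullmatch against the ASCII range; nulls count as "all ASCII" (na=True)
    is_ascii = df[col].str.fullmatch(r'[\x00-\x7f]*', na=True)
    return df[~is_ascii]

df = pd.DataFrame({'name': ['alice', 'bj\u00f6rk', 'bob', None]})
print(non_ascii_rows(df, 'name'))  # only the 'björk' row is flagged
```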
These are the blocking rules:

```
'blocking_rules_to_generate_predictions' = [
    'l.first_name = r.first_name and l.last_name = r.last_name',
    'l.birth_date = r.birth_date',
    'l.properties_city = r.properties_city and l.country = r.country'
]
```
With these blocking rules the error goes away:

```
'blocking_rules_to_generate_predictions' = [
    'l.first_name = r.first_name and l.last_name = r.last_name',
    'l.birth_date = r.birth_date'
]
```
I also tried playing around with the size of the input data set (under the apparently incorrect assumption that it was bad character encoding causing the error). If I narrow the input to the first 14,000 rows the error still occurs. If I narrow it down to the first 13,000 rows the error does not occur.
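The manual narrowing described above could be automated with a simple bisection, assuming the failure is monotone in prefix length (if the first n rows fail, any longer prefix fails too). `triggers_error` is a hypothetical callback that runs the linkage on the first n rows and reports whether it crashed:

```python
def smallest_failing_prefix(n_rows: int, triggers_error) -> int:
    """Binary-search for the smallest prefix length that still triggers the error."""
    lo, hi = 1, n_rows  # assumes the full dataset (hi rows) fails
    while lo < hi:
        mid = (lo + hi) // 2
        if triggers_error(mid):
            hi = mid        # error reproduces with fewer rows; shrink
        else:
            lo = mid + 1    # error needs more rows; grow
    return lo

# toy stand-in: pretend row 13,500 is the first bad row
print(smallest_failing_prefix(14_000, lambda n: n >= 13_500))  # 13500
```

Each probe re-runs the pipeline, so this costs O(log n) linkage runs rather than a manual guess-and-check.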
Hi, I converted my JSON data to a Parquet file (using Java), then used a `create table ... as select * from read_parquet()` SQL statement to import the data into DuckDB, and hit the same exception. My data has 3 fields; one of them is a numeric type and contains null values. Looking forward to your reply.
Thanks for the info. Are you on Windows? Is the source of the data Microsoft SQL Server? One workaround is to use the Spark linker, which doesn't suffer from this problem.
I am on Windows. The original source of the data is MSSQL, but I am just reading it from a CSV.
Thanks, that seems to be the pattern: every time we've seen this error, the data came from MSSQL Server on Windows. We'll continue to look into it. It's been challenging because I haven't yet been able to come up with a small, simple reproducible example.
Yes, I'm on Windows 10. Using Spark, it worked.
@leewilson-kmd I don't suppose it would be possible to share the data that causes the error (the CSV file)? If so, it would be best to share it directly with mark@duckdblabs.com (see https://github.com/duckdb/duckdb/issues/1650).
It looks like this may finally be solved on the latest master of duckdb: https://github.com/duckdb/duckdb/issues/1650#issuecomment-1312534236
It would be useful if people who have seen this issue could check whether it's gone away on the latest duckdb, so we can bump our dependencies and close this issue.
Closing this, as no one has reported it for a long time and the duckdb team believe it's fixed.
Hi,
this is the error I get when I run:

```python
clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)
```

Some extra info: Splink version 3.0.1

```
Name                Version   Build           Channel
anaconda            2022.05   py39_0
anaconda-client     1.9.0     py39haa95532_0
anaconda-navigator  2.1.4     py39haa95532_0
anaconda-project    0.10.2    pyhd3eb1b0_0
```

Windows specifications: Edition Windows 10 Enterprise, Version 20H2

Thank you very much