Open RobinL opened 5 days ago
The fix is that table registration should accept an Arrow table: https://github.com/moj-analytical-services/splink/blob/8b44ab58d39a798a443e1ec5ddef6149f072ace2/splink/internals/spark/database_api.py#L72
Actually that's no good, because you can't pass an Arrow table directly to Spark.
```python
from pyspark.sql.types import StructType, StructField, StringType

import pandas as pd

r1 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": None,
}
r2 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": "1980-01-01",
}

schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("surname", StringType(), True),
    StructField("dob", StringType(), True),
])

in_1 = spark.createDataFrame([r1], schema=schema)
in_2 = spark.createDataFrame([r2], schema=schema)

# linker.inference.compare_two_records(r1, r2).as_pandas_dataframe()
linker.inference.compare_two_records(
    in_1, in_2
).as_pandas_dataframe()
```
This should probably be allowed.

The only reason you can't do that at the moment is that we add `[]` around the record! We should only do that if it's a dict. That should fix it.
I applied a fix that allows two Spark DataFrames (with schemas) to be passed in to compare_two_records:
```python
if isinstance(record_1, dict):
    record_1 = [record_1]
if isinstance(record_2, dict):
    record_2 = [record_2]

uid = ascii_uid(8)

df_records_left = self._linker.table_management.register_table(
    record_1, f"__splink__compare_two_records_left_{uid}", overwrite=True
)
df_records_left.templated_name = "__splink__compare_two_records_left"

df_records_right = self._linker.table_management.register_table(
    record_2, f"__splink__compare_two_records_right_{uid}", overwrite=True
)
df_records_right.templated_name = "__splink__compare_two_records_right"
```
But I'm giving up for now, because the number of partitions seems to explode even when running the query in plain Spark, which inexplicably results in something like `[Stage 0:> (252 + 12) / 20736]`, even though predict() is basically the same query.
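As a guess at where the 20736 tasks might come from (pure speculation, not confirmed from the Spark plan): with `spark.sql.shuffle.partitions` set to 12, a cartesian-style join between two inputs of 12 × 12 = 144 partitions each would schedule one task per partition pair:

```python
# Speculative arithmetic only: 144 partitions on each side of a
# cartesian join would yield one task per partition pair.
partitions_per_side = 12 * 12       # 144
tasks = partitions_per_side ** 2    # 144 * 144
print(tasks)  # 20736
```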
I've tried repartitioning, going through pandas, etc., and the result always seems to be the same.
This seems to fix it:

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "1MB")
```
In Splink 4, the thing that changed is that blocking results in a pairwise table of records. That's probably the cause of the bug.
It's a bit of a hassle, but the fix is probably to cut the blocking step entirely out of compare_two_records. Since we already know what the result is, we just need it to be a table with a single row, like `_left = _right`.
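A sketch of what that could look like (the table names match the registration code above, but the helper function and column-aliasing scheme here are illustrative, not Splink's actual internals): instead of running blocking, build the single-row pairwise table directly with a cross join of the two one-row inputs:

```python
# Illustrative sketch: generate the SQL for the pairwise table directly,
# bypassing blocking, since each side holds exactly one record.
def pairwise_sql(cols):
    left = ", ".join(f"l.{c} AS {c}_l" for c in cols)
    right = ", ".join(f"r.{c} AS {c}_r" for c in cols)
    return (
        f"SELECT {left}, {right} "
        "FROM __splink__compare_two_records_left AS l "
        "CROSS JOIN __splink__compare_two_records_right AS r"
    )

sql = pairwise_sql(["first_name", "surname", "dob"])
print(sql)
```

With one row on each side the cross join produces exactly one pairwise row, so there is no partition explosion to worry about.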
This fails in Spark with:
```python
from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession

import pandas as pd

import splink.comparison_library as cl
from splink import (
    Linker,
    SettingsCreator,
    SparkAPI,
    block_on,
    splink_datasets,
)
from splink.backends.spark import similarity_jar_location

path = similarity_jar_location()

df_pandas = splink_datasets.fake_1000

conf = SparkConf()
conf.set("spark.jars", path)
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")
conf.set("spark.default.parallelism", "12")

sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)

df = spark.createDataFrame(df_pandas)

db_api = SparkAPI(
    spark_session=spark,
    break_lineage_method="parquet",
    num_partitions_on_repartition=6,
)

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    max_iterations=2,
)

linker = Linker(df, settings, db_api)

pairwise_predictions = linker.inference.predict(threshold_match_weight=-10)

r1 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": "1980-01-01",
}
r2 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": None,
}

pd.DataFrame([r1, r2])

linker.inference.compare_two_records(r1, r2).as_pandas_dataframe()
```

Ultimate issue is with https://github.com/moj-analytical-services/splink/blob/8b44ab58d39a798a443e1ec5ddef6149f072ace2/splink/internals/spark/database_api.py#L64-L76