Open ADBond opened 1 week ago
It seems to be something to do with that dataset, as things appear fine if we use splink_datasets.fake_1000. Or quite possibly I am doing something stupid.
Introduced since 4.0.2
Alright, so I think the issue here is that debug mode reuses templated names instead of physical names. This only affects clustering with multiple iterations. So instead of SQL like:
select node_id as node_id, representative as cluster_id
from __splink__representatives_stable_16f74d750 UNION ALL select node_id as node_id, representative as cluster_id
from __splink__representatives_stable_f92f88f19 UNION ALL select node_id as node_id, representative as cluster_id
from __splink__representatives_stable_67585d97a UNION ALL select node_id as node_id, representative as cluster_id
from __splink__representatives_stable_8c1eef1c6 UNION ALL select node_id as node_id, representative as cluster_id
from __splink__df_representatives_4_6f14fb719
in debug mode we instead get:
select node_id as node_id, representative as cluster_id
from __splink__representatives_stable UNION ALL select node_id as node_id, representative as cluster_id
from __splink__representatives_stable UNION ALL select node_id as node_id, representative as cluster_id
from __splink__representatives_stable UNION ALL select node_id as node_id, representative as cluster_id
from __splink__representatives_stable UNION ALL select node_id as node_id, representative as cluster_id
from __splink__df_representatives_4
So instead of getting stable nodes from all iterations, we end up with (number of iterations) copies of the final set of stable nodes.
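To illustrate the mechanism outside Splink, here is a minimal sketch using Python's built-in sqlite3. It is not Splink's actual code: the function name `union_stable_nodes`, the table names, and the toy data are all made up for illustration. Each "iteration" writes its stable nodes to a table, and the final query unions them. When every iteration writes to the same (templated) name, the earlier tables are overwritten and the union returns one copy of the final table per iteration.

```python
import sqlite3


def union_stable_nodes(stable_per_iteration, reuse_name):
    """Toy model of the final UNION ALL step (hypothetical, not Splink internals).

    stable_per_iteration: list of lists of (node_id, representative) tuples,
        one inner list per clustering iteration.
    reuse_name: if True, every iteration writes to the same table name
        (like a templated name); if False, each iteration gets a unique
        table name (like a physical name with a hash suffix).
    """
    con = sqlite3.connect(":memory:")
    table_names = []
    for i, nodes in enumerate(stable_per_iteration):
        name = "representatives_stable" if reuse_name else f"representatives_stable_{i}"
        # With a reused name, this drop discards the previous iteration's rows.
        con.execute(f"DROP TABLE IF EXISTS {name}")
        con.execute(f"CREATE TABLE {name} (node_id INTEGER, representative INTEGER)")
        con.executemany(f"INSERT INTO {name} VALUES (?, ?)", nodes)
        table_names.append(name)

    # Same shape as the SQL above: one SELECT per iteration, joined by UNION ALL.
    sql = " UNION ALL ".join(
        f"select node_id as node_id, representative as cluster_id from {t}"
        for t in table_names
    )
    rows = sorted(con.execute(sql).fetchall())
    con.close()
    return rows


# Three toy iterations, each producing a different set of stable nodes.
iterations = [[(1, 1), (2, 1)], [(3, 3)], [(4, 4), (5, 4)]]

physical = union_stable_nodes(iterations, reuse_name=False)
templated = union_stable_nodes(iterations, reuse_name=True)

print(physical)   # every node from every iteration, once each
print(templated)  # only the last iteration's nodes, repeated per iteration
```

With unique names the union contains all five nodes once. With the reused name, nodes 1 to 3 are gone and nodes 4 and 5 each appear three times, once per iteration.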
If we have debug_mode on when we cluster, we get incorrect results. Most nodes are missing, and the nodes that we do have each appear twice. See:
Splink 4.0.4
Not looked into what is causing this yet