NickCrews opened this issue 9 months ago
Yeah, I agree. I think (though I'm not sure, I haven't thought that hard about it) that this is the same as the infinity protection, which would lead me to think the CASE statement would be the best option (just for symmetry).
I have wondered before whether there's a third option (which could also potentially deal with the infinity CASE statements): having this logic on the Python side rather than in the SQL.

E.g. could we clamp the value of the bf_ columns to within some range (e.g. between 1e-100 and 1e100, or something), and issue a warning?
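To make that concrete, a minimal sketch of what I mean (this is not Splink's actual API; the helper name and bounds are made up for illustration):

```python
import warnings

# Illustrative bounds only, not values Splink actually uses
BF_MIN, BF_MAX = 1e-100, 1e100

def clamp_bayes_factor(bf: float) -> float:
    """Clamp a bayes factor into [BF_MIN, BF_MAX], warning when clamping occurs."""
    if bf < BF_MIN or bf > BF_MAX:
        warnings.warn(
            f"Bayes factor {bf!r} is outside [{BF_MIN:g}, {BF_MAX:g}] and will be "
            "clamped; this usually indicates an extreme m/u ratio for some level."
        )
        return min(max(bf, BF_MIN), BF_MAX)
    return bf
```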
Another thing I've wondered a bit about is whether we should move to the bf_ values being match weights in the SQL, which would then be additive. That would (possibly?) at least avoid the floating point issue.
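As a rough illustration of why that helps (made-up numbers, plain Python rather than SQL):

```python
import math

# Sixty comparison levels, each contributing a very small bayes factor
bayes_factors = [1e-6] * 60

# Multiplying them first underflows: the true product is 1e-360, which is
# below the smallest representable double, so it becomes exactly 0.0 and
# log2 of it blows up.
print(math.prod(bayes_factors))  # 0.0

# Summing match weights (log2 of each factor) instead never underflows;
# the result is just a very negative number.
print(sum(math.log2(bf) for bf in bayes_factors))  # ≈ -1195.9
```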
Hey, I'm attempting to test out this package and I am receiving this error. I can see the columns that make up this failing function call in the SQL, but what exactly would be the way to investigate and address this problem? This is occurring on my first test dataset, so I do not know how to address it, or if it is incorrect column profiling.
Hi there, I am also running into this issue, using the duckdb backend. Does anyone know a workaround, at least for the moment? I understand where the issue comes from based on the discussion above, but I do not know where to start with the suggested solutions, e.g. adding a tiny delta to prevent log(0).
A simple workaround is to duplicate a single row, giving the copy a new unique id: just one row that is an exact match to another. For me that was enough to fix it.
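For anyone else hitting this, roughly what I mean (a sketch assuming a pandas dataframe with a numeric `unique_id` column; adapt the column name and id scheme to your data):

```python
import pandas as pd

def add_exact_duplicate(df: pd.DataFrame, id_col: str = "unique_id") -> pd.DataFrame:
    """Append a copy of the first row with a fresh unique id, so the data
    contains at least one exact-match pair."""
    dup = df.iloc[[0]].copy()
    dup[id_col] = df[id_col].max() + 1  # assumes numeric ids
    return pd.concat([df, dup], ignore_index=True)
```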
It'd be great if someone could find a reprex for this issue. I've not actually encountered it myself - not doubting it exists - I suspect it happens with data of a type we don't usually encounter, possibly where certain values have no dupes (as vfrank66 alludes to).

If not a reprex, @JohnHenningsen, are you able to post a screenshot of the match weight charts? It's possible that provides some insights...

In any case, we should hopefully be able to get round to fixing this fairly soon, once Splink 4 is released (which has been absorbing most of our time for some months now).
@RobinL I'm just re-reading your original response, and yes, I think we should switch to combining match weights additively, otherwise I'm pretty sure we will run into floating point errors. So that might make this whole thing moot?
Yeah, agree, it's definitely the right solution. The problem is it's quite a big job because all the visualisations and dashboards expect data in the current format
Thanks for the helpful suggestions everyone! Unfortunately our cluster is down at the moment but I will try the simple workaround and help reproduce this issue as soon as possible.
To give a bit of context: aside from a few columns of categorical data, we are relying on a product description column to match records. That column contains a 3-10 word string, and we came up with some custom comparisons based on array_intersect. It is quite likely that there are no exact matches, as we have high variance in the data entry for this column.
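For illustration only (not our exact configuration; the column name `product_tokens` is made up, and the dictionary structure is written from memory, so check it against the Splink docs), the kind of comparison I mean:

```python
# A custom comparison over a pre-tokenised product description column,
# using DuckDB's array_intersect to count shared tokens.
product_comparison = {
    "output_column_name": "product_tokens",
    "comparison_levels": [
        {
            "sql_condition": "product_tokens_l IS NULL OR product_tokens_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "len(array_intersect(product_tokens_l, product_tokens_r)) >= 3",
            "label_for_charts": "3+ tokens in common",
        },
        {
            "sql_condition": "len(array_intersect(product_tokens_l, product_tokens_r)) >= 1",
            "label_for_charts": "1 or 2 tokens in common",
        },
        {"sql_condition": "ELSE", "label_for_charts": "No tokens in common"},
    ],
}
```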
What happens?
I think we need to add in some safeguards when calculating

`log2(*bayes_factors) AS match_weight`

I didn't include a reproducible example, but I think you can see how this would come about: if any of the bayes factors is zero, or they are all small enough that their product underflows to zero, then the arg passed to log2 will be zero.
Two options?
1. `LOG2(.00000000000000000000000001 + bf1*bf2*bf3...)`
2. `CASE WHEN <args> > 0 THEN LOG2(<args>) ELSE '-inf' END`

or similar.

To Reproduce
Sorry, if you really want one I can come up with something, but I think you might be better placed to come up with it than me.
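That said, the underlying failure can be seen with just DuckDB (a sketch, not a full Splink reprex; whether LOG2 of zero errors or returns -inf may depend on the DuckDB version):

```python
import duckdb

con = duckdb.connect()
# Stand-in for a product of bayes factors that has hit zero
con.sql("CREATE TABLE t AS SELECT 0.0 AS bf_product")

# Unguarded: LOG2 of a zero argument. In the DuckDB versions I've used this
# raises an out-of-range error; catch the exception in case a version
# returns -inf instead.
try:
    print(con.sql("SELECT LOG2(bf_product) AS match_weight FROM t").fetchall())
except Exception as e:
    print(e)

# The two guards discussed above evaluate without error:
print(con.sql("SELECT LOG2(1e-26 + bf_product) AS match_weight FROM t").fetchall())
print(
    con.sql(
        "SELECT CASE WHEN bf_product > 0 THEN LOG2(bf_product) "
        "ELSE CAST('-inf' AS DOUBLE) END AS match_weight FROM t"
    ).fetchall()
)
```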
OS: duckdb
Splink version: 4.9.11
Have you tried this on the latest `master` branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?