moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License
1.36k stars 148 forks source link

[Splink 4] DateComparison - 1st January matching not working #1859

Closed ADBond closed 3 months ago

ADBond commented 9 months ago

As noted in #1856 the 1st January matching level from DateComparison is not working properly.

Currently we construct the SQL directly and use it in a CustomLevel. The trouble with this is that we need to use the .name of a ColumnExpression (really a name_l), which needs to know about the expression's dialect. But we do not have a dialect when we construct the level (creators) with ComparisonCreator.create_comparison_levels().

My suggestion is a comparison level ExactMatchWithLiteralLevel which would output SQL like col_l = col_r AND col_l = 'some_literal' from say ExactMatchWithLiteralLevel("col", literal="some_literal", type="string"). This could be added as an option to ExactMatchLevel, but I think as that is quite a key level it would be best to avoid complicating it any further, particularly as this might be a little fiddly (dealing with different types, for example). I also think that ExactMatchWithLiteralLevel has some utility in general - I think it is relatively common for there to be 'special' values that we might want to handle separately.

RobinL commented 9 months ago

Been trying to get my head around this - not sure I fully understand, but is this a possible solution:

import splink.comparison_level_library as cll
from splink.column_expression import ColumnExpression

comparison_level = cll.And(
    cll.ExactMatchLevel(
        ColumnExpression("dob").try_parse_date(date_format="YYYY-MM-DD")
    ),
    cll.LiteralMatchLevel(
        ColumnExpression("dob").substr(6, 4),
        literal_value="01-01",
        side_of_comparison="left",
    ),
)

sql_condition = comparison_level.get_comparison_level("duckdb").as_dict()[
    "sql_condition"
]

print(sql_condition)
(
    NULLIF(try_strptime("dob_l", 'YYYY-MM-DD'), '') = NULLIF(try_strptime("dob_r", 'YYYY-MM-DD'), '')
)
AND (
    SUBSTRING("dob_l", 6, 4) = '01-01'
    AND SUBSTRING("dob_r", 6, 4) = '01-01'
)

Branch with the LiteralMatchLevel implementation here: https://github.com/moj-analytical-services/splink/compare/splink4_dev...literal_match_level

ADBond commented 9 months ago

Yeah, that's exactly the sort of thing I mean.

The key thing is being able to create a ComparisonLevelCreator that can delay construction until we know the dialect, which we can't do typically with a CustomLevel

RobinL commented 9 months ago

I think i'm reasonably happy with the above proposal - composition using AND seems fairly intuitive to me.

The only bit I'm not sure of is how to deal with the type of literals. At the moment the implementation assumes its a string, so you pass literal_value = "hello". Which is fine for a string, but not if you want a literal int or whatever.

This doesn't feel great:

# string
literal value = "'hello'". 

# date
literal value = "date'2020-01-01''"

# int
literal value = "1"

We could somehow introduce a dialect agnostic literal function like DateLiteral("2020-01-01") but that feels quite complicated.

Any thoughts?

RobinL commented 9 months ago

I guess it could also be built into LiteralMatchLevel as an arg

  cll.LiteralMatchLevel(
        ColumnExpression("dob").substr(6, 4),
        literal_value="01-01",
        side_of_comparison="left",
        literal_type='string'
    ),
ADBond commented 9 months ago

I guess it could also be built into LiteralMatchLevel as an arg

Yeah this is what I was envisaging - also has the advantage that we can handle dialect-dependence (of say a date literal) rather than force that on users. If down the line we find we frequently need this sort of thing we could always build out a DateLiteral("2020-01-01") with this functionality, but from a user POV I think just an argument is probably nicest

RobinL commented 3 months ago

Closed by https://github.com/moj-analytical-services/splink/pull/2281