Closed ADBond closed 3 months ago
Been trying to get my head around this - not sure I fully understand, but is this a possible solution:
import splink.comparison_level_library as cll
from splink.column_expression import ColumnExpression
comparison_level = cll.And(
cll.ExactMatchLevel(
ColumnExpression("dob").try_parse_date(date_format="YYYY-MM-DD")
),
cll.LiteralMatchLevel(
ColumnExpression("dob").substr(6, 4),
literal_value="01-01",
side_of_comparison="left",
),
)
sql_condition = comparison_level.get_comparison_level("duckdb").as_dict()[
"sql_condition"
]
print(sql_condition)
(
NULLIF(try_strptime("dob_l", 'YYYY-MM-DD'), '') = NULLIF(try_strptime("dob_r", 'YYYY-MM-DD'), '')
)
AND (
SUBSTRING("dob_l", 6, 4) = '01-01'
AND SUBSTRING("dob_r", 6, 4) = '01-01'
)
Branch with the LiteralMatchLevel implementation here: https://github.com/moj-analytical-services/splink/compare/splink4_dev...literal_match_level
Yeah, that's exactly the sort of thing I mean.
The key thing is being able to create a ComparisonLevelCreator
that can delay construction until we know the dialect, which we can't do typically with a CustomLevel
I think i'm reasonably happy with the above proposal - composition using AND
seems fairly intuitive to me.
The only bit I'm not sure of is how to deal with the type of literals. At the moment the implementation assumes its a string, so you pass literal_value = "hello". Which is fine for a string, but not if you want a literal int or whatever.
This doesn't feel great:
# string
literal value = "'hello'".
# date
literal value = "date'2020-01-01''"
# int
literal value = "1"
We could somehow introduce a dialect agnostic literal function like DateLiteral("2020-01-01") but that feels quite complicated.
Any thoughts?
I guess it could also be built into LiteralMatchLevel
as an arg
cll.LiteralMatchLevel(
ColumnExpression("dob").substr(6, 4),
literal_value="01-01",
side_of_comparison="left",
literal_type='string'
),
I guess it could also be built into
LiteralMatchLevel
as an arg
Yeah this is what I was envisaging - also has the advantage that we can handle dialect-dependence (of say a date literal) rather than force that on users. If down the line we find we frequently need this sort of thing we could always build out a DateLiteral("2020-01-01")
with this functionality, but from a user POV I think just an argument is probably nicest
As noted in #1856 the 1st January matching level from
DateComparison
is not working properly.Currently we construct the SQL directly and use it in a
CustomLevel
. The trouble with this is that we need to use the.name
of aColumnExpression
(really aname_l
), which needs to know about the expression's dialect. But we do not have a dialect when we construct the level (creators) withComparisonCreator.create_comparison_levels()
.My suggestion is a comparison level
ExactMatchWithLiteralLevel
which would output SQL likecol_l = col_r AND col_l = 'some_literal'
from sayExactMatchWithLiteralLevel("col", literal="some_literal", type="string")
. This could be added as an option toExactMatchLevel
, but I think as that is quite a key level it would be best to avoid complicating it any further, particularly as this might be a little fiddly (dealing with different types, for example). I also think thatExactMatchWithLiteralLevel
has some utility in general - I think it is relatively common for there to be 'special' values that we might want to handle separately.