Open RobinL opened 1 year ago
Other things that have come to mind:

- `partial_match_weight_chart`
- `ComparisonLevels` should be called `SimilarityCategories` (though I know we've raised this before here)

Do we think there may be a way of getting the Linker object to 'deal with' the backends automatically?
Suppose we instantiate all linkers like `linker = Linker(df, dialect='spark')`.
I’m wondering two things really:
First, could the comparison level library be generic, so the user would write:

```python
from splink.comparison_level_library import else_level
```

This is passed to the Linker and, at the point of usage, the Linker sets the dialect on the level to spark.
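A minimal sketch of what that could look like (all class, method and attribute names below are hypothetical, not existing Splink API):

```python
class DialectAgnosticLevel:
    """A level holding one SQL template per dialect; nothing is rendered
    until a dialect is known."""

    def __init__(self, sql_templates: dict, label: str):
        self.sql_templates = sql_templates  # e.g. {"spark": "...", "duckdb": "..."}
        self.label = label

    def sql_for_dialect(self, dialect: str) -> str:
        try:
            return self.sql_templates[dialect]
        except KeyError:
            raise ValueError(f"{self.label!r} has no SQL for dialect {dialect!r}")


# the generic library would export levels like this
else_level = DialectAgnosticLevel({"spark": "ELSE", "duckdb": "ELSE"}, "else level")


class Linker:
    def __init__(self, df, settings, dialect: str):
        self.df = df
        self.settings = settings
        self.dialect = dialect

    def _level_sql(self, level: DialectAgnosticLevel) -> str:
        # the Linker, not the user, resolves the dialect at the point of use
        return level.sql_for_dialect(self.dialect)
```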
Second, how do we implement generic functions like `block_on` so that they can be used internally? At the moment, you can't easily use `block_on` internally because you need to choose the correct one. Is there a generic function we could create that would convert a generic import (e.g. `from splink.blocking_rule_library import block_on`) into the dialect-specific one (e.g. `from splink.duckdb.blocking_rule_library import block_on`)?
Maybe use something like:

```python
module_mapping = {
    "presto": "splink.athena.blocking_rule_library",
    "duckdb": "splink.duckdb.blocking_rule_library",
    "postgres": "splink.postgres.blocking_rule_library",
    "spark": "splink.spark.blocking_rule_library",
    "sqlite": "splink.sqlite.blocking_rule_library",
}
```
Better (thanks Andy!):

```python
from importlib import import_module

brl_module = import_module(f"splink.{dialect}.blocking_rule_library")
block_on = getattr(brl_module, "block_on")
```
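Wrapping that in a function, a generic `block_on` could be a thin dispatcher that resolves the dialect-specific implementation at call time (a sketch only; `generic_block_on` and its signature are invented for illustration):

```python
from importlib import import_module


def generic_block_on(dialect: str, *args, **kwargs):
    # Look up the dialect-specific blocking rule library at call time and
    # delegate to its block_on, so internal code never has to hard-code a
    # dialect-specific import.
    brl_module = import_module(f"splink.{dialect}.blocking_rule_library")
    block_on = getattr(brl_module, "block_on")
    return block_on(*args, **kwargs)
```

Internal code could then call `generic_block_on(dialect, ...)` with whatever dialect the linker knows it is running against, without knowing which backend is in play.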
See https://github.com/moj-analytical-services/splink/issues/1669 for some notes on changes to the settings object.
Drop support for the `valid_regex_string` argument in our exact match comparisons.
This would not necessarily incorporate an API-breaking change (but probably would/should), but I would be keen to 'do away with' `settings_dict`, `comparison_dict` and `comparison_level_dict`.
What I mean is that:

- Whenever settings, comparisons or comparison levels are handled (by users or internally) they should be a `Settings` object, or `Comparison`, or `ComparisonLevel`. At the minute internally things are a bit of a mix of these things, or possibly dicts, which I think tends to make the code more confusing, and also can lead to duplication (see e.g. `_settings_obj` and `settings_dict` in `Linker`).
- If I construct a `Settings` object directly, then as I am typing in the arguments my tooling takes care of what these are (and the appropriate types), so I don't have to look up the dictionary spec (is it `unique_id_column`? what's that parameter called again?).
- As well as potentially easier creation, it means we can potentially give users access to helpful methods on these objects (e.g. a nicer way to get a particular comparison level instead of something like `settings["comparisons"][2]["comparison_levels"][3]`), which they don't have if they have written things as `dict`s. We could of course just make functions for these, but I think having these available as methods means it is right there at the user's fingertips (and easier to explore using autocomplete!).
I do understand the value of things being represented simply though (although please let me know if there are aspects I'm overlooking), so I don't suggest doing away with allowing dicts as input/output (especially for saving, sharing etc). But I would like to see e.g. `Settings`
as the 'single source of truth', with serialisation/deserialisation to json (or python dicts) as useful wrappers, which I think should make things like validation easier (but I haven't worked on that part of the code much so correct me if I'm wrong @ThomasHepworth!). This could even entail using something like marshmallow (admittedly another [not strictly necessary] dependency, but a lightweight one).
Others may have different viewpoints on the more 'philosophical' aspects, but I do think at least keeping things as consistent types internally will make the code a lot easier to deal with.
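To make this concrete, here is a minimal sketch (using plain dataclasses; the field names echo the existing settings dict, but none of this is the real Splink API) of typed objects where dicts/JSON remain available purely as a serialisation layer:

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ComparisonLevel:
    sql_condition: str
    label_for_charts: str = ""


@dataclass
class Comparison:
    output_column_name: str
    comparison_levels: list[ComparisonLevel] = field(default_factory=list)


@dataclass
class Settings:
    link_type: str
    unique_id_column_name: str = "unique_id"
    comparisons: list[Comparison] = field(default_factory=list)

    def comparison(self, output_column_name: str) -> Comparison:
        # nicer access than settings["comparisons"][2]["comparison_levels"][3]
        return next(
            c for c in self.comparisons if c.output_column_name == output_column_name
        )

    def as_dict(self) -> dict:
        # serialisation stays available as a thin wrapper
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "Settings":
        comparisons = [
            Comparison(
                output_column_name=c["output_column_name"],
                comparison_levels=[
                    ComparisonLevel(**lvl) for lvl in c["comparison_levels"]
                ],
            )
            for c in d.get("comparisons", [])
        ]
        return cls(
            link_type=d["link_type"],
            unique_id_column_name=d.get("unique_id_column_name", "unique_id"),
            comparisons=comparisons,
        )
```

Validation (whether via marshmallow, pydantic or hand-rolled checks) would then live on these classes rather than on raw dicts.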
Related: #1669, #1679, some of the discussion in #1055
Separate point:
This is somewhat related to #1679, and also #1678, but I decided it probably makes the most sense to mention it here. Apologies if this is mentioned elsewhere, but I wonder if it makes sense to strip out all the SQL-execution stuff into an object `DB_API`, which has versions for each dialect. Any backend-specific arguments then go through this, the linker has to go through this to do anything SQL, and it can focus purely on constructing the SQL logic that needs to run (or really on delegating that to wherever it needs to be).
Sample code would be something like:

```python
from splink import DB_API
from splink.dialects import spark_dialect

...

linker = Linker(df, settings, DB_API(spark_dialect, spark=spark, repartition_after_blocking=True))
```
Any actual backend action happens as e.g. (although we don't have to do this - see this comment though):

```python
linker.backend._execute_sql_pipeline()
linker.backend.compute_tf_table("name")
```

as well as any backend-specific methods:

```python
linker.backend._register_udfs_from_jar()
```
The settings, and the `Linker` itself, are completely backend-agnostic, and everything dialect-dependent happens via the `DB_API`.
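A rough sketch of the shape that split could take (class and method names are illustrative, not a concrete API proposal):

```python
from abc import ABC, abstractmethod


class DBAPI(ABC):
    """All SQL execution and backend-specific configuration lives here."""

    @abstractmethod
    def execute_sql(self, sql: str, output_table_name: str):
        """Run a SQL string on the backend and register the result."""


class SparkAPI(DBAPI):
    def __init__(self, spark_session, repartition_after_blocking=False):
        self.spark = spark_session
        self.repartition_after_blocking = repartition_after_blocking

    def execute_sql(self, sql: str, output_table_name: str):
        df = self.spark.sql(sql)
        df.createOrReplaceTempView(output_table_name)
        return df

    def register_udfs_from_jar(self, jar_path: str):
        # backend-specific extras live on the backend object, not the Linker
        ...


class Linker:
    def __init__(self, df, settings, backend: DBAPI):
        self.settings = settings
        self.backend = backend  # the only route to anything SQL
```

The `Linker` then only ever asks `self.backend` to execute SQL it has constructed, and never imports anything dialect-specific itself.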
re "'do away with' settings_dict, comparison_dict and comparison_level_dict."
I think I pretty much agree with all of what you say
I'm still pretty wedded to the concept of a formal schema for serialisation/deserialisation as the kind of 'backbone declarative API' around which Splink is designed. But like you say, I don't think the existence of it (potentially completely behind the scenes) is incompatible with any of these ideas
Here's an idea I'm not necessarily super convinced by, but think it might at least be worth kicking around a little. Thinking about separating comparison levels from how users create them, the possibility of exploiting metadata associated with particular types of comparison level, and tying together the idea of a completely declarative API with the various comparison-level-creator-helper-functions, I wonder about a change to the comparison level portion of the settings schema, along the lines of:
"comparison_levels": [
{"type": "null_level", "column_name": "full_name"},
{"type": "exact_match_level", "column_name": "full_name"},
{"type": "levenshtein_level", "column_name": "full_name", "threshold": 3},
{"type": "custom_level", "sql_expression": "some_fuzzy_func(full_name_l, full_name_r)"},
{"type": "else_level"},
]
The benefit is that this then maps directly to the objects we are dealing with (and may wish to do some magic with), and aligns the approach of 'writing json and feeding it to Splink' with 'writing Python to configure Splink with our convenience functions'.
Of course the JSON schema then gets quite a bit more complicated this way. It also further abstracts the details of the actual comparison levels away from the user, although I am not sure at the moment whether or not that is a bad thing. As this could potentially lead to drift in the SQL generated (if we alter the underlying functions for some reason), it might be best accompanied by some kind of versioning on the JSON-generation side.
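For illustration, deserialising those typed entries could be a simple registry lookup from the "type" key to a creator function (a sketch only; the creators here return bare dicts rather than real `ComparisonLevel` objects, and the function names are invented):

```python
def null_level(column_name):
    return {"sql_condition": f"{column_name}_l IS NULL OR {column_name}_r IS NULL"}


def levenshtein_level(column_name, threshold):
    return {
        "sql_condition": f"levenshtein({column_name}_l, {column_name}_r) <= {threshold}"
    }


LEVEL_CREATORS = {
    "null_level": null_level,
    "levenshtein_level": levenshtein_level,
    # ... exact_match_level, custom_level, else_level, etc.
}


def level_from_dict(level_dict: dict):
    # dispatch on the "type" key; remaining keys become keyword arguments
    kwargs = dict(level_dict)
    creator = LEVEL_CREATORS[kwargs.pop("type")]
    return creator(**kwargs)
```

The Python convenience functions could share exactly this machinery, which is what would keep the two ways of writing a config aligned.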
Been thinking a little more about the 'backbone declarative API', and how it relates to the 'write everything in Python' approach, and taking it a little more seriously. I guess part of the difficulty here is that currently the declarative side only captures part of the story, and you still need to string together a bunch of Python methods to run your linking.
But say we had a method where we define everything in JSON (although I am aware of the complications mentioned in #1288). Combining this with a formal schema means that, with tooling, (almost) fully authoring a link job in JSON might be straightforward, and perhaps even easier (though perhaps that is a stretch). A couple of potential advantages though:
Here is a sketch of an example. I have written it in YAML as I think that is nicer for reading/writing, and with a concrete schema it seems like we could support that along with JSON easily enough. This particular sketch incorporates this idea, including training config, as well as including further metadata, a description of where the data lives, and some potential alterations to the settings schema, but none of these are crucial to the overall concept. In this, however, I have assumed that the fundamental way Splink works remains the same, so all of these changes are purely in terms of interface / cosmetics. This script (non-functional) then shows how you might run the job specified by the YAML.
I am not proposing that this would replace the existing API, but that it would be available as an alternative. Both approaches would be tied together and derived from the same source, as mentioned a bit in this comment.
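As a rough illustration, under the current Splink 3 API the runner script could look something like this (the spec's `data`/`settings`/`training` keys and the `run_job` function are invented for the example):

```python
import pandas as pd
import yaml  # assumes PyYAML is available

from splink.duckdb.linker import DuckDBLinker


def run_job(path: str):
    # Load the YAML job spec; the key names used below are illustrative,
    # the real layout would be whatever schema we formally define.
    with open(path) as f:
        job = yaml.safe_load(f)

    df = pd.read_csv(job["data"]["input_path"])
    linker = DuckDBLinker(df, job["settings"])

    for blocking_rule in job["training"]["em_blocking_rules"]:
        linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

    return linker.predict()
```

The point is that everything the script does is driven by the spec, so the YAML (or JSON) file becomes the single artefact that defines the link job.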
This issue is just an initial dumping ground for backwards-incompatible changes we might like to make in a prospective Splink 4 release.
Note: Splink 4 is not expected to be a re-write, just an opportunity to improve the API in a non-backwards-compatible way.
For example, https://github.com/moj-analytical-services/splink/issues/1055: Drop support for instantiating a linker object without providing a settings dictionary
Everything here is just an idea! There are of course trade-offs between improving clarity/ease of use and breaking old code when the version is updated. Definitely not suggesting we actually DO all these things!