moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

Splink 4: Backwards-incompatible API changes #1644

Open RobinL opened 1 year ago

RobinL commented 1 year ago

This issue is just an initial dumping ground for backwards-incompatible changes we might like to make in a prospective Splink 4 release.

note: Splink 4 is not expected to be a re-write, just an opportunity to improve the API in a backwards-incompatible way.

For example, https://github.com/moj-analytical-services/splink/issues/1055: Drop support for instantiating a linker object without providing a settings dictionary

Everything here is just an idea! There are of course trade-offs between improving clarity/ease of use and breaking old code when the version is updated. Definitely not suggesting we actually DO all these things!

RobinL commented 1 year ago

Other things that have come to mind:

RobinL commented 1 year ago

Do we think there may be a way of getting the Linker object to ‘deal with’ the backends automatically?

Suppose we instantiate all linkers like linker = Linker(df, dialect='spark'). I’m wondering two things really:

so the user would write from splink.comparison_level_library import else_level

This is passed to the Linker and, at the point of usage, the Linker sets the dialect on the level to spark.
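
One possible shape for this (purely illustrative - none of these names are the current Splink API) is a dialect-agnostic level object that the Linker binds to its own dialect at the point of use:

# Hypothetical sketch only - classes and attributes here are illustrative.

class GenericComparisonLevel:
    """A comparison level defined with no reference to any SQL dialect."""

    def __init__(self, name, sql_template):
        self.name = name
        self.sql_template = sql_template
        self.sql_dialect = None  # deliberately unset until the Linker binds it

    def bind_dialect(self, dialect):
        self.sql_dialect = dialect
        return self


else_level = GenericComparisonLevel("else", "ELSE")


class Linker:
    def __init__(self, df, settings, dialect="spark"):
        self.dialect = dialect
        # At the point of use, the Linker stamps its dialect onto every
        # generic level it was given, so the user never has to import a
        # dialect-specific version of else_level.
        for level in settings.get("comparison_levels", []):
            level.bind_dialect(dialect)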

How do we implement generic functions like block_on so that they can be used internally? At the moment, you can't easily use block_on internally because you need to choose the correct dialect-specific version. Is there a generic mechanism we could create that would convert the generic function into the dialect-specific one (e.g. from splink.blocking_rule_library import block_on into from splink.duckdb.blocking_rule_library import block_on)?

Maybe use something like:

module_mapping = {
    "presto": "splink.athena.blocking_rule_library",
    "duckdb": "splink.duckdb.blocking_rule_library",
    "postgres": "splink.postgres.blocking_rule_library",
    "spark": "splink.spark.blocking_rule_library",
    "sqlite": "splink.sqlite.blocking_rule_library",
}

Better (thanks andy!):

from importlib import import_module

brl_module = import_module(f"splink.{dialect}.blocking_rule_library")
block_on = getattr(brl_module, "block_on")
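
For internal use, that lookup could be wrapped in a small helper keyed on the dialect (a sketch, assuming the Splink 3 module layout splink.<dialect>.blocking_rule_library; the helper name is made up):

from functools import lru_cache
from importlib import import_module


@lru_cache(maxsize=None)
def block_on_for_dialect(dialect):
    # Resolve the dialect-specific block_on at runtime so that internal
    # code can stay dialect-agnostic and just pass the active dialect in.
    brl_module = import_module(f"splink.{dialect}.blocking_rule_library")
    return getattr(brl_module, "block_on")

# e.g. block_on = block_on_for_dialect("duckdb")
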
ThomasHepworth commented 1 year ago

See https://github.com/moj-analytical-services/splink/issues/1669 for some notes on changes to the settings object.

ThomasHepworth commented 1 year ago

Drop support for the valid_regex_string argument in our exact match comparisons.

ADBond commented 1 year ago

This would not necessarily entail an API-breaking change (though it probably would/should), but I would be keen to 'do away with' settings_dict, comparison_dict and comparison_level_dict.

What I mean is that:

As well as potentially easier creation, it means we can potentially give users access to helpful methods on these objects (e.g. a nicer way to get a particular comparison level instead of something like settings["comparisons"][2]["comparison_levels"][3]), which they don't have if they have written things as dicts. We could of course just make functions for these, but I think having them available as methods means it is right there at the user's fingertips (and easier to explore using autocomplete!).
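
As a sketch of the kind of convenience this buys (the class names and methods here are hypothetical, not the existing Splink objects):

# Purely illustrative - hypothetical Settings/Comparison/ComparisonLevel classes.

class ComparisonLevel:
    def __init__(self, label, sql_condition):
        self.label = label
        self.sql_condition = sql_condition


class Comparison:
    def __init__(self, output_column_name, comparison_levels):
        self.output_column_name = output_column_name
        self.comparison_levels = comparison_levels

    def level(self, label):
        # look a level up by label rather than by positional index
        return next(cl for cl in self.comparison_levels if cl.label == label)


class Settings:
    def __init__(self, comparisons):
        self.comparisons = comparisons

    def comparison(self, output_column_name):
        return next(
            c for c in self.comparisons if c.output_column_name == output_column_name
        )

# settings.comparison("surname").level("levenshtein <= 3")
# rather than settings["comparisons"][2]["comparison_levels"][3]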

I do understand the value of things being represented simply though (although please let me know if there are aspects I'm overlooking), so I don't suggest doing away with allowing dicts as input/output (especially for saving, sharing etc). But I would like to see e.g. Settings as the 'single source of truth', with serialisation/deserialisation to json (or python dicts) as useful wrappers, which I think should make things like validation easier (but I haven't worked on that part of the code much so correct me if I'm wrong @ThomasHepworth!). This could even entail using something like marshmallow (admittedly another [not strictly necessary] dependency, but a lightweight one).

Others may have different viewpoints on the more 'philosophical' aspects, but I do think at least keeping things as consistent types internally will make the code a lot easier to deal with.

Related: #1669, #1679, some of the discussion in #1055

ADBond commented 1 year ago

Separate point:

This is somewhat related to #1679, and also #1678, but I decided it probably makes the most sense to mention here. Apologies if this is mentioned elsewhere, but I wonder if it makes sense to strip out all the SQL-execution stuff into an object DB_API, which has versions for each dialect. Any backend-specific arguments then go through this, the linker has to go through this to do anything SQL, and the linker can focus purely on constructing the SQL logic that needs to run (or really on delegating that to wherever needs be). Sample code would be something like:

from splink import DB_API
from splink.dialects import spark_dialect

...
linker = Linker(df, settings, DB_API(spark_dialect, spark=spark, repartition_after_blocking=True))

any actual backend action happens as e.g. (although we don't have to do this - see this comment though):

linker.backend._execute_sql_pipeline()
linker.backend.compute_tf_table("name")

as well as any backend-specific methods:

linker.backend._register_udfs_from_jar()

The settings, and the Linker itself, are completely backend-agnostic, and everything dialect-dependent happens via the DB_API.
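
A rough sketch of the shape that separation could take (class and method names here are illustrative only, not a concrete proposal):

# Illustrative sketch - not the real Splink API.

class DBAPI:
    """Owns SQL execution and anything else backend-specific."""

    sql_dialect = None

    def execute_sql(self, sql, output_table_name):
        raise NotImplementedError


class SparkAPI(DBAPI):
    sql_dialect = "spark"

    def __init__(self, spark, repartition_after_blocking=False):
        self.spark = spark
        self.repartition_after_blocking = repartition_after_blocking

    def execute_sql(self, sql, output_table_name):
        df = self.spark.sql(sql)
        df.createOrReplaceTempView(output_table_name)
        return df

    def register_udfs_from_jar(self):
        ...  # Spark-only behaviour lives here, not on the Linker


class Linker:
    def __init__(self, df, settings, backend):
        # The Linker only constructs SQL; the backend object runs it.
        self.df = df
        self.settings = settings
        self.backend = backend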

RobinL commented 1 year ago

re "'do away with' settings_dict, comparison_dict and comparison_level_dict."

I think I pretty much agree with all of what you say

I'm still pretty wedded to the concept of a formal schema for serialisation/deserialisation as the kind of 'backbone declarative API' around which Splink is designed. But like you say, I don't think the existence of it (potentially completely behind the scenes) is incompatible with any of these ideas

ADBond commented 1 year ago

Here's an idea I'm not necessarily super convinced by, but I think it might at least be worth kicking around a little. Thinking about separating comparison levels from how users create them, the possibility of exploiting metadata associated with particular types of comparison level, and tying together the idea of a completely declarative API with the various comparison-level-creator helper functions, I wonder about a change to the comparison level portion of the settings schema, along the lines of:

"comparison_levels": [
    {"type": "null_level", "column_name": "full_name"},
    {"type": "exact_match_level", "column_name": "full_name"},
    {"type": "levenshtein_level", "column_name": "full_name", "threshold": 3},
    {"type": "custom_level", "sql_expression": "some_fuzzy_func(full_name_l, full_name_r)"},
    {"type": "else_level"},
]

The benefit is that this then maps directly to the objects we are dealing with (and may wish to do some magic with), and aligns the approach of 'writing json and feeding it to Splink' with 'writing Python to configure Splink with our convenience functions'.

Of course the json schema then gets quite a bit more complicated this way. It also further abstracts the details of the actual comparison levels away from the user, although I am not sure at the moment whether or not that is a bad thing. As this could potentially lead to drift in the SQL generated (if we alter the underlying functions for some reason), it might be best accompanied by some kind of versioning on the json-generation side.
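
One way of deserialising such a schema (sketch only - the registry and level classes below are stand-ins, not real Splink objects) is a factory keyed on the "type" field, so that the json maps one-to-one onto the comparison-level objects:

LEVEL_TYPES = {}


def register_level(type_name):
    def decorator(cls):
        LEVEL_TYPES[type_name] = cls
        return cls
    return decorator


@register_level("null_level")
class NullLevel:
    def __init__(self, column_name):
        self.column_name = column_name


@register_level("levenshtein_level")
class LevenshteinLevel:
    def __init__(self, column_name, threshold):
        self.column_name = column_name
        self.threshold = threshold


def level_from_dict(level_dict):
    # pop the "type" key and use it to pick the right class;
    # the remaining keys become constructor arguments
    kwargs = dict(level_dict)
    cls = LEVEL_TYPES[kwargs.pop("type")]
    return cls(**kwargs)

# level_from_dict({"type": "levenshtein_level", "column_name": "full_name", "threshold": 3})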

ADBond commented 1 year ago

Been thinking a little more about the backbone declarative API, how it relates to the 'write everything in python' approach, and taking it a little more seriously. I guess part of the difficulty here is that currently the declarative side only captures part of the story, and you still need to string together a bunch of python methods to run your linking. But say we had a method where we define everything in JSON (although I am aware of the complications mentioned in #1288). Combining this with a formal schema and some tooling means that (almost) fully authoring a link job in JSON might be straightforward, and perhaps even easier (though perhaps that is a stretch). A couple of potential advantages though:

Here is a sketch of an example. I have written it in YAML as I think that is nicer for reading/writing, and with a concrete schema it seems like we could support that alongside JSON easily enough. This particular sketch incorporates this idea, including training config, as well as including further metadata, a description of where the data lives, and some potential alterations to the settings schema, but none of these are crucial to the overall concept. In this, however, I have assumed that the fundamental way Splink works remains the same, so all of these changes are purely in terms of interface/cosmetics. This (non-functional) script then shows how you might run the job specified by the YAML.
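
As an illustrative stand-in for that script (all of the config keys, the dialect-based Linker constructor and read_input() are assumptions, not an agreed schema), the runner might look roughly like:

import yaml  # pyyaml

with open("link_job.yaml") as f:
    job = yaml.safe_load(f)

df = read_input(job["input"]["path"])  # hypothetical loader for the input data
linker = Linker(df, job["settings"], dialect=job["backend"])

# training steps driven by config rather than hand-written python
for rule in job.get("training", {}).get("em_blocking_rules", []):
    linker.estimate_parameters_using_expectation_maximisation(rule)

predictions = linker.predict(threshold_match_probability=job.get("threshold", 0.0))
predictions.as_pandas_dataframe().to_csv(job["output"]["path"], index=False)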

I am not proposing that this would replace the existing API, but that it would be available as an alternative. Both approaches would be tied together and derived from the same source, as mentioned a bit in this comment.