moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License
1.17k stars 132 forks source link

[MAINT] `linker.py` cleanin/refactoring #1678

Open ThomasHepworth opened 8 months ago

ThomasHepworth commented 8 months ago

Is your proposal related to a problem?

Tags onto this conversation about cleaning up our settings class - https://github.com/moj-analytical-services/splink/issues/1669.

I think we should sit down and work out how we can start to clean up linker.py, which has grown to nearly 4000 lines at this point.

Admittedly, a lot of this comes from docstrings we've put in, however, it feels as if a lot of the code could be migrated away to a new home and then inherited.

Some quick thoughts on what we might want to change/remove:

My view is that the linker class should only contain methods we wish to call and everything else should be housed in their own separate, but sensible areas.

This should make everything easier to find, test and add to going forward.

ADBond commented 8 months ago

Am very much in favour of this. At a bare minimum I think splitting up the module into sensible grouped chunks would be handy, which could be achieved as you say via mixins:

# splink.linker.py
from splink.linker.charts import _LinkerChartsMixin
from splink.linker.charts import _LinkerTrainingMixin

class Linker(_LinkerChartsMixin, _LinkerTrainingMixin):
    ...

However I am interested also in somewhat breaking up the Linker object itself, by nesting methods under related classes which live as attributes on the linker (something me and @zslade discussed a little in regards to #1538). What I mean is something like:

# splink.linker.charts.py

class LinkerCharts:
    def __init__(self, linker: Linker):
        ...

    def unlinkables_chart(self, x_col='match_weight', source_dataset=None, as_dict=False):
         ...

and

# splink.linker.py
from splink.linker.charts import LinkerCharts

class Linker:
    def __init__(self, ...):
        self.charts = LinkerCharts(self)

so that we instead have from a user POV

linker.charts.unlinkables_chart()
# instead of the old way:
linker.unlinkables_chart()

The only things that live directly on the linker are the 'behind-the-scenes' stuff that these other bits of functionality require.

I realise this means some code gets a bit longer, but if you are like me and have difficulty remembering the names of some methods, it allows a way to more quickly narrow down your search when using autocomplete (typing linker.training. will leave you with a lot fewer options than the load you get with the current linker.). It also means:

Of course this is quite a severe backwards-incompatibility #1644, and might require a bit of careful planning about which attribute-classes know about one another to keep functionality. Also have some thoughts re: the SQL execution part, but will post that separately

ADBond commented 8 months ago

Tangentially related, but I think there might also be cases where we want to leverage some of the functionality of the Linker without necessarily needing a 'full one'. For example, in relation to working on #1001 I have found it useful to read in 'edges' and 'clusters' tables, and create a cluster_studio_dashboard. Currently I have to pass in some dummy data to create the linker, as well as a settings dict (which I am not necessarily interested in for my case). I feel like this sort of use-case may not be too unusual, so maybe there is a way to leverage this ☝️ proposed modularity to make things like that easier. Also thinking of things like profiling your data without needing a settings dictionary to do so, without keeping complicated logic / validation in the linker (in the spirit of #1055).

i.e. for the right choice of submodules maybe we can make something like a LinkerProfiling which only has the profiling functionality, without needing settings.

Might all be too complicated for fairly niche use-cases, but perhaps worth thinking about whether or not we can naturally structure things in such a way that we get this sort of thing more-or-less 'for free'

ADBond commented 8 months ago

I've made a mini-working sketch of the database api object suggested in this comment, which would take over some of the functionality currently tackled directly by the linker. I've just focussed on what seems to me the 'main' SQL execution methods (_execute_sql_against_backend, _log_and_run_sql_execution, and _run_sql_execution), and there are a few ideas here - mostly based around the fact that I get a bit confused with the current implementation:

So there are a couple of ideas here - there is certainly a bit of complexity going on in the base class, but my aim was to try and 'soak up' a lot of complexity there, and make things simpler for subclassers / clients of subclasses. Specifically:

There may well be additional complications when it comes to a more full implementation, but hopefully at least some of the ideas here might prove useful.