Open ThomasHepworth opened 8 months ago
Am very much in favour of this. At a bare minimum I think splitting up the module into sensible grouped chunks would be handy, which could be achieved as you say via mixins:
# splink.linker.py
from splink.linker.charts import _LinkerChartsMixin
from splink.linker.charts import _LinkerTrainingMixin
class Linker(_LinkerChartsMixin, _LinkerTrainingMixin):
...
However I am interested also in somewhat breaking up the Linker
object itself, by nesting methods under related classes which live as attributes on the linker (something me and @zslade discussed a little in regards to #1538). What I mean is something like:
# splink.linker.charts.py
class LinkerCharts:
def __init__(self, linker: Linker):
...
def unlinkables_chart(self, x_col='match_weight', source_dataset=None, as_dict=False):
...
and
# splink.linker.py
from splink.linker.charts import LinkerCharts
class Linker:
def __init__(self, ...):
self.charts = LinkerCharts(self)
so that we instead have from a user POV
linker.charts.unlinkables_chart()
# instead of the old way:
linker.unlinkables_chart()
The only things that live directly on the linker are the 'behind-the-scenes' stuff that these other bits of functionality require.
I realise this means some code gets a bit longer, but if you are like me and have difficulty remembering the names of some methods, it allows a way to more quickly narrow down your search when using autocomplete (typing linker.training.
will leave you with a lot fewer options than the load you get with the current linker.
).
It also means:
Linker
linker.charts.available_charts()
or w/e Of course this is quite a severe backwards-incompatibility #1644, and might require a bit of careful planning about which attribute-classes know about one another to keep functionality. Also have some thoughts re: the SQL execution part, but will post that separately
Tangentially related, but I think there might also be cases where we want to leverage some of the functionality of the Linker
without necessarily needing a 'full one'.
For example, in relation to working on #1001 I have found it useful to read in 'edges' and 'clusters' tables, and create a cluster_studio_dashboard
. Currently I have to pass in some dummy data to create the linker, as well as a settings dict (which I am not necessarily interested in for my case). I feel like this sort of use-case may not be too unusual, so maybe there is a way to leverage this ☝️ proposed modularity to make things like that easier.
Also thinking of things like profiling your data without needing a settings dictionary to do so, without keeping complicated logic / validation in the linker (in the spirit of #1055).
i.e. for the right choice of submodules maybe we can make something like a LinkerProfiling
which only has the profiling functionality, without needing settings
.
Might all be too complicated for fairly niche use-cases, but perhaps worth thinking about whether or not we can naturally structure things in such a way that we get this sort of thing more-or-less 'for free'
I've made a mini-working sketch of the database api object suggested in this comment, which would take over some of the functionality currently tackled directly by the linker.
I've just focussed on what seems to me the 'main' SQL execution methods (_execute_sql_against_backend
, _log_and_run_sql_execution
, and _run_sql_execution
), and there are a few ideas here - mostly based around the fact that I get a bit confused with the current implementation:
So there are a couple of ideas here - there is certainly a bit of complexity going on in the base class, but my aim was to try and 'soak up' a lot of complexity there, and make things simpler for subclassers / clients of subclasses. Specifically:
_log_and_run_sql_execution
now happens automatically with some decorator/metaclass stuff, as this method is directly tied to _run_sql_execution
, to hopefully make it clearer which methods should be called_execute_sql_against_backend
is not directly overridden, but broken up into setup/cleanup functions which are - so that subclassers don't have to directly touch any of the methods that actually run SQL, partly so there is no confusion between the 'bare' method and the 'wrapped' version above [not sure if this plays well with athena]db_api("duckdb")("dbfile.db")
- the double brackets is not so nice, but is mainly done to allow help with tooltips for the varying arguments you will get depending on your flavour.There may well be additional complications when it comes to a more full implementation, but hopefully at least some of the ideas here might prove useful.
Is your proposal related to a problem?
Tags onto this conversation about cleaning up our settings class - https://github.com/moj-analytical-services/splink/issues/1669.
I think we should sit down and work out how we can start to clean up linker.py, which has grown to nearly 4000 lines at this point.
Admittedly, a lot of this comes from docstrings we've put in, however, it feels as if a lot of the code could be migrated away to a new home and then inherited.
Some quick thoughts on what we might want to change/remove:
My view is that the linker class should only contain methods we wish to call and everything else should be housed in their own separate, but sensible areas.
This should make everything easier to find, test and add to going forward.