moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

Allow `__splink__df_concat` to be computed without `linker` #2142

Open RobinL opened 2 months ago

RobinL commented 2 months ago

We plan to let users do some forms of exploratory analysis without needing to create a linker, such as `profile_columns` and various types of blocking analysis (e.g. #2136).

But this means that `__splink__df_concat` needs to be computed without the linker.

At the moment, this requires a lot of code that's confusing to read and will be repetitive:

https://github.com/moj-analytical-services/splink/blob/66ec54f0f114cf3eda20ea7fe9e05ccfff2c584c/splink/profile_data.py#L237-L267

This can be addressed by removing the need for a linker to compute `__splink__df_concat`, giving us reusable code for `profile_columns`, blocking analysis, etc.

RobinL commented 2 months ago

We want `vertically_concatenate_sql` to be modified so that it doesn't take a linker as an argument, but functions like `vertically_concatenate.compute_df_concat` should still take the linker as an argument.
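
To make the proposed split concrete, here is a rough sketch of the shape it could take. It is not Splink's actual code: `InputTable`, `SketchLinker`, the `input_tables` attribute and the function signatures are all hypothetical, and `compute_df_concat` here only returns the SQL string rather than executing it against a backend.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InputTable:
    """Hypothetical stand-in for an input table registered with a backend."""

    physical_name: str
    columns: List[str]


def vertically_concatenate_sql(
    input_tables: Dict[str, InputTable],
    source_dataset_column_name: str = "source_dataset",
) -> str:
    """Build SQL for __splink__df_concat from the input tables alone.

    No linker is needed: everything the SQL depends on is passed in explicitly.
    Assumption: every input table has the same columns as the first one.
    """
    if not input_tables:
        raise ValueError("at least one input table is required")

    first = next(iter(input_tables.values()))
    cols = ", ".join(first.columns)

    # A single input table needs no union and no source dataset column.
    if len(input_tables) == 1:
        return f"select {cols} from {first.physical_name}"

    selects = [
        f"select '{alias}' as {source_dataset_column_name}, {cols} "
        f"from {table.physical_name}"
        for alias, table in input_tables.items()
    ]
    return "\nunion all\n".join(selects)


class SketchLinker:
    """Hypothetical minimal linker that holds only the input tables."""

    def __init__(self, input_tables: Dict[str, InputTable]):
        self.input_tables = input_tables


def compute_df_concat(linker: SketchLinker) -> str:
    """Linker-facing wrapper: still takes the linker, but only unpacks it.

    In this sketch it returns the SQL; a real version would execute it.
    """
    return vertically_concatenate_sql(linker.input_tables)


if __name__ == "__main__":
    tables = {
        "left": InputTable("df_left", ["unique_id", "first_name"]),
        "right": InputTable("df_right", ["unique_id", "first_name"]),
    }
    # Linker-free call, e.g. from profile_columns or blocking analysis.
    print(vertically_concatenate_sql(tables))
    # Linker-based call produces the same SQL.
    print(compute_df_concat(SketchLinker(tables)))
```

The idea is that linker-free callers (profiling, blocking analysis) use the SQL-building function directly, while the linker-facing function stays a thin wrapper that pulls what it needs off the linker.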