moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

[FEAT] Add summary statistics for blocks #1106

Closed RossKen closed 3 months ago

RossKen commented 1 year ago

Is your proposal related to a problem?

This comes from discussion #1103, where @illeamb asked if there were any functions to look at the distributions/summary stats of block sizes.

It feels like this could be a good diagnostic tool alongside linker.cumulative_num_comparisons_from_blocking_rules_chart()

cumulative_comparisons_generated_by_blocking_rules(linker,blocking_rules) already returns a list with the number of comparisons for each rule, so it should be fairly straightforward to get some summary stats out.

Describe the solution you'd like

Unsure at this point. It could be a histogram of the distribution of block sizes or a simple table of summary stats. Either way, getting the result of cumulative_comparisons_generated_by_blocking_rules(linker,blocking_rules) into some tabular format would be a good start.
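To make the idea of block-size summary stats concrete, here is a minimal sketch in plain Python. It does not use Splink: the toy records, the single-column blocking key first_name, and the helper block_size_summary are all hypothetical, and it assumes a dedupe-style link where a block of n records yields n * (n - 1) / 2 pairwise comparisons.

```python
from collections import Counter
from statistics import mean, median

# Toy input rows (hypothetical); a real table would come from the linker's input data.
records = [
    {"id": 1, "first_name": "amy"},
    {"id": 2, "first_name": "amy"},
    {"id": 3, "first_name": "amy"},
    {"id": 4, "first_name": "bob"},
    {"id": 5, "first_name": "bob"},
    {"id": 6, "first_name": "cat"},
]


def block_size_summary(rows, key):
    """Summarise block sizes for an exact-match blocking rule on `key`.

    Assumes deduplication of a single table, so a block of n records
    generates n * (n - 1) // 2 pairwise comparisons.
    """
    # Number of records falling into each block (one block per distinct key value).
    sizes = Counter(row[key] for row in rows)
    values = list(sizes.values())
    comparisons = [n * (n - 1) // 2 for n in values]
    return {
        "n_blocks": len(values),
        "min_size": min(values),
        "median_size": median(values),
        "mean_size": mean(values),
        "max_size": max(values),
        "total_comparisons": sum(comparisons),
    }


summary = block_size_summary(records, "first_name")
print(summary)
```

The same per-block counts could feed a histogram instead of a table; the summary dict is just the simplest tabular starting point.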

Describe alternatives you've considered

Additional context

sama-ds commented 1 year ago

@RossKen given that cumulative_comparisons_generated_by_blocking_rules(linker,blocking_rules) already produces a row count per blocking rule, is there any reason we can't add an output_type arg to control whether that's returned as a dict or a dataframe? Though that feels too simple to need a feature, as it's literally just calling pd.DataFrame on the output.

ThomasHepworth commented 1 year ago

@sama-ds feel free to make a PR that adds that functionality, if you think it'd be useful. As you say, it's just an if + pd.DataFrame wrapper.
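For illustration, the "if + pd.DataFrame wrapper" discussed above might look roughly like the sketch below. This is not Splink's implementation: the helper name format_comparison_counts, the record keys blocking_rule and row_count, and the sample numbers are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical per-rule counts, shaped as a list of records.
# The real function's output may use different keys.
counts = [
    {"blocking_rule": "l.first_name = r.first_name", "row_count": 1200},
    {"blocking_rule": "l.surname = r.surname", "row_count": 800},
]


def format_comparison_counts(records, output_type="dict"):
    """Return the per-rule counts unchanged, or wrapped in a pandas DataFrame."""
    if output_type == "dict":
        return records
    if output_type == "dataframe":
        # One row per blocking rule, one column per record key.
        return pd.DataFrame(records)
    raise ValueError(f"Unknown output_type: {output_type!r}")


df = format_comparison_counts(counts, output_type="dataframe")
print(df)
# Summary stats then come for free from pandas.
print(df["row_count"].describe())
```

Once the counts are in a DataFrame, describe() already gives a basic summary-stats table, which covers part of the original request.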

sama-ds commented 1 year ago

WIP Pull Request

RobinL commented 3 months ago

This is in Splink 4 here: https://moj-analytical-services.github.io/splink/api_docs/blocking_analysis.html#splink.blocking_analysis.n_largest_blocks