Closed RossKen closed 3 months ago
@RossKen given that the output of cumulative_comparisons_generated_by_blocking_rules(linker,blocking_rules)
already produces a row count per blocking rule, any reason we can't add an output_type arg as to whether that's outputted as a dict or a dataframe? Though that feels too simple to need a feature as it's literally just calling pd.DataFrame on the output.
@sama-ds feel free to make a PR that adds that functionality, if you think it'd be useful. As you say, it's just an if + pd.DataFrame wrapper.
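For reference, a minimal sketch of what that wrapper could look like. The helper name `format_comparison_counts`, the `output_type` values, and the sample data are all hypothetical; in particular, the list-of-dicts shape with `rule` and `row_count` keys is an assumption about the function's output, not something confirmed in this thread.

```python
import pandas as pd

# Hypothetical stand-in for the output of
# cumulative_comparisons_generated_by_blocking_rules(linker, blocking_rules).
# The real keys and structure may differ; this shape is assumed.
sample_counts = [
    {"rule": "l.first_name = r.first_name", "row_count": 1200},
    {"rule": "l.surname = r.surname", "row_count": 800},
    {"rule": "l.dob = r.dob", "row_count": 150},
]


def format_comparison_counts(counts, output_type="dict"):
    """Return the per-rule counts unchanged, or wrapped in a DataFrame."""
    if output_type == "dict":
        return counts
    if output_type == "dataframe":
        # One row per blocking rule, one column per key.
        return pd.DataFrame(counts)
    raise ValueError(f"Unknown output_type: {output_type!r}")


counts_df = format_comparison_counts(sample_counts, output_type="dataframe")
print(counts_df)
```

Raising on an unrecognised `output_type` rather than silently falling back to the dict keeps typos in the argument from going unnoticed.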
Is your proposal related to a problem?
This comes from discussion #1103, where @illeamb asked if there were any functions to look at the distributions/summary stats of block sizes.
It feels like this could be a good diagnostic tool alongside
linker.cumulative_num_comparisons_from_blocking_rules_chart()
Using
cumulative_comparisons_generated_by_blocking_rules(linker,blocking_rules)
already gives a list with the numbers of comparisons, so it should be fairly straightforward to get some summary stats out.
Describe the solution you'd like
Unsure at this point. Could be a histogram of the distribution of size of blocks or a simple table of summary stats. Either way, getting the result of
cumulative_comparisons_generated_by_blocking_rules(linker,blocking_rules)
into some tabular format would be a good start.
Describe alternatives you've considered
Additional context
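As a rough illustration of the "simple table of summary stats" idea, here is a sketch that uses only the standard library. The `row_counts` values are made up, and `summarise_counts` is a hypothetical helper. Note that these are per-rule comparison counts, not the sizes of individual blocks, so this is a starting point rather than the full block-size distribution.

```python
import statistics

# Hypothetical per-rule comparison counts, standing in for the numbers
# returned by cumulative_comparisons_generated_by_blocking_rules.
row_counts = [1200, 800, 150]


def summarise_counts(values):
    """Compute basic summary statistics for a list of comparison counts."""
    ordered = sorted(values)
    return {
        "n_rules": len(ordered),
        "total": sum(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
    }


summary = summarise_counts(row_counts)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```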