maxharlow / csvmatch

🔎 Finds fuzzy matches between CSV files

Add field weightings for fuzzy matching, update requirements #19

Closed hmnd closed 5 years ago

hmnd commented 6 years ago

@maxharlow would you rather I leave out the style changes?

maxharlow commented 6 years ago

Hi @hmnd -- sorry for taking so long to get back to you. Would you mind removing the style changes?

To explain: up until now this has been a personal project -- albeit one I've encouraged others to use -- so it's written in my own idiosyncratic Python style, which I prefer to PEP 8. However, if I start getting more pull requests of substance like yours, I'll reconsider this to ease such contributions.

hmnd commented 6 years ago

No worries :) I've reverted the styling changes.

maxharlow commented 6 years ago

Ok, I've just released v1.18, which refactors the way matchings work. It doesn't include weightings, but it does let you specify a different threshold for each field -- does that work for your use case?

hmnd commented 5 years ago

Sorry for the delay in my reply. Different thresholds per field are still different from weightings. Weightings allow you to create a balance of a number of fields to account for known inaccuracies in data. For instance, in one project, I'm matching on name and address. Since addresses change and may not be correct, I have a balance of 85%/15% between name/address, so a person may not match based on address, but will still match on name. Is that a bit more clear?
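
To illustrate the weighting idea described above, here is a minimal Python sketch. It is not csvmatch's implementation or the code from this pull request: the `similarity` and `weighted_score` helpers, the sample rows, and the 0.8 overall cutoff are all assumptions made for illustration, and `difflib.SequenceMatcher` stands in for whatever fuzzy metric is actually used.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(row1: dict, row2: dict, weights: dict) -> float:
    """Combine per-field similarities into a single score (hypothetical helper)."""
    total = sum(weights.values())
    return sum(w * similarity(row1[f], row2[f]) for f, w in weights.items()) / total


# The 85%/15% name/address balance from the comment above.
WEIGHTS = {"name": 0.85, "address": 0.15}

# Made-up sample rows: same person, but the address has changed.
a = {"name": "Jane Smith", "address": "12 High Street"}
b = {"name": "Jane Smith", "address": "40 Park Avenue"}

score = weighted_score(a, b, WEIGHTS)
is_match = score >= 0.8  # assumed overall cutoff, not a csvmatch default
```

Because the name contributes 85% of the score, an exact name match keeps the combined score at 0.85 or above regardless of how poorly the addresses compare.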

maxharlow commented 5 years ago

I appreciate that the two are different concepts, but does it let you do what you need here?

For the project you described, that might be something like:

 $ csvmatch -1 name address -2 name address -t 0.8 0.1 first.csv second.csv

Apologies if there's some important nuance that I've missed!
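
For comparison, a per-field threshold check could be sketched in Python as follows. This is only an illustration under a loud assumption: that a pair of rows matches when every field clears its own threshold. How csvmatch actually combines per-field thresholds is not stated in this thread. The `similarity` and `passes_thresholds` helpers and the sample rows are hypothetical, and `difflib.SequenceMatcher` is a stand-in similarity measure.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def passes_thresholds(row1: dict, row2: dict, thresholds: dict) -> bool:
    """Assumed rule: every field must meet or exceed its own threshold."""
    return all(similarity(row1[f], row2[f]) >= t for f, t in thresholds.items())


# Mirrors `-t 0.8 0.1` from the command above: strict on name, lenient on address.
THRESHOLDS = {"name": 0.8, "address": 0.1}

# Made-up sample rows for illustration.
first = {"name": "Jane Smith", "address": "12 High Street"}
missing_address = {"name": "Jane Smith", "address": ""}

same_row_matches = passes_thresholds(first, first, THRESHOLDS)
missing_address_matches = passes_thresholds(first, missing_address, THRESHOLDS)
```

Under this assumed rule, a very low address threshold tolerates most address changes, but a row whose address is empty scores 0.0 on that field and is rejected even when the name is identical. A weighted score would still accept it on the strength of the name alone.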