rufuspollock-okfn / reconcile-csv

A simple OpenRefine reconciliation service that runs on top of a CSV file
BSD 2-Clause "Simplified" License
117 stars 28 forks

Everything gets matched against first line of CSV #23

Open NumerousHats opened 9 years ago

NumerousHats commented 9 years ago

I have a canonical list of law firm names, and I am trying to fuzzy match them against a column of messy, user-generated names in OpenRefine.

All seems to work with no errors, except that every single name in the OpenRefine column appears to be matching to the first line of the canonical list, even though there are exact matches present.

Here is the CSV file being read into reconcile-csv:

firm,key
Aaronson Rappaport,1
Adams Reese,2
Adelson Testan,3
Adler Pollock,4
Ahlers Cooney,5
Ahmuty Demers,6
Akerman,7
Akin Gump,8
Allen Kopet,9
Allen Matkins,10
Alston Bird,11
Alston Hunt,12
Alvarado Smith,13
Anderson Kill,14
Andrews Kurth,15
Archer Greiner,16
Archer Norris,17
Arent Fox,18
Armstrong Teasdale,19
Arnall Golden,20
Arnold Porter,21
Arnstein Lehr,22
Arthur Chapman,23

and here is some made-up data that I have in OpenRefine:

Akerman
Akin Gump Something Something Else
Whatsa
Allen Thingy
Alston Bird
Alston Hunter
Alvarado Gracioso
Anderson Killer
Andrews Girth
Archer Greiner
Archer Norris Joe & Bob
Aberrant Fox
Armstrong Teasdale
Arnall Golden Dawn
Arnold Porter

As you can see, this contains exact matches, various misspellings, "extra text", and complete non-matches.
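
For reference, here is a minimal sketch of what a per-name fuzzy match over this sample might be expected to return. It uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, which is an assumption for illustration and not reconcile-csv's own scoring. `best_match` is a hypothetical helper.

```python
import difflib

# Canonical firm names, copied from the CSV above.
CANONICAL = [
    "Aaronson Rappaport", "Adams Reese", "Adelson Testan", "Adler Pollock",
    "Ahlers Cooney", "Ahmuty Demers", "Akerman", "Akin Gump", "Allen Kopet",
    "Allen Matkins", "Alston Bird", "Alston Hunt", "Alvarado Smith",
    "Anderson Kill", "Andrews Kurth", "Archer Greiner", "Archer Norris",
    "Arent Fox", "Armstrong Teasdale", "Arnall Golden", "Arnold Porter",
    "Arnstein Lehr", "Arthur Chapman",
]

# Messy names, copied from the OpenRefine column above.
MESSY = [
    "Akerman", "Akin Gump Something Something Else", "Whatsa", "Allen Thingy",
    "Alston Bird", "Alston Hunter", "Alvarado Gracioso", "Anderson Killer",
    "Andrews Girth", "Archer Greiner", "Archer Norris Joe & Bob",
    "Aberrant Fox", "Armstrong Teasdale", "Arnall Golden Dawn", "Arnold Porter",
]


def best_match(name, candidates):
    """Return the (candidate, score) pair with the highest similarity ratio."""
    scored = [
        (c, difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio())
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


# Print the top candidate for every messy name.
for messy_name in MESSY:
    match, score = best_match(messy_name, CANONICAL)
    print(f"{messy_name!r} -> {match!r} ({score:.2f})")
```

With any reasonable similarity measure, the exact matches should score highest against their own canonical entry, so not every row should resolve to the first line of the CSV.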

I started reconcile-csv as java -Xmx2g -jar reconcile-csv-0.1.2.jar canonical.csv firm key, and after adding the local reconciliation service and running the reconciliation, the result looks like this:

[Screenshot: OpenRefine reconciliation results]

It looks like everything is matching to "Aaronson Rappaport" (the first line of the CSV file). Is this a bug, or am I doing something stupid?
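
One way to narrow this down is to query the reconciliation service directly, bypassing OpenRefine, and look at the candidates it returns. The sketch below assumes the service listens at `http://localhost:8000/reconcile` and accepts the batch `queries` parameter of the OpenRefine reconciliation API over POST. Both are assumptions, so substitute the URL that was added in OpenRefine. `build_queries`, `reconcile`, and `print_top_candidates` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumption: reconcile-csv is reachable here; use the URL registered in OpenRefine.
SERVICE_URL = "http://localhost:8000/reconcile"


def build_queries(names):
    """Build a batch payload of the form {"q0": {"query": name}, "q1": ...}."""
    return {f"q{i}": {"query": name} for i, name in enumerate(names)}


def reconcile(names, url=SERVICE_URL):
    """POST the batch as a form-encoded 'queries' field and decode the JSON reply."""
    form = {"queries": json.dumps(build_queries(names))}
    body = urllib.parse.urlencode(form).encode("ascii")
    with urllib.request.urlopen(url, data=body, timeout=10) as resp:
        return json.load(resp)


def print_top_candidates(names, url=SERVICE_URL):
    """Print the first candidate the service returns for each name."""
    response = reconcile(names, url)
    for key, name in zip(build_queries(names), names):
        candidates = response.get(key, {}).get("result", [])
        print(name, "->", candidates[0] if candidates else None)
```

Calling `print_top_candidates(["Akerman", "Alston Bird", "Whatsa"])` while the service is running would show whether the service itself returns "Aaronson Rappaport" for every query, or whether the problem appears only inside OpenRefine.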

mihi-tr commented 9 years ago

This looks like an interesting bug, as if reconcile-csv can only read the first line. Which platform are you running on?

NumerousHats commented 9 years ago

MacOS 10.10.5 (corrected typo: I initially wrote 10.5.5 as a finger-slip)