maxharlow / csvmatch

🔎 Finds fuzzy matches between CSV files

dash character and -a option #28

Closed aborruso closed 5 years ago

aborruso commented 5 years ago

Hi, I have these two input files

Name,Age
Andy,32
Mary-Jane,43

Name,City
Andy,Rome
Mary Jane,New York

If I run

csvmatch -i -a -n input_01.csv input_02.csv --fields1 "Name" --fields2 "Name"

I get

Name,Name
Andy,Andy

With -a, "Mary-Jane" should be treated as equal to "Mary Jane". Isn't the dash a non-alphanumeric character?

Thank you

aborruso commented 5 years ago

If I create this rule.txt file

 $

and run

csvmatch -i -a -n -l rule.txt input_01.csv input_02.csv --fields1 "Name" --fields2 "Name"

I also get Mary Jane

Name,Name
Andy,Andy
Mary-Jane,Mary Jane

But that makes no sense to me, because the only rule I added is " $", a single space at the end of the line.

Thank you

maxharlow commented 5 years ago

Ok, I'm not sure if this is a bug or just something that isn't clear in the documentation.

The reason it happens is that 'ignoring' nonalphanumeric characters means they are removed -- so Mary-Jane becomes MaryJane, which doesn't match. A workaround would be to use something like Levenshtein, which even with a high threshold like 85% should produce a match.
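To illustrate the point above, here is a minimal sketch in plain Python, not csvmatch's actual code. The helper names (`strip_nonalphanumeric`, `levenshtein`, `similarity`) are hypothetical, and two things are assumptions: that whitespace survives the `-a` normalisation (consistent with the behaviour described above), and that similarity is measured as one minus the edit distance divided by the longer length, which may differ from how csvmatch scores matches.

```python
import re


def strip_nonalphanumeric(value: str) -> str:
    # Assumption: drop everything that is not a word character or whitespace.
    # Note that \w also keeps underscores and accented letters.
    return re.sub(r"[^\w\s]", "", value)


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    # Assumed scoring: 1.0 means identical, 0.0 means nothing in common.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


left = strip_nonalphanumeric("Mary-Jane").lower()   # "maryjane"
right = strip_nonalphanumeric("Mary Jane").lower()  # "mary jane"
print(left == right)            # False: an exact comparison fails
print(similarity(left, right))  # one edit over nine characters, about 0.89
```

Under these assumptions the two names differ by a single inserted space, so they fail an exact comparison but clear an 85% threshold.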

One option to resolve problems like this would be to replace nonalphanumerics with spaces instead of removing them. Of course with cases like Lastname, Firstname, you'd end up with two spaces, but with a flag to ignore repeating whitespace characters (such as your suggestion in https://github.com/maxharlow/csvmatch/issues/29) it could work quite well.
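A rough sketch of that idea in plain Python, assuming the replacement is done with regular expressions. The function name `normalise` is hypothetical and this is not necessarily how csvmatch would implement it.

```python
import re


def normalise(value: str) -> str:
    # Step 1: replace each non-alphanumeric, non-whitespace character with a space
    # (instead of removing it outright).
    spaced = re.sub(r"[^\w\s]", " ", value)
    # Step 2: collapse runs of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", spaced).strip()


print(normalise("Mary-Jane"))            # Mary Jane
print(normalise("Mary Jane"))            # Mary Jane
print(normalise("Lastname, Firstname"))  # Lastname Firstname
```

With both steps applied, "Mary-Jane" and "Mary Jane" normalise to the same string, and the comma case no longer leaves a double space.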

aborruso commented 5 years ago

Hi @maxharlow thank you.

But why does it work with " $" in the -l file? Why does "Mary Jane" match "MaryJane"?

maxharlow commented 5 years ago

Afraid I wasn't able to replicate that. Would you mind checking again?

aborruso commented 5 years ago

@maxharlow look here http://youtu.be/cUfAunJnUuU?hd=1

My files are

# rule.txt
 $
# input_01.csv
Name,Age
Andy,32
Mary-Jane,43
Andrè,50
# input_02.csv
Name,City
Andy,Rome
Mary Jane,New York
Andre',Palermo

maxharlow commented 5 years ago

How odd. I've tried it with files exactly the same as yours. A single space in rule.txt would make sense, as that would remove the space from the second file, and the -a would remove the hyphen from the first. But with the $, it shouldn't match, and I don't know why it does for you.
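The difference between the two rules can be sketched in plain Python. This assumes each line of the -l file is treated as a regular expression whose matches are removed from the value; `apply_rules` is a hypothetical helper, not csvmatch's own function.

```python
import re


def apply_rules(value: str, patterns: list[str]) -> str:
    # Assumption: every pattern is a regular expression and its matches are deleted.
    for pattern in patterns:
        value = re.sub(pattern, "", value)
    return value


# " $" only matches a space at the very end of the value, so nothing is removed.
print(apply_rules("Mary Jane", [r" $"]))  # Mary Jane

# A single space matches the space between the two names.
print(apply_rules("Mary Jane", [r" "]))   # MaryJane
```

Under that assumption only the single-space rule turns "Mary Jane" into "MaryJane", which is what "Mary-Jane" becomes once -a removes the hyphen.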

aborruso commented 5 years ago

Ok @maxharlow, I'm closing this: I got a good answer to my question.

Thank you

maxharlow commented 5 years ago

As of v1.19 this should now work as you originally expected

aborruso commented 5 years ago

Wow, I'm very proud of that too :)