gristlabs / grist-core

Grist is the evolution of spreadsheets.
https://www.getgrist.com
Apache License 2.0
7.29k stars 323 forks source link

Fuzzy-matching in Grist #766

Open vviers opened 1 year ago

vviers commented 1 year ago

Context

A recurring need for us at ANCT is being able to match values that are "close enough". Often when importing pre-existing Excel files into Grist, we end up doing a lot of data-cleaning and having Grist be able to "fuzzy-match" stuff (up to a pre-determined level of acceptable fuzziness) would be incredibly useful.

We have built some fuzzy-matching custom widgets (for example, this one that tries to disambiguate city names and retrieve their statistical city code) and I'm now using the Jupyter custom widget (yay, it's so cool !) along with the fuzzywuzzy python package for some of my use-cases. However I still believe that Grist itself could benefit from this behavior :

  1. when using the lookup functions
  2. when setting up a column into a Reference Column
  3. (less sure about this) when importing a document into a Grist document using "update existing records"

For now, I think focusing on 1. would be more than enough as a proof of concept.

Proposed implementation and strategy

I would like to introduce the FUZZY function based on the Levenshtein distance that would work a little bit like the CONTAINS marker. It would take a minimal similarity ratio as a parameter. It would make sense to me that it could be used with lookupRecords in conjonction with CONTAINS.

# only return something if it has a 90% similarity ratio, default to returning the best match if several are above this threshold
Table.lookupOne(food=FUZZY($favorite_meal, min_similarity=.9))

# only return if at least one element in the list has a 90% similarity ratio
Table.lookupRecords(food=CONTAINS(FUZZY($favorite_meal, .9)))

Looking at the code, it seems that most of the heavy lifting required to make this work would be in the twowaymap module : https://github.com/gristlabs/grist-core/blob/main/sandbox/grist/twowaymap.py. @paulfitz you seem to be the one who wrote most of this part of the code, any intuition as to whether this sounds feasible ?

I found this fuzzymap project that might provide inspiration for an implementation.

alexmojaki commented 1 year ago

@dsagal wrote twowaymap, I think you're seeing https://github.com/gristlabs/grist-core/commit/b82eec714a2e3e34f55627f3c94f5ebef11dede8 in the git blame which was when Paul moved a whole bunch of files.

Something like this is possible, but doing it in a scalable and efficient way is more complicated and maybe not possible at all, because twowaymap (and dicts/sets in general) rely on hashing for efficiency. I'd be very nervous about letting users naively add FUZZY to a formula and turn lookupRecords from an O(1) operation to an O(n) operation.

alexmojaki commented 1 year ago

Here's an attempt to do some very crude fuzzy lookups that shouldn't be too slow, assuming that strings aren't very long and only differ by one or two characters: https://public.getgrist.com/skpYXCTA9FQR/Fuzzy-lookup