Open vviers opened 1 year ago
@dsagal wrote twowaymap, I think you're seeing https://github.com/gristlabs/grist-core/commit/b82eec714a2e3e34f55627f3c94f5ebef11dede8 in the git blame which was when Paul moved a whole bunch of files.
Something like this is possible, but doing it in a scalable and efficient way is more complicated and maybe not possible at all, because twowaymap (and dicts/sets in general) rely on hashing for efficiency. I'd be very nervous about letting users naively add FUZZY
to a formula and turn lookupRecords
from an O(1) operation to an O(n) operation.
Here's an attempt to do some very crude fuzzy lookups that shouldn't be too slow, assuming that strings aren't very long and only differ by one or two characters: https://public.getgrist.com/skpYXCTA9FQR/Fuzzy-lookup
Context
A recurring need for us at ANCT is being able to match values that are "close enough". Often when importing pre-existing Excel files into Grist, we end up doing a lot of data-cleaning and having Grist be able to "fuzzy-match" stuff (up to a pre-determined level of acceptable fuzziness) would be incredibly useful.
We have built some fuzzy-matching custom widgets (for example, this one that tries to disambiguate city names and retrieve their statistical city code) and I'm now using the Jupyter custom widget (yay, it's so cool !) along with the
fuzzywuzzy
python package for some of my use-cases. However I still believe that Grist itself could benefit from this behavior :lookup
functionsFor now, I think focusing on 1. would be more than enough as a proof of concept.
Proposed implementation and strategy
I would like to introduce the
FUZZY
function based on the Levenshtein distance that would work a little bit like theCONTAINS
marker. It would take a minimal similarity ratio as a parameter. It would make sense to me that it could be used with lookupRecords in conjonction withCONTAINS
.Looking at the code, it seems that most of the heavy lifting required to make this work would be in the twowaymap module : https://github.com/gristlabs/grist-core/blob/main/sandbox/grist/twowaymap.py. @paulfitz you seem to be the one who wrote most of this part of the code, any intuition as to whether this sounds feasible ?
I found this
fuzzymap
project that might provide inspiration for an implementation.