kentcdodds / match-sorter

Simple, expected, and deterministic best-match sorting of an array in JavaScript
https://npm.im/match-sorter
MIT License
3.73k stars 129 forks source link

feat: support finding data with typos #153

Closed overengineered closed 3 days ago

overengineered commented 1 week ago

What: Increases tolerance for typos in data and/or query text

Why: match sorter currently finds data with some typos, but others. E.g. "canceled" would find "cancelled", but not "cacneled".

How: getClosenessRanking calculator allows skipping one of the characters from the query. This allows finding data that contains the most common typos: swapped letters, missing letters and diverging spelling (localisation/localization). Matches that have missing letters have reduced ranking.

Checklist:

Since this change only affects searches where no good match (like full word, or exact start or word) gets found, I think there's no need to have add to options some flag to disable/enable this feature. If searching ua and finding United States of America is good enough right now, then my changes that improve tolerance for typos should be also acceptable.

But I can see that there could be some value in providing configuration for this new feature - and I am open to do it.

overengineered commented 1 week ago

I think ranking level is not really appropriate for controlling this functionality. As an example for query defer with proposed typo tolerance the following matches are ordered like this:

lawyer defends victims United States of America indestructible dog crate

All those results are not good matches for the query, but when ordering for similarity to defer in my mind this order seems correct. Only the second match would be findable without typo tolerance. Essentially, typo tolerance adds both "better quality" results and "lower quality" results. So it seems that it would be best for typo tolerance to be in MATCHES ranking and let scoring function do its work.

In my view the MATCHES ranking is for results where no good matches were found and we are scraping the bottom of the barrel to see if something might be similar. Even current approach for MATCHES ranking can find lots of weird matches - and proposed changes do not drastically change behaviour. If spurious matches are not appropriate, increasing threshold seems like the best way to turn it off.

But if this new matching should be controlled separately, I would do it like this:

  interface MatchSorterOptions<ItemType = unknown> {
    <..>
    sorter?: Sorter<ItemType>
+   fuzzy?: 'scattered' | 'partial'
  }

I think true / false do work here as current MATCHES behaviour is an already little bit fuzzy. I think these constants fittingly describe behaviour differences.

overengineered commented 1 week ago

I just had an idea to allow typo tolerance with ranking like this:

- MATCHES
+ TIGHT_PARTIAL
+ SCATTERED

Basically typo tolerant matching would be higher than the current MATCHES matching. However my proposed code change would have to be modified so that it allowed skipping letters only if matching letters are packed tightly. I would imagine restricting finding all the letters except one within a span that is just a few letters longer than initial query would almost always be higher quality match than finding all letters, but widely scattered.

kentcdodds commented 3 days ago

I think you're right. This is fine. If people find it's problematic then we can ship a breaking change with new levels.

github-actions[bot] commented 3 days ago

:tada: This PR is included in version 6.4.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

diegohaz commented 3 days ago

FYI, this change caused one of our automated tests to fail: https://github.com/ariakit/ariakit/pull/4218/files#diff-ffb64734a719f82219f5f82d3ee8c52e111c4a560940b6bb14b60ab6d3e74de0

This isn't a problem. I'll update the test. The UI works well, and I believe it's a good feature. I'm sharing this so you might consider making it optional to avoid any friction until the next major.

kentcdodds commented 3 days ago

@diegohaz, I think I'll just do a major version bump and deprecate 6.4.0

kentcdodds commented 3 days ago

I've deprecated 6.4.0 and released 7.0.0: https://github.com/kentcdodds/match-sorter/releases/tag/v7.0.0