tc39 / ecma402

Status, process, and documents for ECMA 402
https://tc39.es/ecma402/

Add locale sensitive substring matching functions to Intl.Collator #506

Open devongovett opened 4 years ago

devongovett commented 4 years ago

It would be useful for Intl.Collator to support some additional methods for locale sensitive substring matching. For example startsWith, endsWith, includes, indexOf, etc. These methods exist on String.prototype, but do not support the options that Intl.Collator provides for case, diacritic, and punctuation insensitive searching.

It is possible to implement these on top of the existing collator compare function, but it is challenging and potentially hacky to do so correctly. I think a built-in solution in the language would make sense for a lot of use cases, e.g. client-side filtering of lists.

Has there been a proposal for anything like this already?

sffc commented 4 years ago

What you can do today:

  1. Use Intl.Segmenter to get grapheme cluster boundaries
  2. Use regular expressions with Unicode properties to remove punctuation characters
  3. Use String.prototype.toLocaleLowerCase to do poor-man's case-insensitive searching

What you can't currently do:

  1. Proper Unicode case folding; tracked in #99
  2. NFKC/NFKD normalization (not currently tracked anywhere)
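The workaround steps above can be sketched roughly as follows (the helper names are mine, and this is only an approximation of collation-aware search; for instance, it cannot equate æ with ae):

```javascript
// Rough approximation of locale-insensitive "includes" using today's APIs.
// NOT equivalent to true collation-aware search; just a sketch.
function naiveSearchKey(s, locale) {
  return s
    .normalize('NFD')                // decompose, e.g. 'é' -> 'e' + U+0301
    .replace(/[\p{Mn}\p{P}]/gu, '')  // drop combining marks and punctuation
    .toLocaleLowerCase(locale);      // poor man's case folding
}

function naiveIncludes(haystack, needle, locale) {
  return naiveSearchKey(haystack, locale).includes(naiveSearchKey(needle, locale));
}

naiveIncludes('Beyoncé!', 'beyonce', 'en');  // true
naiveIncludes('Österreich', 'oster', 'de');  // true ('Ö' decomposes to 'O' + diaeresis)
```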
leidegre commented 4 years ago

We need some reasonable way to do simple locale-sensitive string search. Intl.Segmenter ain't it. Regular expressions are useful but comparatively expensive, and there's no RegExp.escape (that proposal was rejected). There's no simple, safe, and reliable way to construct a case-insensitive regular expression for searching (and it is also slower, due to the added complexity of regular expressions).

startsWith, endsWith, includes, and indexOf are all well-known names for functions that have lots of utility but cannot reasonably be implemented using just compare.

devongovett commented 4 years ago

You can kinda do it with Intl.Collator already with some very careful slicing. But you have to make sure to normalize both strings the same way first. And if you want to skip punctuation it's more complicated. Here's a library that attempts to implement that: https://github.com/arty-name/locale-index-of.

So while it's possible, I think it would be really nice to include it in the language. This would also make it more likely that developers implement these functions correctly as opposed to naively.
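To make the "careful slicing" concrete, here is a minimal sketch (localeStartsWith is a hypothetical name; real implementations like locale-index-of must handle more cases, and the small range of trial lengths is a simplification to cope with expansions such as æ matching ae):

```javascript
// Sketch of prefix matching on top of Intl.Collator, via slicing.
// Because a needle may match a haystack prefix of a different length
// (expansions like 'æ' <-> 'ae'), we try a small range of slice lengths.
// Simplified for illustration; not a complete or efficient implementation.
function localeStartsWith(haystack, needle, collator) {
  for (let len = Math.max(1, needle.length - 2); len <= needle.length + 2; len++) {
    if (collator.compare(haystack.slice(0, len), needle) === 0) {
      return true;
    }
  }
  return false;
}

const collator = new Intl.Collator('en', { usage: 'search', sensitivity: 'base' });
localeStartsWith('Österreich', 'oster', collator);        // true
localeStartsWith('encyclopædia', 'encyclopae', collator); // true ('æ' matches 'ae')
```

Note how every trial allocates a fresh slice, which is part of why a built-in would be preferable.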

sffc commented 4 years ago

Sure; this sounds like good material for a proposal.

Would you mind sharing some code snippets to help people get a better idea? A list of function names is good, but actual code snippets are better.

devongovett commented 4 years ago

Sure. My main use cases are around filtering a list of items displayed in a user interface as the user types: for example, an autocomplete input or a filterable table. For these use cases, a fuzzier search is required rather than an exact match. We still want to show results even if the user does not type exactly the same case or diacritics as the items being filtered.

Here's an example of what I'd like to be able to do. I'm using includes here, but startsWith and endsWith are equally useful.

let items = ['Offenbach', 'Österreich', 'Aardvark', 'Kangaroo'];
let collator = new Intl.Collator('fr', {usage: 'search', sensitivity: 'base'});
let search = 'o';
let matches = items.filter(v => collator.includes(v, search));
// => ['Offenbach', 'Österreich', 'Kangaroo']

In some cases it may also be useful to know the range of the match. For example, perhaps the application wishes to highlight the matched string in the UI. In this case, indexOf may not be enough because the range that matched may have a different length than the original string. Perhaps a search method could be introduced which returns an object with both the starting and ending positions, or the starting position and a length. Or maybe this is an iterator for looping through all of the matches in a string.

let items = ['Offenbach', 'Österreich', 'Aardvark', 'Kangaroo'];
let collator = new Intl.Collator('fr', {usage: 'search', sensitivity: 'base'});
let search = 'a';
let matches = items
  .map(v => ({item: v, matches: [...collator.search(v, search)]}))
  .filter(m => m.matches.length > 0);
// =>
// [
//   {item: 'Offenbach', matches: [{index: 6, length: 1}]},
//   {item: 'Aardvark', matches: [{index: 0, length: 1}, {index: 1, length: 1}, {index: 5, length: 1}]},
//   {item: 'Kangaroo', matches: [{index: 1, length: 1}, {index: 4, length: 1}]}
// ]
leidegre commented 4 years ago

I've been here before, but this was today's reason (there's some TypeScript in there, but mostly it's just JavaScript).

// this is a React hook
function useFilter(): [string, (filter: string) => void, RegExp] {
  const [filter, setFilter] = React.useState("")

  const debounced = useDebounce(filter)

  const pattern = React.useMemo(() => {
    return new RegExp(debounced.replace(/[^\w]/g, "\\$&"), "i") // not too happy about this
  }, [debounced])

  return [filter, setFilter, pattern]
}
const COLLATOR = new Intl.Collator(undefined, { sensitivity: "base" }) // ignore case, ignore accent

// this is some render code, what I want to do here is just filter the list of stuff based on the filter
const [filter, setFilter, filterPattern] = useFilter()

data
  ?.filter((d) => filterPattern.test(d.v[0]))
  .sort((a, b) => COLLATOR.compare(a.v[0], b.v[0]))
  .slice(0, 50)
  .map((d) => {
    const key = encodeEntityId(d.e);
    return (
      <li key={key}>
        <NavLink to={`/property/${key}`}>{d.v[0]}</NavLink>
      </li>
    );
  }) ?? null;

What I would like to be able to do here is to just use the COLLATOR instance and something like includes. I would prefer to be able to do COLLATOR.includes(d.v[0], filter) over filterPattern.test(d.v[0]). Now, includes(...) is just a more convenient form of indexOf(...) !== -1 which to me is okay. indexOf can also be used in various substring matching algorithms. For example,

export type Pattern = string[]

const ANY: Pattern = []

export function createPattern(pattern: string, wildcard = "*"): Pattern {
  if (pattern === wildcard) {
    return ANY
  }
  return pattern.split(wildcard)
}

export function isMatch(s: string, pattern: string | Pattern): boolean {
  const p: Pattern = typeof pattern === "string" ? createPattern(pattern) : pattern

  if (p === ANY) {
    return true
  }

  if (p.length === 1) {
    return p[0] === s
  }

  if (!s.startsWith(p[0])) {
    return false
  }

  if (!s.endsWith(p[p.length - 1])) {
    return false
  }

  return _isMatch(s, p[0].length, p, 1, p.length - 1)
}

function _isMatch(s: string, i: number, pattern: Pattern, j: number, k: number): boolean {
  const p = pattern[j]

  if (!(i + p.length <= s.length)) {
    return false
  }

  if (j === k) {
    return true
  }

  const x = s.indexOf(p, i)

  if (x === -1) {
    return false
  }

  if (_isMatch(s, x + p.length, pattern, j + 1, k)) {
    return true
  }

  return _isMatch(s, i + 1, pattern, j, k)
}

The above subsequence matching algorithm leverages startsWith, endsWith and indexOf to figure out if a*b*c matches abc. It would be really awesome if we could have these functions available on the collator. includes isn't strictly necessary but startsWith, endsWith and indexOf are.
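As an illustration of that composition, the same '*'-only matching can be restated compactly in plain JavaScript (wildcardMatch is a hypothetical name; this greedy leftmost variant gives the same answers as the backtracking version above for patterns containing only '*'):

```javascript
// Self-contained restatement of the wildcard matcher, showing how
// startsWith, endsWith, and indexOf compose into subsequence matching.
function wildcardMatch(s, pattern, wildcard = '*') {
  const parts = pattern.split(wildcard);
  if (parts.length === 1) return s === pattern;        // no wildcard: exact match
  if (!s.startsWith(parts[0])) return false;           // anchored prefix
  if (!s.endsWith(parts[parts.length - 1])) return false; // anchored suffix
  let i = parts[0].length;
  const end = s.length - parts[parts.length - 1].length;
  for (const part of parts.slice(1, -1)) {
    // Greedy leftmost search for each middle fragment.
    const x = s.indexOf(part, i);
    if (x === -1 || x + part.length > end) return false;
    i = x + part.length;
  }
  return i <= end; // suffix must not overlap the last middle match
}

wildcardMatch('abc', 'a*b*c');   // true
wildcardMatch('axbyc', 'a*b*c'); // true
wildcardMatch('acb', 'a*b*c');   // false
```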


I would just assume that these would behave as defined today on the String prototype, but live on the Collator prototype with correspondingly identical (locale-sensitive) behavior.

I don't believe includes and lastIndexOf or other less common string utility functions are as essential as the ones above.

sffc commented 4 years ago

There is substantial prior art for Unicode-aware fuzzy text matching: UAX #15 discusses it some, as does the W3C. In ECMA-402 we should try not to reinvent the wheel.

CC @aphillips @markusicu

leidegre commented 4 years ago

"Unicode-aware fuzzy text matching"

There's nothing fuzzy about what's going on here. This is simply about being able to search for substrings using the Intl.Collator. I am not advocating for anything like isMatch actually being available on the Intl.Collator; it is just an example of what can be built using startsWith, endsWith and indexOf.

devongovett commented 4 years ago

ICU has StringSearch which looks like it uses a collator internally to implement this. See also: http://userguide.icu-project.org/collation/icu-string-search-service and http://www.unicode.org/reports/tr10/#Searching

leidegre commented 4 years ago

I would like to add that I'm trying to build a database in JavaScript, and gaining the fine-grained control over string operations needed to build case-insensitive (or collation-specific) text indexes is currently impossible (or at best extremely inefficient). I don't believe that any of the ongoing standards work here is going to help with that, nor do I believe that it will be able to address the problem in the context of my domain.

I don't know how I would move this forward, but having at the very least indexOf exposed on the Intl.Collator instance would go a long way toward solving a lot of the problems I'm currently facing in using the Intl.Collator APIs on actual problems in real-world applications.

markusicu commented 4 years ago

Note that this sort of feature needs a fair bit more code than just a Collator, and you can't just build it on top of compare(). You basically need to incrementally map appropriate sequences of characters to sequences of collation elements and match the desired parts of those (depending on the collation strength) to the collation elements for the pattern. There are options for matching behavior (asymmetric search is somewhat popular for handling diacritics) and for which match boundaries to return when there are multiple possibilities, related to Unicode normalization and/or word boundaries.

hsivonen commented 2 years ago

When proposing the scope for ICU4X collator MVP, I looked at what Firefox does for ctrl-f (not collator-based) and what CLDR search collations accomplish compared to that.

I think exposing a search-specific API that's based on fold case, ignoring diacritics, and a bunch of special cases to accomplish other things—these appear surprisingly few—that CLDR search collations accomplish (but Firefox doesn't) is seriously worth considering compared to adding a collation-based search API: The main concern of collation is to establish an ordering that is appropriate according to human expectations. However, for search, only matches need to be appropriate according to human expectations and, to the extent there is internally a concept of "less" or "greater", the order is an implementation detail.

As a result, the key design driver for what collation data needs to be like is not relevant to search. At least when the search collations are stored as separate instances of data, the cost/benefit of collation-based search (when considering the data size a cost) looks like something that could be improved upon considerably by a search-specific design. Also, (anecdotally by visual observation of ctrl-f on a large HTML file in Firefox vs. Chrome) not having to map the haystack to collation elements (fold case is a simpler transformation) seems beneficial for performance.

leidegre commented 2 years ago

Can someone explain to me what the corner cases are here...

I want to do a simple prefix search. I can do this by simply chopping strings with .slice() and the Intl.Collator just fine. The problem with .slice() is that it is suboptimal, because each slice is an additional allocation (this slows down searching considerably). Why can't we just expose sensible utility functions like startsWith and indexOf directly on the Intl.Collator that just do what the built-in JavaScript string functions do? It's just UTF-16 code points; the only difference is that their equivalence relation is locale-dependent, which is exactly why we have an Intl.Collator, no?

I get that a code point is not a "letter", but a user-submitted string will not be broken up at arbitrary code points. Even if a malicious user crafted some nasty string of code points, I am perfectly okay with them getting a nonsense search result.

arty-name commented 2 years ago

The issue title and description talk about Intl.Collator, and I feel that the discussion focuses a lot on whether Intl.Collator is an appropriate place for such APIs.

I understand that searching and collating are different tasks. Would this topic become less controversial, if instead of asking for new methods on Intl.Collator we would rather consider String.prototype?

We already have locale-aware String.prototype.localeCompare and a few other string methods with locale in their names. What if we added String.prototype.localeIndexOf as my library locale-index-of does? How about String.prototype.localeIncludes?

This way we could easily lose all objections about collation being different and irrelevant to the topic. Then we could focus on the goals of developers, like "I have to find the position of a substring in a text, while smartly ignoring case, diacritics, and punctuation".

leidegre commented 2 years ago

consider String.prototype?

While convenient, I think this is the wrong approach because each locale specific method now has to go through a lookup phase to find a "collation".

The search module from the ICU library appears to be doing exactly this, it can do indexOf and lastIndexOf using a specific collation.

Should we introduce Intl.Search, then? I'd rather just add indexOf and lastIndexOf to Intl.Collator.

johnfactotum commented 2 years ago

Can someone explain to me what the corner cases are here...

I want to do a simple prefix search. I can do this by simply chopping strings with .slice() and the Intl.Collator just fine.

Conceptually, slice() is a segmenter. It's just a non-locale-aware one (or rather, not even Unicode-aware). So I think if you're using slice() with the intention of doing something locale-dependent later, you'd usually be better off using Intl.Segmenter instead.

For example, suppose you want to tell whether f̷̮̺̻̞̭͉̖̳̲͓͕̲̞̎͌͗͌̾̓͐̌̆ȯ̸̢̩̇͊͐̅͠ö̷͕̲̜̗̻̤̲̖̘̬̳́̂͊̎̂́̌̆̏̎͛͊̓͘͜͝b̶̥̫̫͖̖̤̹̱͋̈́̓̒́̀̅̿̈́͊͜͠ą̸̪͉͓̲͚̠͕̪̿͊̒̔̈̒͑̕͜ͅr̸͇̖͉̭̀̐̇̂́͒b̴̨̢̫̪͇̗̞̮̣̃̃͗̿̕͝a̴̱̜͍͚̻̳͎̩̫͎͚͍̩̓̀͗̿̀̂̿̽͑̂̇̓́͘̚͠z̵͎̣̓̔̇͋͐ starts with foobar or foobaz using a collator. With slice(), either you give up and accept that it doesn't start with foobar, or you basically have to loop through the whole string to be sure that it doesn't start with foobaz. With Intl.Segmenter, you just take and compare the first six segments at the start of the string. Even better, since Intl.Segmenter supports different granularities, you can match whole words (which is a very common use case) by simply changing the granularity. You get this pretty much "for free" without having to change anything else in your code.

However, in many cases even Intl.Segmenter is not good enough, because it does not segment text in a collation-aware manner. For example, it's not unreasonable to expect a simple way to match encyclopædia by typing encyclopa or encyclopaedia. But that's not possible. æ is one element whether you count by code unit, code point, or grapheme, but for the purpose of string matching it's desirable to expand it into two collation elements, a and e. With Intl.Collator you can tell that encyclopæ and encyclopae are the same, but with slice() or Intl.Segmenter there's no way to avoid ending up comparing encyclopæ and encyclopa. To solve this, you need a way of iterating through each collation element.
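That collator-level equivalence is observable today at whole-string granularity (a quick demo, assuming an ICU-backed engine):

```javascript
// With primary-strength comparison, the single character 'æ' expands to
// the same collation elements as the sequence 'ae', so the strings compare equal.
const collator = new Intl.Collator('en', { usage: 'search', sensitivity: 'base' });
collator.compare('encyclopæ', 'encyclopae'); // 0 (equal)
collator.compare('encyclopæ', 'encyclopa');  // non-zero (not equal)
```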

hsivonen commented 2 years ago

Why can't we just expose sensible utility functions like startsWith and indexOf directly on the Intl.Collator that just does what the builtin JavaScript string functions do. It's just UTF-16 code points, the only difference is that that their equivalence relation is locale dependent which is exactly why we have a Intl.Collator, no?

No, it's not just UTF-16 code points. Internally, a collator implementation maps the sequence of characters to a sequence of collation elements and operates on those.

Currently, ECMA-402 exposes only the comparison operation and ICU4X's collator is at present designed to cover only the operations that are in scope of ECMA-402 at present. Extending that operation with code to support startsWith would not be a huge leap in terms of implementation.

However, extending it in an efficient way to support indexOf would be a much bigger deal. (Obviously, given a startsWith operation, you can implement inefficient indexOf by removing characters from the start of the haystack one by one and doing a startsWith test after each removal, which would map characters to collation elements again and again instead of retaining the mapped collation elements across multiple attempted start positions.)

(To be clear, my concern is not how the entry point is named but whether collation-based search is taken into Web Platform scope and the implications on an ICU4X back end. As noted, I'm somewhat skeptical of collator-based search conceptually, since much of the design of collations goes into defining culturally-relevant "less" and "greater" results while search only cares about the "equal" case being culturally-relevant and "less" and "greater" are implementation details.)

johnfactotum commented 2 years ago

Can Intl.Collator just expose methods for iterating through collation elements? It's not only useful for searching, but for ordering as well.

For example, while you can order strings with Intl.Collator, you can't divide them into sections and give them headings by collation elements, like for example how Wikipedia displays its category pages:

[screenshot: a Wikipedia category page, with entries grouped under single-letter headings]

To do this, you need to know, for example, that Æson starts with A, or in other words, that its first collation element is A. But that's currently impossible with Intl.Collator.

hsivonen commented 2 years ago

Can Intl.Collator just expose methods for iterating through collation elements?

It seems like a very bad idea to expose this kind of implementation detail in a way that would make the design of the collation elements Web-exposed and, therefore, part of the compatibility surface of the Web and, therefore, presumptively frozen.

You wouldn't know that the first collation element of Æson is A, since collation elements are not characters. At present, in both ICU4C and ICU4X, collation elements are 64-bit values (and, therefore, ill-suited for exposure as JS primitives) whose actual bit contents for a given input may vary from release to release.

johnfactotum commented 2 years ago

Well, can't you do it without exposing the raw values? Something like an opaque CollationElement object, and nothing can be done with them except iterating through them, and comparing one with another or with a string.

I think that while how collation elements are compared is an implementation detail, the collation elements themselves are not. Collation elements are the technical equivalent of the everyday concept of letters of an alphabet. And I think that everyone can agree that the ability to divide text into letters is a useful thing, much like dividing text into sentences and words.

Even just knowing the number of collation elements in a string would be immensely helpful. There might not be a perfect answer to the question "how long is this string?", but the ability to say that the word Æson has 5 letters in it is, IMHO, something very useful for the purpose of internationalization. The fact that this data is currently tied to collation is an implementation detail; its usefulness is not.

hsivonen commented 2 years ago

Collation elements are the technical equivalent of the everyday concept of letters of an alphabet.

But they aren't. There might be more or fewer collation elements than letters (for any definition of letter). What matters is that the collation elements happen to sort in an appropriate way.

And I think that everyone can agree that the ability to divide text into letters is a useful thing to do, much like dividing text into sentences and words.

Collation elements don't give you a division into letters. Collation elements give you a sequence of thingies that are useful for culturally-appropriate sorting.

But the ability to say that the word Æson has 5 letters in it, is IMHO, something very useful for the purpose of internationalization.

I have doubts about it being useful to be able to say that. In any case, counting the collation elements does not let you say it: there are multiple levels of comparison, and the primary and secondary weights, for example, might sit on one collation element or on two separate elements, each with the other level's weight marked ignorable. Thus, the number of collation elements is an implementation detail.
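For context, the comparison levels referred to here surface in ECMA-402 today only indirectly, through the collator's sensitivity option (a quick sketch, assuming an ICU-backed engine):

```javascript
// Each sensitivity setting considers one more level of collation weights.
const base    = new Intl.Collator('en', { sensitivity: 'base' });    // primary only
const accent  = new Intl.Collator('en', { sensitivity: 'accent' });  // + secondary
const variant = new Intl.Collator('en', { sensitivity: 'variant' }); // + tertiary

base.compare('a', 'á');    // 0        (accents ignored at primary strength)
accent.compare('a', 'á');  // non-zero (secondary weights distinguish accents)
base.compare('a', 'A');    // 0        (case ignored)
variant.compare('a', 'A'); // non-zero (tertiary weights distinguish case)
```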

johnfactotum commented 2 years ago

But they aren't. There might be more or fewer collation elements than letters (for any definition of letter). What matters is that the collation elements happen to sort in an appropriate way.

Collation elements don't give you a division into letters. Collation elements give you a sequence of thingies that are useful for culturally-appropriate sorting.

Culturally, in alphabetical scripts, we sort alphabetically. Therefore sorting is alphabetizing. It stands to reason that collation elements must necessarily be closely tied, in some way, to letters of an alphabet. Indeed, even in non-alphabetical writing, sorting still depends on breaking strings down into smaller units. These units, alphabetical or not, are not implementation details of sorting. They have linguistic and cultural significance.

Or, to put it another way: the information the collator has very often maps to letters of alphabets more closely than anything obtainable with the other methods currently available.

I have doubts about it being useful to be able to say that.

With Intl.Segmenter, you can segment café into four elements. This is useful for many purposes, I believe. If so, why wouldn't it be useful to divide Æson into five elements? If you do not think it valuable to say that Æson is in some sense the same as Aeson, why do you think (I presume) that it would be valuable to say that café is in some sense the same as cafe?

What's the difference between the two cases? Technically, they might be different, but culturally, I do not think there is a difference.

And with Intl.Collator you get to say exactly this. With the appropriate setting, you can say that Æson is the same as Aeson in exactly the same way that café is the same as cafe. Of course, the fact that you could say this with a collator, doesn't mean that it's the right tool for the job. But it is useful to be able to say this.

FrankYFTang commented 5 months ago

Some issues on the web about the use cases for this:

https://stackoverflow.com/questions/39548303/javascript-string-prototype-contains-with-locale

a related API from Apple https://developer.apple.com/documentation/foundation/nsstring/1412098-localizedcaseinsensitivecontains

ICU doc https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1StringSearch.html#details https://unicode-org.github.io/icu/userguide/collation/string-search.html

FrankYFTang commented 5 months ago

One simple use case in US

Search for, and find, all 17 occurrences of "beyonce" in the following article, including "Beyoncé", "Beyonce", "BEYONCE", and "beyonce". It should of course also be able to match "BEYONCÉ". https://www.musicbusinessworldwide.com/beyonce-sued-for-alleged-copyright-infringement-over-sample-on-break-my-soul/

Don't miss "Beyoncé", please! Notice that this is a feature even websites in the US need! :) Another example of the use case: imagine a web application that fetches data via Ajax and lets the user type a string to search on the client side for images matching what they type:

Timothée Chalamet, Penélope Cruz, Renée Zellweger, Clémence Poésy, Zoë Kravitz, Gisele Bündchen, Roberto Gómez Bolaños
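For reference, whole-string matching at this level of looseness already works with today's Intl.Collator; what's missing is the substring form of the same equivalence (a quick demo, assuming an ICU-backed engine):

```javascript
// Case- and accent-insensitive equality works today at whole-string granularity.
const collator = new Intl.Collator('en', { usage: 'search', sensitivity: 'base' });
collator.compare('Beyoncé', 'beyonce');         // 0
collator.compare('BEYONCÉ', 'beyonce');         // 0
collator.compare('Zoë Kravitz', 'zoe kravitz'); // 0
// But there is no collator.includes('…all 17 "beyonce" mentions…', 'beyonce').
```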

sffc commented 5 months ago

TG2 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-05-23.md#add-locale-sensitive-substring-matching-functions-to-intlcollator

It seems to be a real use case with real applications. Although ICU4C supports locale-aware string search via the Collator API, some delegates feel that this is an inferior solution to the problem that we don't want to propagate. We would like to see this problem addressed upstream in Unicode/CLDR and then we can revisit the ECMA-402 side once there is more prior art.