devongovett opened this issue 4 years ago
What you can do today:
What you can't currently do:
We need some reasonable way to do simple locale-sensitive string search. `Intl.Segmenter` ain't it. Regular expressions are useful but comparatively expensive, and there's no `RegExp.escape` (that proposal was rejected). There's no simple, safe, and reliable way to construct a case-insensitive regular expression for searching (and it's also slower, due to the added complexity of regular expressions).
`startsWith`, `endsWith`, `includes`, and `indexOf` are all well-known names for functions that have lots of utility but cannot reasonably be implemented using just `compare`.
You can kinda do it with `Intl.Collator` already with some very careful slicing. But you have to make sure to normalize both strings the same way first, and if you want to skip punctuation it's more complicated. Here's a library that attempts to implement that: https://github.com/arty-name/locale-index-of.
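For illustration, here's a minimal sketch of what that "careful slicing" approach looks like when built on `compare` alone (`naiveLocaleIndexOf` is a made-up name, and real libraries like locale-index-of handle far more edge cases than this):

```js
// A naive locale-aware indexOf built only on Intl.Collator.compare.
// Every start position and every candidate length must be tried, because
// the matched range may be longer or shorter than the needle (e.g. "æ"
// matching "ae"), which is exactly what makes this approach expensive.
function naiveLocaleIndexOf(haystack, needle, collator) {
  haystack = haystack.normalize('NFC');
  needle = needle.normalize('NFC');
  for (let i = 0; i < haystack.length; i++) {
    for (let len = 1; len <= haystack.length - i; len++) {
      if (collator.compare(haystack.slice(i, i + len), needle) === 0) {
        return {index: i, length: len};
      }
    }
  }
  return null;
}

let collator = new Intl.Collator('en', {usage: 'search', sensitivity: 'base'});
naiveLocaleIndexOf('Beyoncé Knowles', 'BEYONCE', collator);
// => {index: 0, length: 7}
```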
So while it's possible, I think it would be really nice to include it in the language. This would also make it more likely that developers implement these functions correctly as opposed to naively.
Sure; this sounds like good material for a proposal.
Would you mind sharing some code snippets to help people get a better idea? A list of function names is good, but actual code snippets are better.
Sure. My main use cases are around filtering a list of items displayed in a user interface as the user types. For example, an autocomplete input or a filterable table. For these use cases, a fuzzier search is required rather than an exact match: we still want to show results even if the user does not type exactly the same case or diacritics as the items being filtered.
Here's an example of what I'd like to be able to do. I'm using `includes` here, but `startsWith` and `endsWith` are equally useful.
```js
let items = ['Offenbach', 'Österreich', 'Aardvark', 'Kangaroo'];
let collator = new Intl.Collator('fr', {usage: 'search', sensitivity: 'base'});
let search = 'o';
let matches = items.filter(v => collator.includes(v, search));
// => ['Offenbach', 'Österreich', 'Kangaroo']
```
In some cases it may also be useful to know the range of the match. For example, perhaps the application wishes to highlight the matched string in the UI. In this case, `indexOf` may not be enough, because the range that matched may have a different length than the original string. Perhaps a `search` method could be introduced which returns an object with both the starting and ending positions, or the starting position and a length. Or maybe this is an iterator for looping through all of the matches in a string.
```js
let items = ['Offenbach', 'Österreich', 'Aardvark', 'Kangaroo'];
let collator = new Intl.Collator('fr', {usage: 'search', sensitivity: 'base'});
let search = 'a';
let matches = items
  .map(v => ({item: v, matches: [...collator.search(v, search)]}))
  .filter(m => m.matches.length > 0);
// =>
// [
//   {item: 'Offenbach', matches: [{index: 6, length: 1}]},
//   {item: 'Aardvark', matches: [{index: 0, length: 1}, {index: 1, length: 1}, {index: 5, length: 1}]},
//   {item: 'Kangaroo', matches: [{index: 1, length: 1}, {index: 4, length: 1}]}
// ]
```
I've been here before, but this was today's reason (there's some TypeScript in there, but mostly it's just JavaScript).
```ts
// this is a React hook
function useFilter(): [string, (filter: string) => void, RegExp] {
  const [filter, setFilter] = React.useState("")
  const debounced = useDebounce(filter)
  const pattern = React.useMemo(() => {
    return new RegExp(debounced.replace(/[^\w]/g, "\\$&"), "i") // not too happy about this
  }, [debounced])
  return [filter, setFilter, pattern]
}
```
```tsx
const COLLATOR = new Intl.Collator(undefined, { sensitivity: "base" }) // ignore case, ignore accent

// this is some render code, what I want to do here is just filter the list of stuff based on the filter
const [filter, setFilter, filterPattern] = useFilter()

data
  ?.filter((d) => filterPattern.test(d.v[0]))
  .sort((a, b) => COLLATOR.compare(a.v[0], b.v[0]))
  .slice(0, 50)
  .map((d) => {
    const key = encodeEntityId(d.e);
    return (
      <li key={key}>
        <NavLink to={`/property/${key}`}>{d.v[0]}</NavLink>
      </li>
    );
  }) ?? null;
```
What I would like to be able to do here is just use the `COLLATOR` instance and something like `includes`. I would prefer `COLLATOR.includes(d.v[0], filter)` over `filterPattern.test(d.v[0])`. Now, `includes(...)` is just a more convenient form of `indexOf(...) !== -1`, which to me is okay. `indexOf` can also be used in various substring matching algorithms. For example:
```ts
export type Pattern = string[]

const ANY: Pattern = []

export function createPattern(pattern: string, wildcard = "*"): Pattern {
  if (pattern === wildcard) {
    return ANY
  }
  return pattern.split(wildcard)
}

export function isMatch(s: string, pattern: string | Pattern): boolean {
  const p: Pattern = typeof pattern === "string" ? createPattern(pattern) : pattern
  if (p === ANY) {
    return true
  }
  if (p.length === 1) {
    return p[0] === s
  }
  if (!s.startsWith(p[0])) {
    return false
  }
  if (!s.endsWith(p[p.length - 1])) {
    return false
  }
  return _isMatch(s, p[0].length, p, 1, p.length - 1)
}

function _isMatch(s: string, i: number, pattern: Pattern, j: number, k: number): boolean {
  const p = pattern[j]
  if (!(i + p.length <= s.length)) {
    return false
  }
  if (j === k) {
    return true
  }
  const x = s.indexOf(p, i)
  if (x === -1) {
    return false
  }
  if (_isMatch(s, x + p.length, pattern, j + 1, k)) {
    return true
  }
  return _isMatch(s, i + 1, pattern, j, k)
}
```
The above subsequence matching algorithm leverages `startsWith`, `endsWith`, and `indexOf` to figure out if `a*b*c` matches `abc`. It would be really awesome if we could have these functions available on the collator. `includes` isn't strictly necessary, but `startsWith`, `endsWith`, and `indexOf` are.
I would just assume that these would be available as defined today on the `String` prototype, but on the `Collator` prototype, with identical behavior:

- `startsWith(s, searchString[, position])`
- `endsWith(s, searchString[, endPosition])`
- `indexOf(s, searchValue[, fromIndex])`

I don't believe `includes` and `lastIndexOf` or other less common string utility functions are as essential as the ones above.
"Unicode-aware fuzzy text matching"
There's nothing fuzzy about what's going on here. This is simply about being able to search for substrings using the Intl.Collator
. I am not advocating for anything like isMatch
actually being available on the Intl.Collator
it is just an example of what can be made using startsWith
, endsWith
and indexOf
.
ICU has `StringSearch`, which looks like it uses a collator internally to implement this. See also: http://userguide.icu-project.org/collation/icu-string-search-service and http://www.unicode.org/reports/tr10/#Searching
I would like to add that I'm trying to build a database in JavaScript, and having more fine-grained control over string operations to build case-insensitive (or collation-specific) text indexes is currently impossible (or extremely inefficient). I don't believe that any of the ongoing standards work here is going to help with that, nor do I believe it will be able to address the problem in the context of my domain.
I don't know how I would move this forward, but having at the very least `indexOf` exposed on the `Intl.Collator` instance would go a long way toward solving a lot of the problems I'm currently facing in using the `Intl.Collator` APIs to solve actual problems in real-world applications.
Note that this sort of feature needs a fair bit more code than just a Collator, and you can't just build it on top of compare(). You basically need to incrementally map appropriate sequences of characters to sequences of collation elements and match the desired parts of those (depending on the collation strength) to the collation elements for the pattern. There are options for matching behavior (asymmetric search is somewhat popular for handling diacritics) and for which match boundaries to return when there are multiple possibilities, related to Unicode normalization and/or word boundaries.
When proposing the scope for ICU4X collator MVP, I looked at what Firefox does for ctrl-f (not collator-based) and what CLDR search collations accomplish compared to that.
I think exposing a search-specific API based on case folding, ignoring diacritics, and a handful of special cases (these appear surprisingly few) covering the other things that CLDR search collations accomplish (but Firefox doesn't) is seriously worth considering, compared to adding a collation-based search API. The main concern of collation is to establish an ordering that is appropriate according to human expectations. For search, however, only the matches need to be appropriate according to human expectations; to the extent there is internally a concept of "less" or "greater", the order is an implementation detail.

As a result, the key design driver for what collation data needs to be like is not relevant to search. At least when the search collations are stored as separate instances of data, the cost/benefit of collation-based search (counting data size as a cost) looks like something that could be improved upon considerably by a search-specific design. Also (anecdotally, by visual observation of ctrl-f on a large HTML file in Firefox vs. Chrome), not having to map the haystack to collation elements (case folding is a simpler transformation) seems beneficial for performance.
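As a rough illustration of the difference, a fold-based search can be approximated in today's JavaScript with case mapping plus diacritic stripping (a sketch only: it misses the special cases CLDR search collations handle, such as `æ` matching `ae`, and `toLowerCase` is not full Unicode case folding):

```js
// Approximate a search-specific fold: decompose, drop combining marks
// (diacritics), then lower-case. Transforming the haystack this way is
// much simpler than mapping it to collation elements.
function foldForSearch(s) {
  return s
    .normalize('NFD')           // "é" -> "e" + U+0301
    .replace(/\p{M}/gu, '')     // strip combining marks
    .toLowerCase();
}

foldForSearch('Beyoncé').includes(foldForSearch('BEYONCÉ')); // => true
```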
Can someone explain to me what the corner cases are here...
I want to do a simple prefix search. I can do this by simply chopping strings with `.slice()` and the `Intl.Collator` just fine. The problem with `.slice()` is that it is suboptimal, because each slice is an additional allocation (this slows down searching considerably). Why can't we just expose sensible utility functions like `startsWith` and `indexOf` directly on the `Intl.Collator` that just do what the built-in JavaScript string functions do? It's just UTF-16 code points; the only difference is that their equivalence relation is locale-dependent, which is exactly why we have an `Intl.Collator`, no?

I get that a code point is not a "letter", but the user-submitted string will not be broken up at arbitrary code points. Even if a malicious user crafted some nasty string of code points, I am perfectly okay with them ending up with a nonsense search result.
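For reference, the slicing approach described here looks something like this (`prefixMatch` is a made-up helper; as discussed further below, length-based slicing goes wrong for expansions like `æ`):

```js
const collator = new Intl.Collator(undefined, {usage: 'search', sensitivity: 'base'});

// Prefix search by chopping the haystack with slice(). Each call
// allocates a new substring, which is the inefficiency complained
// about above.
function prefixMatch(haystack, prefix) {
  return collator.compare(haystack.slice(0, prefix.length), prefix) === 0;
}

prefixMatch('Österreich', 'oste'); // => true
```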
The issue title and description talk about `Intl.Collator`, and I feel that the discussion focuses a lot on whether `Intl.Collator` is an appropriate place for such APIs.

I understand that searching and collating are different tasks. Would this topic become less controversial if, instead of asking for new methods on `Intl.Collator`, we considered `String.prototype`?

We already have the locale-aware `String.prototype.localeCompare` and a few other string methods with `locale` in their names. What if we added `String.prototype.localeIndexOf`, as my library `locale-index-of` does? How about `String.prototype.localeIncludes`?

This way we could easily lose all objections about collation being different and irrelevant to the topic. Then we could focus on the goals of developers, like "I have to find the position of a substring in a text, while smartly ignoring case, diacritics, and punctuation".
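A purely hypothetical shape, mirroring `localeCompare`'s trailing locales/options arguments (this is not the actual API of `locale-index-of`, just an illustration):

```js
// Hypothetical String.prototype.localeIndexOf, mirroring localeCompare:
'Beyoncé Giselle Knowles'.localeIndexOf('BEYONCE', 'en', {sensitivity: 'base'});
// => 0 (hypothetical result: matches "Beyoncé", ignoring case and accents)
```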
> consider `String.prototype`?

While convenient, I think this is the wrong approach, because each locale-specific method would now have to go through a lookup phase to find a "collation".

The search module from the ICU library appears to be doing exactly this: it can do `indexOf` and `lastIndexOf` using a specific collation.

Should we introduce `Intl.Search` then? I'd rather just add `indexOf` and `lastIndexOf` to `Intl.Collator`.
> Can someone explain to me what the corner cases are here... I want to do a simple prefix search. I can do this by simply chopping strings with `.slice()` and the `Intl.Collator` just fine.
Conceptually, `slice()` is a segmenter. It's just a non-locale-aware one (or rather, not even Unicode-aware). So I think if you're using `slice()` with the intention of doing something locale-dependent later, you'd usually be better off using `Intl.Segmenter` instead.

For example, suppose you want to tell whether `f̷̮̺̻̞̭͉̖̳̲͓͕̲̞̎͌͗͌̾̓͐̌̆ȯ̸̢̩̇͊͐̅͠ö̷͕̲̜̗̻̤̲̖̘̬̳́̂͊̎̂́̌̆̏̎͛͊̓͘͜͝b̶̥̫̫͖̖̤̹̱͋̈́̓̒́̀̅̿̈́͊͜͠ą̸̪͉͓̲͚̠͕̪̿͊̒̔̈̒͑̕͜ͅr̸͇̖͉̭̀̐̇̂́͒b̴̨̢̫̪͇̗̞̮̣̃̃͗̿̕͝a̴̱̜͍͚̻̳͎̩̫͎͚͍̩̓̀͗̿̀̂̿̽͑̂̇̓́͘̚͠z̵͎̣̓̔̇͋͐` starts with `foobar` or `foobaz` using a collator. With `slice()`, either you give up and accept that it doesn't start with `foobar`, or you basically have to loop through the whole string to be sure that it doesn't start with `foobaz`. With `Intl.Segmenter`, you just take and compare the first six segments at the start of the string. Even better, since `Intl.Segmenter` supports different granularities, you can match whole words (a very common use case) by simply changing the granularity. You get this pretty much "for free" without having to change anything else in your code.
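A sketch of that segment-by-segment comparison (`graphemeStartsWith` is a made-up helper, assuming a search collator). It handles the combining-mark example above, but, as the next paragraph explains, it still fails for expansions like `æ`:

```js
function graphemeStartsWith(haystack, needle, locale) {
  const segmenter = new Intl.Segmenter(locale, {granularity: 'grapheme'});
  const collator = new Intl.Collator(locale, {usage: 'search', sensitivity: 'base'});
  const h = [...segmenter.segment(haystack)];
  const n = [...segmenter.segment(needle)];
  if (n.length > h.length) return false;
  // Compare the first n.length grapheme clusters pairwise; a cluster of
  // base letter plus combining marks compares equal to the bare base
  // letter at 'base' sensitivity, so marks never need splitting off.
  return n.every((seg, i) => collator.compare(h[i].segment, seg.segment) === 0);
}

graphemeStartsWith('Österreich', 'oster', 'de'); // => true
```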
However, in many cases even `Intl.Segmenter` is not good enough, because it does not segment text in a collation-aware manner. For example, it's not unreasonable to expect a simple way to match `encyclopædia` by typing `encyclopa` or `encyclopaedia`. But that's not possible: `æ` is one element whether you count by code unit, code point, or grapheme, but for the purpose of string matching it's desirable to expand it into two collation elements, `a` and `e`. With `Intl.Collator` you can tell that `encyclopæ` and `encyclopae` are the same, but with `slice()` or `Intl.Segmenter` there's no way of avoiding the situation where you end up comparing `encyclopæ` and `encyclopa`. To solve this, you need a way of iterating through each collation element.
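Today's collator can already confirm the equivalence, but only for whole strings, which is exactly the problem (using `usage: 'search'` here; non-zero results are shown as sign only, since only the sign is meaningful):

```js
const collator = new Intl.Collator('en', {usage: 'search', sensitivity: 'base'});

collator.compare('encyclopæ', 'encyclopae'); // => 0: "æ" matches "ae"
collator.compare('encyclopæ', 'encyclopa');  // => positive: not equal, even
// though "encyclopa" is a prefix of the match; no slice of "encyclopædia"
// compares equal to it.
```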
> Why can't we just expose sensible utility functions like `startsWith` and `indexOf` directly on the `Intl.Collator` that just do what the built-in JavaScript string functions do? It's just UTF-16 code points; the only difference is that their equivalence relation is locale-dependent, which is exactly why we have an `Intl.Collator`, no?
No, it's not just UTF-16 code points. Internally, a collator implementation maps the sequence of characters to a sequence of collation elements and operates on those.

Currently, ECMA-402 exposes only the comparison operation, and ICU4X's collator is at present designed to cover only the operations that are in scope for ECMA-402. Extending that operation with code to support `startsWith` would not be a huge leap in terms of implementation.

However, extending it in an efficient way to support `indexOf` would be a much bigger deal. (Obviously, given a `startsWith` operation, you can implement an inefficient `indexOf` by removing characters from the start of the haystack one by one and doing a `startsWith` test after each removal, but that would map characters to collation elements again and again instead of retaining the mapped collation elements across multiple attempted start positions.)
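In code, that inefficient fallback would look roughly like this (hypothetical: `collator.startsWith` does not exist today; it stands in for the extended operation described above):

```js
// Hypothetical sketch: derive indexOf from a startsWith operation by
// re-testing at every start position. Each call re-maps the remaining
// haystack to collation elements, so nothing is reused between attempts.
function inefficientIndexOf(collator, haystack, needle) {
  for (let i = 0; i < haystack.length; i++) {
    if (collator.startsWith(haystack.slice(i), needle)) {
      return i;
    }
  }
  return -1;
}
```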
(To be clear, my concern is not how the entry point is named but whether collation-based search is taken into Web Platform scope and the implications on an ICU4X back end. As noted, I'm somewhat skeptical of collator-based search conceptually, since much of the design of collations goes into defining culturally-relevant "less" and "greater" results while search only cares about the "equal" case being culturally-relevant and "less" and "greater" are implementation details.)
Can `Intl.Collator` just expose methods for iterating through collation elements? It's not only useful for searching, but for ordering as well.

For example, while you can order strings with `Intl.Collator`, you can't divide them into sections and give them headings by collation elements, like how Wikipedia displays its category pages. To do this, you need to know, for example, that `Æson` starts with `A`, or in other words, that its first collation element is `A`. But that's currently impossible with `Intl.Collator`.
> Can `Intl.Collator` just expose methods for iterating through collation elements?
It seems like a very bad idea to expose this kind of implementation detail in a way that would make the design of the collation elements Web-exposed and, therefore, part of the compatibility surface of the Web and, therefore, presumptively frozen.
You wouldn't know that the first collation element of `Æson` is `A`, since collation elements are not characters. At present, in both ICU4C and ICU4X, collation elements are 64-bit values (and, therefore, ill-suited for exposure as JS primitives) whose actual bit contents for a given input may vary from release to release.
Well, can't you do it without exposing the raw values? Something like an opaque `CollationElement` object, where nothing can be done with them except iterating through them and comparing one with another or with a string.

I think that while how collation elements are compared is an implementation detail, the collation elements themselves are not. Collation elements are the technical equivalent of the everyday concept of letters of an alphabet. And I think that everyone can agree that the ability to divide text into letters is a useful thing, much like dividing text into sentences and words.

Even just knowing the number of collation elements in a string would be immensely helpful. There might not be a perfect answer to the question "how long is this string?", but the ability to say that the word `Æson` has 5 letters in it is, IMHO, something very useful for the purpose of internationalization. The fact that this data is currently tied to collation is an implementation detail; its usefulness is not.
> Collation elements are the technical equivalent of the everyday concept of letters of an alphabet.
But they aren't. There might be more or fewer collation elements than letters (for any definition of letter). What matters is that the collation elements happen to sort in an appropriate way.
> And I think that everyone can agree that the ability to divide text into letters is a useful thing, much like dividing text into sentences and words.
Collation elements don't give you a division into letters. Collation elements give you a sequence of thingies that are useful for culturally-appropriate sorting.
> But the ability to say that the word `Æson` has 5 letters in it is, IMHO, something very useful for the purpose of internationalization.
I have doubts about it being useful to be able to say that. In any case, counting the collation elements does not let you say that: there are multiple levels of comparison, and e.g. the primary and secondary weights might sit on one collation element, or on two separate elements, each with the other weight marked ignorable on its level. Thus, the number of collation elements is an implementation detail.
> But they aren't. There might be more or fewer collation elements than letters (for any definition of letter). What matters is that the collation elements happen to sort in an appropriate way.
>
> Collation elements don't give you a division into letters. Collation elements give you a sequence of thingies that are useful for culturally-appropriate sorting.
Culturally, in alphabetical scripts, we sort alphabetically; sorting is alphabetizing. It stands to reason that collation elements must necessarily be very closely tied in some way to letters of an alphabet. Indeed, even in non-alphabetical writing, sorting still depends on breaking strings down into smaller units. These units, alphabetical or not, are not implementation details of sorting; they have linguistic and cultural significance.

Or at the very least, to put it another way, the information the collator has very often pertains to letters of alphabets more closely than what could otherwise be obtained with the other methods currently available.
> I have doubts about it being useful to be able to say that.
With `Intl.Segmenter`, you can segment `café` into four elements. This is useful for many purposes, I believe. If so, why wouldn't it be useful to divide `Æson` into five elements? If you do not think it valuable to say that `Æson` is in some sense the same as `Aeson`, why do you think (I presume) that it would be valuable to say that `café` is in some sense the same as `cafe`?

What's the difference between the two cases? Technically they might be different, but culturally I do not think there is a difference.
And with `Intl.Collator` you get to say exactly this. With the appropriate setting, you can say that `Æson` is the same as `Aeson` in exactly the same way that `café` is the same as `cafe`. Of course, the fact that you could say this with a collator doesn't mean that it's the right tool for the job. But it is useful to be able to say this.
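Both equivalences are indeed expressible with today's API (`sensitivity: 'base'` ignores both the accent difference in `café` and the `Æ`/`Ae` expansion):

```js
const collator = new Intl.Collator('en', {sensitivity: 'base'});

collator.compare('café', 'cafe');  // => 0
collator.compare('Æson', 'Aeson'); // => 0
```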
Some issues on the web about this use case:

- https://stackoverflow.com/questions/39548303/javascript-string-prototype-contains-with-locale
- A related API from Apple: https://developer.apple.com/documentation/foundation/nsstring/1412098-localizedcaseinsensitivecontains
- ICU docs: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1StringSearch.html#details and https://unicode-org.github.io/icu/userguide/collation/string-search.html
One simple use case in the US: searching for "beyonce" should find all 17 occurrences in the following article, including "Beyoncé", "Beyonce", "BEYONCE", and "beyonce". It should also be able to match "BEYONCÉ", of course. https://www.musicbusinessworldwide.com/beyonce-sued-for-alleged-copyright-infringement-over-sample-on-break-my-soul/

Don't miss "Beyoncé", please! Notice this is a feature that even websites in the US need! :) As another example of the use case, imagine you have a web application that gets data via Ajax and allows the user to type in a string on the client side to find images matching what they type:

Timothée Chalamet, Penélope Cruz, Renée Zellweger, Clémence Poésy, Zoë Kravitz, Gisele Bündchen, Roberto Gómez Bolaños
TG2 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-05-23.md#add-locale-sensitive-substring-matching-functions-to-intlcollator
It seems to be a real use case with real applications. Although ICU4C supports locale-aware string search via the Collator API, some delegates feel that this is an inferior solution to the problem that we don't want to propagate. We would like to see this problem addressed upstream in Unicode/CLDR and then we can revisit the ECMA-402 side once there is more prior art.
It would be useful for `Intl.Collator` to support some additional methods for locale-sensitive substring matching, for example `startsWith`, `endsWith`, `includes`, `indexOf`, etc. These methods exist on `String.prototype`, but do not support the options that `Intl.Collator` provides for case-, diacritic-, and punctuation-insensitive searching. It is possible to implement these on top of the existing collator `compare` function, but it is challenging and potentially hacky to do so correctly. I think a built-in solution in the language would make sense for a lot of use cases, e.g. client-side filtering of lists.

Has there been a proposal for anything like this already?