Closed by pchiusano 1 year ago
Text.indexOf looks like it has a bug for multi-byte characters:
1 | > ##Text.indexOf "foo👍🏿" "bar👍🏿foo👍🏿baz👍🏿"
⧩
Some 7
3 | > Text.size "foo👍🏿"
⧩
5
The 👍🏿 emoji is two codepoints (a base emoji plus a skin-tone modifier), so the answer should be Some 5.
Is this a bug in the Haskell text package? @stew is just calling through to this function: https://hackage.haskell.org/package/text-2.0.2/docs/Data-Text-Internal-Lazy-Search.html
Or maybe we're just using it wrong?
If it's working as intended, I don't understand what it's doing. It's interpreting "👍🏿" to have length 4, but it's two codepoints and 8 bytes. But 4 what?
What if you just call it in ghci? Does it still behave incorrectly?
Yeah
λ> indices (Text.pack "foo") (Text.pack "👍🏿foo")
[4]
Is this just due to the fact that Haskell strings are UTF-16?
https://hackage.haskell.org/package/text-2.0.2/docs/src/Data.Text.Lazy.html#breakOn - it looks like the indices are byte offsets.
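To pin down what the 4 and the 7 above could be counting, here is a rough sketch that measures a string in three candidate units: code points, UTF-16 code units, and UTF-8 bytes. It uses only base. The base emoji U+1F44D is an assumption chosen for illustration (the second code point is the skin-tone modifier U+1F3FF), and measure, utf16Units, and utf8Bytes are hypothetical helper names, not part of the text package.

```haskell
import Data.Char (ord)

-- Number of UTF-16 code units needed to encode one code point.
utf16Units :: Char -> Int
utf16Units c = if ord c >= 0x10000 then 2 else 1

-- Number of UTF-8 bytes needed to encode one code point.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- (code points, UTF-16 code units, UTF-8 bytes) for a whole string.
measure :: String -> (Int, Int, Int)
measure s = (length s, sum (map utf16Units s), sum (map utf8Bytes s))

-- Assumed stand-in for the emoji in the report: base emoji + skin-tone modifier.
emoji :: String
emoji = "\x1F44D\x1F3FF"
```

Under these assumptions, measure emoji is (2, 4, 8) and measure ("bar" ++ emoji) is (5, 7, 11), so the reported 4 and Some 7 line up with UTF-16 code units rather than with code points or UTF-8 bytes.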
@stew maybe implement in terms of Text.breakOn, it's the size of the first element of the pair.
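A minimal sketch of that suggestion, assuming strict Data.Text rather than the lazy variant the builtin calls through to. The name indexOf here is a hypothetical Haskell stand-in for the Unison builtin, not its actual implementation. The index is the length, in code points, of the prefix returned by breakOn, and the empty needle is guarded explicitly because breakOn raises an error on empty input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- Index (in code points) of the first occurrence of needle in haystack.
indexOf :: T.Text -> T.Text -> Maybe Int
indexOf needle haystack
  | T.null needle = Just 0                  -- guard: breakOn errors on an empty needle
  | T.null rest   = Nothing                 -- no match: breakOn returns (haystack, "")
  | otherwise     = Just (T.length prefix)  -- T.length counts code points, not code units
  where
    -- Lazy binding: breakOn is only evaluated once the empty-needle guard has failed.
    (prefix, rest) = T.breakOn needle haystack
```

With the same assumed emoji as in the report (a base emoji followed by U+1F3FF), searching for "foo" plus the emoji in "bar", emoji, "foo", emoji, "baz", emoji gives Just 5, and searching for the empty string gives Just 0.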
Text.breakOn, Text.breakOnEnd, and Text.breakOnAll would be great builtins to have. We could implement a fast Text.indexOf etc. in terms of those.
I like breakOnAll and indexOfEnd as new builtins. That could be for later though.
indexOf seems better than breakOn - you can implement breakOn using indexOf, and indexOf can have a direct implementation if we want to make it more efficient. In the cases where you're just Text.drop-ing up to that index, you avoid needlessly allocating a prefix you're just discarding.
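The direction described here can be sketched as follows, again in Haskell with strict Data.Text as an assumption. indexOf is a deliberately naive direct search, and breakOn' and dropToMatch are hypothetical names introduced for illustration; none of this is the builtin's actual code.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (findIndex)
import qualified Data.Text as T

-- Naive direct search: position, in code points, of the first suffix of the
-- haystack that starts with the needle. An empty needle matches at index 0.
indexOf :: T.Text -> T.Text -> Maybe Int
indexOf needle haystack = findIndex (needle `T.isPrefixOf`) (T.tails haystack)

-- breakOn recovered from indexOf: split at the match, or return the whole
-- haystack and an empty remainder when there is no match.
breakOn' :: T.Text -> T.Text -> (T.Text, T.Text)
breakOn' needle haystack =
  maybe (haystack, T.empty) (`T.splitAt` haystack) (indexOf needle haystack)

-- When only the text from the match onward is wanted, drop up to the index
-- directly; the discarded prefix is never built as a separate value.
dropToMatch :: T.Text -> T.Text -> Maybe T.Text
dropToMatch needle haystack = (`T.drop` haystack) <$> indexOf needle haystack
```

For example, breakOn' "foo" "barfoobaz" is ("bar", "foobaz"), while dropToMatch "foo" "barfoobaz" is Just "foobaz" without going through the pair.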
Yeah, if we can get indexOf to work correctly for text, that's ideal.
@runarorama fyi, don't know if you saw, but the bug has been fixed, so you can add / replace the existing functions.
Now there's a different bug:
> ##Text.indexOf "" "foo"
โฌ๏ธ
Encountered exception:
Data.Text.Lazy.breakOn: empty input
CallStack (from HasCallStack):
error
I can do a pure Unison check for the empty search string, but it really feels like the builtin should be doing this. The correct index of the empty string is 0.
Fixed in https://github.com/unisonweb/unison/pull/4101
Replaced the Unison definitions with the builtins and pushed to main.
There are now much more efficient implementations of these as builtins (see this PR).
I notice Text.indexOf exists but has a slow implementation. So maybe as simple as... Plus docs / tests. @runarorama if you want to assign this to @stew go ahead.