unisonweb / base

Unison base libraries
https://share.unison-lang.org/@unison/base
18 stars 6 forks source link

Add aliases for `Bytes.indexOf` and `Text.indexOf` #165

Closed pchiusano closed 1 year ago

pchiusano commented 1 year ago

There's now much more efficient implementations of these as builtins (see this PR)

Bytes.indexOf : Bytes -> Bytes -> Optional Nat
Text.indexOf : Text -> Text -> Optional Nat

I notice Text.indexOf exists but has a slow implementation. So maybe as simple as...

replace Text.indexOf ##Text.indexOf
alias.term ##Bytes.indexOf Bytes.indexOf

Plus docs / tests. @runarorama if you want to assign this to @stew go ahead.

runarorama commented 1 year ago

Text.indexOf looks like it has a bug for multi-byte characters:

    1 | > ##Text.indexOf "foo๐Ÿ‘Š๐Ÿฟ" "bar๐Ÿ‘Š๐Ÿฟfoo๐Ÿ‘Š๐Ÿฟbaz๐Ÿ‘Š๐Ÿฟ"
          โงฉ
          Some 7

    3 | > Text.size "foo๐Ÿ‘Š๐Ÿฟ"
          โงฉ
          5

The ๐Ÿ‘Š๐Ÿฟ character is two codepoints, so the answer should be Some 5

pchiusano commented 1 year ago

Is this a bug in the Text package?? @stew is just calling through to that... this function: https://hackage.haskell.org/package/text-2.0.2/docs/Data-Text-Internal-Lazy-Search.html

Or maybe we're just using it wrong?

runarorama commented 1 year ago

If it's working as intended, I don't understand what it's doing. It's interpreting "๐Ÿ‘Š๐Ÿฟ" to have length 4, but it's two codepoints and 8 bytes. But 4 what?

pchiusano commented 1 year ago

What if you just call it in ghci? Does it still behave incorrectly?

runarorama commented 1 year ago

Yeah

ฮป> indices (Text.pack "foo") (Text.pack "๐Ÿ‘Š๐Ÿฟfoo")
[4]

Is this just due to the fact that Haskell strings are UTF-16?

pchiusano commented 1 year ago

https://hackage.haskell.org/package/text-2.0.2/docs/src/Data.Text.Lazy.html#breakOn - it looks like the indices are byte offsets.

@stew maybe implement in terms of Text.breakOn, it's the size of the first element of the pair.

runarorama commented 1 year ago

Text.breakOn, Text.breakOnEnd, and Text.breakOnAll would be great builtins to have. We could implement a fast Text.indexOf etc. in terms of those.

pchiusano commented 1 year ago

I like breakOnAll and indexOfEnd as new builtins. That could be for later though.

indexOf seems better than breakOn - you can implement breakOn using indexOf, and indexOf can have a direct implementation if we want to make it more efficient. In the cases where you're just Text.drop-ing up to that index, you avoid needlessly allocating a prefix you're just discarding.

runarorama commented 1 year ago

Yeah, if we can get indexOf to work correctly for text, that's ideal.

pchiusano commented 1 year ago

@runarorama fyi, don't know if you saw, but the bug has been fixed, so you can add / replace the existing functions.

runarorama commented 1 year ago

Now there's a different bug:

> ##Text.indexOf "" "foo"

โฌ‡๏ธ

Encountered exception:
Data.Text.Lazy.breakOn: empty input
CallStack ( from HasCallStack ):
    error

I can do a pure Unison check for the empty search string, but it really feels like the builtin should be doing this. The correct index of the empty string is 0.

runarorama commented 1 year ago

Fixed in https://github.com/unisonweb/unison/pull/4101

Replaced the Unison definitions with the builtins and pushed to main.