purescript / purescript-strings

String utility functions, Char type, regular expressions.
BSD 3-Clause "New" or "Revised" License
54 stars 71 forks source link

Efficient CodePoint indexing function? #155

Closed chtenb closed 2 years ago

chtenb commented 2 years ago

Should this library expose a function which allows constant-time codepoint indexing in string? This would only work in situations where you have the codeunit index of the codepoint handy, so the signature would look like

codePointAtIndexUnit :: Int -> String -> Maybe (Tuple CodePoint Int)

See also https://github.com/purescript-contrib/purescript-string-parsers/issues/77

garyb commented 2 years ago

Maybe would need an unsafe in the name too in case the passed index splits a codepoint in half? Maybe not, I guess that's an implied possibility whenever you have units in the mix.

natefaubion commented 2 years ago

It's basically a code point version of charAt, which is in Unsafe.

natefaubion commented 2 years ago

I think if you want this unsafe function, then you should just forgo the Maybe, eg.

-- | Retrieves a code point and the next starting code unit index from a code unit index.
-- | This assumes the index points to the start of the code point, and will throw when out of bounds.
codePointAt :: Int -> String -> Tuple CodePoint Int
codePointAt = ...
chtenb commented 2 years ago

There is also a safe charAt function isn't there? The signature with the Maybe would return Nothing in case of split code points.

Here is a reference implementation for the JS backend https://github.com/purescript-contrib/purescript-string-parsers/issues/77#issuecomment-1023857966

Is this library also meant for consumption by other backends? I'm not familiar with backends beside JS and I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS only.

natefaubion commented 2 years ago

I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS only.

Since this library binds to the platform String type, it's necessarily going to be a lot of FFI.

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

chtenb commented 2 years ago

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

Fair enough!

jamesdbrock commented 2 years ago

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself.

You can't maintain the invariant yourself because there are corner cases which make it impossible. For example, consider a malformed UTF-16 string which contains only one code unit, and that code unit is half of a surrogate pair. There is no way to read that code point successfully, and no way to know before you read that code point that reading it will fail.

It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

string-parsers currently has no FFI functions and we'd like to keep it that way so that any backend which implements strings will also be able to use the string-parsers library.

string-parsers also has this long-term problem with parsing code points which has resisted solution, see for example

A function like codePointAtIndexUnit would allow this problem to be finally solved, and I suspect that that function would also be useful in other domains. There are no other good answers to the commonplace question “how do I read a character out of a string index in O(n) time?” Astral-plane characters show up everywhere these days and all functions which return a Char are partial in the presence of astral-plan characters.

codePointAtIndexUnit is a bit of an awkward function though, and it might be awkward to implement that function for other backends. And we should probably discourage beginners from using it by tucking it back in the Unsafe module or wherever.

natefaubion commented 2 years ago

Is the problem just with the internal representation of string-parsers? From what I remember, it uses an index pointer and the whole string to act like a pseudo slice, which is where the problems are coming from (this works really well for code units). It's not clear to me if that's even optimal in the presence of code points. Is this function then necessary because you are trying to work around those internals instead of designing new internals?

jamesdbrock commented 2 years ago

... I meant to say “how do I read a character out of a string index in O(1) time?”

Is this function then necessary because you are trying to work around those internals instead of designing new internals?

Yes, exactly.

We could design new internals, like for example we could use normal string slicing. We've talked about that, it would work. So if you think this addition to strings it not worth it then we could do that instead.

natefaubion commented 2 years ago

We could design new internals, like for example we could use normal string slicing. We've talked about that, it would work. So if you think this addition to strings it not worth it then we could do that instead.

My personal opinion at first glance would be to benchmark an implementation that just tracks the tail of the string, and so something like anyDigit could be implementing with uncons. AFAICT, the regex stuff is going to slice the input string anyway so you aren't gaining anything interesting by the current representation.

chtenb commented 2 years ago

I'm closing this issue, since the change of representation in string-parsers seems to be a viable approach, which is what prompted this request. https://github.com/purescript-contrib/purescript-string-parsers/pull/83