Closed chtenb closed 2 years ago
Maybe would need an unsafe
in the name too in case the passed index splits a codepoint in half? Maybe not, I guess that's an implied possibility whenever you have units in the mix.
It's basically a code point version of charAt
, which is in Unsafe
.
I think if you want this unsafe function, then you should just forgo the Maybe
, eg.
-- | Retrieves a code point and the next starting code unit index from a code unit index.
-- | This assumes the index points to the start of the code point, and will throw when out of bounds.
codePointAt :: Int -> String -> Tuple CodePoint Int
codePointAt = ...
There is also a safe charAt
function isn't there? The signature with the Maybe
would return Nothing in case of split code points.
Here is a reference implementation for the JS backend https://github.com/purescript-contrib/purescript-string-parsers/issues/77#issuecomment-1023857966
Is this library also meant for consumption by other backends? I'm not familiar with backends beside JS and I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS only.
I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS only.
Since this library binds to the platform String type, it's necessarily going to be a lot of FFI.
I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?
I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?
Fair enough!
I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself.
You can't maintain the invariant yourself because there are corner cases which make it impossible. For example, consider a malformed UTF-16 string which contains only one code unit, and that code unit is half of a surrogate pair. There is no way to read that code point successfully, and no way to know before you read that code point that reading it will fail.
It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?
string-parsers currently has no FFI functions and we'd like to keep it that way so that any backend which implements strings will also be able to use the string-parsers library.
string-parsers also has this long-term problem with parsing code points which has resisted solution, see for example
A function like codePointAtIndexUnit
would allow this problem to be finally solved, and I suspect that that function would also be useful in other domains. There are no other good answers to the commonplace question “how do I read a character out of a string index in O(n) time?” Astral-plane characters show up everywhere these days and all functions which return a Char
are partial in the presence of astral-plan characters.
codePointAtIndexUnit
is a bit of an awkward function though, and it might be awkward to implement that function for other backends. And we should probably discourage beginners from using it by tucking it back in the Unsafe
module or wherever.
Is the problem just with the internal representation of string-parsers? From what I remember, it uses an index pointer and the whole string to act like a pseudo slice, which is where the problems are coming from (this works really well for code units). It's not clear to me if that's even optimal in the presence of code points. Is this function then necessary because you are trying to work around those internals instead of designing new internals?
... I meant to say “how do I read a character out of a string index in O(1) time?”
Is this function then necessary because you are trying to work around those internals instead of designing new internals?
Yes, exactly.
We could design new internals, like for example we could use normal string slicing. We've talked about that, it would work. So if you think this addition to strings it not worth it then we could do that instead.
We could design new internals, like for example we could use normal string slicing. We've talked about that, it would work. So if you think this addition to strings it not worth it then we could do that instead.
My personal opinion at first glance would be to benchmark an implementation that just tracks the tail of the string, and so something like anyDigit could be implementing with uncons
. AFAICT, the regex stuff is going to slice the input string anyway so you aren't gaining anything interesting by the current representation.
I'm closing this issue, since the change of representation in string-parsers
seems to be a viable approach, which is what prompted this request. https://github.com/purescript-contrib/purescript-string-parsers/pull/83
Should this library expose a function which allows constant-time codepoint indexing in string? This would only work in situations where you have the codeunit index of the codepoint handy, so the signature would look like
See also https://github.com/purescript-contrib/purescript-string-parsers/issues/77