Unify CodePoints and CodeUnits parsers

hdgarrood commented 5 years ago

Arising out of discussion in #46.

Right now, we have parsers in the modules Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits, which use the same Parser data type, except that the former treats the integer pos field in the parser state as the number of code points we have consumed in the string being parsed, and the latter treats the pos field as the number of code units. This can cause problems if they are mixed:

> runParser (Tuple <$> CP.string "🐱" <*> CP.anyChar) "🐱hi"
(Right (Tuple "🐱" 'h'))

> runParser (Tuple <$> CP.string "🐱" <*> CU.anyChar) "🐱hi"
(Right (Tuple "🐱" '�'))

Addtionally, storing an index into code points is not really justifiable from a performance perspective, since indexing into a string using code points is an O(n) operation, where n is the index; it requires looking at every code point in the string up to the given index.

If we compare the APIs exported by the Text.Parsing.StringParser.Code{Units,Points} modules, they are basically the same; in particular, the CodePoints parsers still use Char almost everywhere, which limits their utility quite severely. As far as I can tell, the only difference between the CodePoints and CodeUnits parsers (now that #46 has been merged) is that the CodePoints ones will fail rather than splitting up surrogate pairs.

I think the ideal solution would be to do the following:

Say that the pos field in the parser state always counts code units
Unify Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to Text.Parsing.StringParser.
Clearly demarcate parsers which have the ability to split up surrogate pairs, like anyChar. This could be done with doc-comments or we could move them into their own module Text.Parsing.StringParser.CodeUnits.
Provide CodePoint-based alternatives to any of the parsers which are currently based on Char, so that it is possible to do everything you might want to do without having to resort to using parsers like anyChar which can split up surrogate pairs.

hdgarrood commented 5 years ago

I suppose another option is to continue in the tradition of the CodePoints parsers in that we architect the Parser data type such that it's not possible to split up a surrogate pair, but we come up with a smarter way of representing the internal parser state so that we don't have to sacrifice so much performance. A lazily generated array of code points perhaps?

I think the wider problem, that is, efficiently iterating over strings without making use of the underlying encoding, is something which probably ought to be solved further up in the ecosystem, preferably in the core libraries.

github-actions[bot] commented 4 years ago

This issue is stale because it has been open for 60 days with no activity. Remove the stale label or comment to keep this issue open. Otherwise, this issue will be closed in 14 days.

thomashoneyman commented 3 years ago

Thanks for reopening!

thomashoneyman commented 3 years ago

We may not be able to get this in for 0.14 due to a backlog of other work, so I'm going to remove the label.

JordanMartinez commented 3 years ago

Since this repo is in a weird state, would it be better to revert #59?

hdgarrood commented 3 years ago

I think for now, probably yes. After playing around a bit, I think it might be ideal longer term to keep the two separate modules but replace Char with CodePoint in the CodePoints module.

jamesdbrock commented 2 years ago

I think the ideal solution would be to do the following:

Say that the pos field in the parser state always counts code units

I have a suggestion for how to do that in #77 but it would require a new foreign function, and that new foreign function should probably be in purescript-strings. I actually refactored purescript-parsing to use this technique and the test suite passed, but performance didn't improve, which was what I wanted, so I'm not planning to merge that branch soon.

Unify Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to Text.Parsing.StringParser.

Clearly demarcate parsers which have the ability to split up surrogate pairs, like anyChar. This could be done with doc-comments or we could move them into their own module Text.Parsing.StringParser.CodeUnits.

Provide CodePoint-based alternatives to any of the parsers which are currently based on Char, so that it is possible to do everything you might want to do without having to resort to using parsers like anyChar which can split up surrogate pairs.

These other points are exactly how purescript-parsing works since v7.0, so for an example of how this could be implemented see https://pursuit.purescript.org/packages/purescript-parsing/docs/Text.Parsing.Parser.String

jamesdbrock commented 2 years ago

Keep an eye on this issue https://github.com/purescript/purescript/issues/3662

chtenb commented 2 years ago

Is there a good reason to keep the CodeUnits versions of the character parsers around? As far as I can tell the only conceptual difference between the CodeUnits and CodePoints character parsers is the way that anyChar is implemented. For CodeUnits it will just look one byte ahead, while the CodePoints version looks a CodePoint ahead. It seems to me that the latter covers all use cases of the former, isn't that right? (Unless you want to parse a bytestring specifically, but then you'd probably not want to use this library since it is about parsing strings and you'd be misusing the String datatype imho).

JordanMartinez commented 2 years ago

I think the issue is that the position in the string becomes incorrect if you use one anyChar and later use the other anyChar.

chtenb commented 2 years ago

That's an issue with the implementation, and I think we can fix that. But I mean from the users perspective, would there be a reason to prefer parsing code units over code points, ever?

JordanMartinez commented 2 years ago

I'm not sure. I always get code units and code points mixed up and have to re-read what their difference is every time it's discussed.

hdgarrood commented 2 years ago

Yes, I think there are two reasons to keep the CodeUnits versions around:

You will probably want to use CodeUnits if performance is your primary concern. I think there probably are some opportunities for optimising the CodePoints string functions further, but they are always going to have some overhead compared to CodeUnits, at least for the JS backend, as CodeUnits is closest to what JS exposes underneath.
If you want to deal with lone surrogates or be able to split up surrogate pairs, CodeUnits is what you want; these aren’t concepts that really exist at the CodePoints level. This is rare, admittedly.

chtenb commented 2 years ago

I think that the performance argument will be nullified by #83 if we go through with that. Your second point is valid conceptually, but we could just expose different functions for each case then, i.e. anyCodeUnit and anyCodePoint. Currently the derivatives like anyLetter all are specific for either the CodeUnit or the CodePoint case, but this is unnecessary once we unify the representation as proposed in #83 (second comment)

hdgarrood commented 2 years ago

I am not confident that the performance argument will be nullified. A CodePoints-based parser will always be a bit slower than a CodeUnits-based one because CodePoints will always have the extra overhead of PureScript code wrapping it; the issue that the CodePoints parsers in this library are quadratic is separate from that.

chtenb commented 2 years ago

Taking #83 into account, what do you mean exactly with a parser that is CodePoint-based? As far as I can tell running the parser string "abc" does not involve creating purescript CodePoint objects in any way.

hdgarrood commented 2 years ago

It might not at the moment, but it probably should. I would argue that a CodePoints-based version of string should not be able to split a surrogate pair if its argument contains a lone surrogate - I’m pretty sure that the CodeUnits version will split surrogate pairs when asked to.

chtenb commented 2 years ago

What would you expect the CodePoints-based version of string to do if its argument contained a lone surrogate? Always fail? I think that is equivalent to doing input validation on the argument. This could be put on top of the unsafe string parser to give a codePointString parser as an extra safety net for when you don't know the argument statically or something. The same is true for regex I think. Likewise, char, satisfy, oneOf and noneOf we would probably need two versions, one which allows parsing lone surrogates and one which doesn't. (If I'm correct the purescript Char type puts no restriction on what unit it can contain, i.e. it does not exclude U+D800 to U+DFFF) For parsers like anyDigit, anyLetter, etc this wouldn't be necessary, because a digit code unit cannot also be a lone surrogate.

Anyway, these would be new features, because the parsers in the current CodePoint module don't do any of this. I'm also not sure how useful these extra safe versions are in practice, because in the use cases I've seen for this library the arguments were always known at compile time. Imho the only useful parser to split up is anyChar, so we'd have anyCodeUnit and anyCodePoint.

hdgarrood commented 2 years ago

What would you expect the CodePoints-based version of string to do if its argument contained a lone surrogate? Always fail?

No, I’d expect it to successfully parse only if the string it was parsing also contained the same lone surrogate. Generally, I think the CodePoints parsers should behave as if you’ve converted the string you’re parsing into an array of code points - and the way we (and IIRC most UTF-16 implementations) handle lone surrogates is to treat them as a code point with the same scalar value, even though those ranges aren’t assigned.

chtenb commented 2 years ago

In that case I think I'm missing something. Could you give an example of a string parser (an argument and an input string) for which you would expect a CodePoint version and a CodeUnit version to yield different parse results?

hdgarrood commented 2 years ago

Given the input "🍔🍺" and a parser string "🍔\xd38c", I think a CodeUnits-based parser should match and consume the first three code units, and leave the last ('\xdf7a') unconsumed, whereas a CodePoints-based one should fail, as I’d argue CodePoints-based parsers should never split up surrogate pairs. (This is different to what I said a couple of years ago in https://github.com/purescript-contrib/purescript-string-parsers/pull/59#issuecomment-699223093, so I guess I’ve changed my mind.)

chtenb commented 2 years ago

And I assume the CodePoints-based parser string "🍔\xd38c" should succeed if the input was exactly "🍔\xd38c"?

I can see why you would want the guarantee to never split up surrogate pairs. I would want to take this a step further and forbid parsing lone surrogates too in the CodePoints-based string parser.

This makes the behavior of the parser more predictable and easier to explain.
It is easier to implement. We just use the CodeUnits based string parser and add a check that the argument is valid encoded, i.e. valid UTF16 for the JS backend and valid UTF8 for other backends.
My hunch is that this will improve performance as well, because we wouldn't have to meddle with purescript CodePoint objects.

If you write a codepoints-based parser and you want to be able to deal with invalid encoded input, we could provide parsers like anyLoneSurrogate and loneSurrogate c in a UTF16 module. These would never split up a pair and be completely safe to use in a CodePoints context. This makes everything more explicit, and therefore more clear.

To summarize my proposed reasoning, in the CodePoints context

string, regex would only accept valid encoded strings as arguments.
anyChar should parse any valid code point in the BMP, but exclude the UTF16 reserved points. This would be compatible across backends, since it is in accordance with the Unicode spec.
Likewise, char, oneOf and noneOf would only accept code points in the BMP except the UTF16 reserved points.

In the CodeUnits context

string, regex would accept any string as arguments
anyChar should be replaced with anyCodeUnit. In a UTF8 context this would be a byte, in UTF16 a word. I'm note sure what purescript type facilitates this, because Char seems to cover the entire BMP.
Likewise, char, oneOf and noneOf should be replaced with codeUnit variants.

hdgarrood commented 2 years ago

And I assume the CodePoints-based parser string "🍔\xd38c" should succeed if the input was exactly"🍔\xd38c"?

Yes, and it should also succeed if the input starts with "🍔\xd38c" and is then followed by anything other than the second code unit of a surrogate pair.

My objection to the approach of forbidding lone surrogates entirely would be that it differs from how the purescript strings library currently works, and it also differs from how JS strings work (and indeed most UTF-16 implementations).

chtenb commented 2 years ago

It does indeed deviate a bit from the spirit of the purescript strings library. However, I would argue that the current incarnation of the strings library is too UTF16 centric and that it's abstractions do not carry well across purescript backends. This library aims to be cross-backend compatible (albeit failing at that currently), and I think we should try to improve in that area rather than try to stick to the way things are.

So for this particular example, I think that parsing malformed UTF16 is not something that has a place in an generic CodePoint-based string parsing module, just like we wouldn't want to deal with parsing malformed UTF8 if we had a backend that used that representation internally. But we can provide a CodeUnit based parser that is able to do this, just like we do now, and optionally provide some UTF16 specific lone surrogate parsers like I proposed earlier. But in every day practice, malformed strings are not something you want to parse successfully.

It's fairly easy to implement these CodePoint-safe versions of the parser primitives, we'd just need a function from purescript-strings that let's us validate the encoding of a given string. This function does not exist at the moment it seems.

purescript-contrib / purescript-string-parsers

Unify CodePoints and CodeUnits parsers #48