Arbitrary Strings Not Properly Encodable.

AlexaDeWit commented 6 years ago

Hi Folks, I have ongoing work to deal with string encoding for the purpose of supporting cryptography in purescript, and I am attempting to isolate the cause of the problem I encounter.

I suspect currently that the cause of the problem is actually in the way that quickcheck generates strings via Arbitrary. That is to say, that generating randomly ordered CodeUnits (purescript Char), is not truly compliant with the way UTF-16 strings should be encoded. If we wanted to join randomly generated strings, these should be the (potenitially) multi code-unit strings, aka CodePoints.

I think the best solution would be to redefine arbitrary string such that it is generated from Code Points, not Code Units. Failing this however, a "quick and dirty" solution would be to redefine the char range used to generate strings to always fall within the U+0000 to U+D7FF range. (Represented identically between their code point and code unit).

I will be happy over the weekend to write the PR should I find time and there are no objections raised here. It will also be used to verify my theory by using it with existing text encoding work I've done.

garyb commented 6 years ago

PR would be gladly accepted!

There's Data.Char.Gen and Data.String.Gen in -strings that provide generators over an abstracted MonadGen, so perhaps we could do the fix there, and ensure the Arbitrary implementations here (and in -strongcheck, I saw the issue there too) are implemented with those instead?

AlexaDeWit commented 6 years ago

that would be a good approach, and is infact what strongcheck does(which is also affected by this issue).

A though I posted on the strongcheck issue is: I've been thinking about this a little today though...

I feel an argument could be made that quickCheck -should- throw messed up strings at things for the sake of comprehensiveness, rather than just text units.

Generating "clean well formatted" strings is considerably more complicated, given the utf-16 standard.

Perhaps in a way, a decision should be taken about whether arbitrary strings should be well formatted, or a series of arbitrary code units.

Crosslink, since both libs are affected: purescript-contrib/purescript-strongcheck#54

hdgarrood commented 6 years ago

Generating "clean well formatted" strings is considerably more complicated, given the utf-16 standard.

I am not sure that this is true. I think we could generate an arbitrary array of code points and then call fromCodePointArray without very much difficulty.

I think both behaviours are useful in different cases, so I’m not sure which one to use as the default String instance. In fact, to me, this issue is just another example of why I don’t like the Arbitrary type class.

AlexaDeWit commented 6 years ago

Yeah, I agree with you. I am used to working with Gens in ScalaCheck instead, and having many of the same type.

Perhaps we can look at providing a "well formatted" MonadGen value and add a note in the docs or something.

And yeah, I know how much of a tableflip this is to implement, as I've been fighting my way through this problem for the last week or so, slowly digging further into utf-16, quickcheck, the purescript string library, and so on. I think it's doable though, maybe constraining the values to only renderable characters. I won't have time to do it this weekend myself I'm afraid, which is part of why I made an Issue instead of PR XD

hdgarrood commented 6 years ago

There’s an argument to be made that such a generator should not take that sort of thing into account. See https://www.unicode.org/versions/corrigendum9.html ; since non-characters are permitted in interchange and do not cause text to become ill-formed, they should probably be included in test cases too.

AlexaDeWit commented 6 years ago

I would agree with that, but my concern is more focused on specifically the unpaired code units.

Unpaired code units "are just plain fucky yo"

hdgarrood commented 6 years ago

If you generate some code points and then use fromCodePointArray I think that ought to prevent any lone surrogates appearing.

AlexaDeWit commented 6 years ago

That was my initial plan before I realized a PR to arbitrary would involve way more work than expected. If I have time next week I can write it in terms of a Gen though, unless someone else takes care of it.

athanclark commented 4 years ago

My go-to has just been making the following:

genChar :: Gen Char
genChar = toEnumWithDefaults bottom top <$> withoutControlSeq
  where
    withoutControlSeq = chooseInt 0 (0xD800 - 1) <|> chooseInt (0xDFFF + 1) 65536

genString :: Gen String
genString = fromCharArray <$> arrayOf genChar

Kinda stinks having it leak into every major project though, and it's an oddity in that random code points will work for all strings when testing with argonaut, yet when encoded/decoded from an arraybuffer for instance, stuff will breakdown. IIRC it's due to the control sequence that utf8 has baked-in.

purescript / purescript-quickcheck

Arbitrary Strings Not Properly Encodable. #94