String DSL should support valid UTF-8

dcapwell commented 5 years ago

I find that if I use basicMultilingualPlaneAlphabet from the string DSL, I get back invalid UTF-8. To get a generator of valid UTF-8 strings, I have the following in my code:

public static final Gen<String> UTF_8_GEN =
        SourceDSL.strings()
                .basicMultilingualPlaneAlphabet()
                .ofLengthBetween(0, 1024)
                .map(s -> new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));

This conversion to bytes and back replaces every invalid code point (such as an unpaired surrogate) with '?', so the resulting string is always valid UTF-8.
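
A minimal sketch of what that round trip does to a single unpaired surrogate, using only the JDK. The class and method names are made up for illustration. String.getBytes(Charset) substitutes the charset's replacement bytes for malformed input, which for UTF-8 is a '?'.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encode to UTF-8 bytes and decode again, like the map step in the generator.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String valid = "h\u00e9llo";          // only well-formed characters
        String loneSurrogate = "a\uD800b";    // high surrogate without a low surrogate

        System.out.println(roundTrip(valid).equals(valid));                 // true
        System.out.println(roundTrip(loneSurrogate));                       // a?b
        System.out.println(roundTrip(loneSurrogate).equals(loneSurrogate)); // false
    }
}
```

Note that the length of the string is preserved here, because the bad char is replaced rather than removed.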

jlink commented 5 years ago

I'm not quite sure what you mean by "I get back invalid UTF-8". As far as I understand it, Java uses UTF-16 to encode strings internally, and you would specify a charset only when translating to bytes.

dcapwell commented 5 years ago

Sorry for not replying for a long time.

I have a lot of use cases which deal with serialization, so I want to make sure UTF-8 strings are serialized and deserialized without loss; there is a large assumption in most of my code that the original string is valid UTF-8. What I find when I use the code above is that deserializing the string returns a different value, so the two strings are no longer .equals(o).

Looking up the UTF-8 code points, I see the max defined UTF-8 value is 99k but StringDSL defines 65k. I could totally be reading everything wrong (I use UTF-8, I don't know the spec at all =D), but that would imply to me that I should always get back valid UTF-8 chars; yet for some reason the string comes back as invalid UTF-8 and the Charset will drop some chars.

My common use case is to deal with UTF-8 strings so I tend to define the generator above in every project.

jlink commented 5 years ago

I dug a bit deeper into the problem and checked for which code points the round-trip conversion does not produce the same chars. The smallest one I found was 0xD800, which is the beginning of an area where Unicode currently has no defined characters (see https://unicode-table.com). So the phenomenon will be the same when using UTF-16, for example.
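
The check described above could look roughly like this. It is a sketch using only the JDK, with hypothetical class and method names, and it only scans single chars in the basic multilingual plane. U+D800 through U+DFFF is the surrogate range, which is why a char from it cannot be encoded on its own.

```java
import java.nio.charset.StandardCharsets;

class FirstLossyCodePoint {
    // True if the single char survives String -> UTF-8 bytes -> String unchanged.
    static boolean roundTrips(int codePoint) {
        String s = String.valueOf((char) codePoint);
        String back = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        return s.equals(back);
    }

    // Smallest BMP value that does not round-trip, or -1 if all of them do.
    static int firstLossy() {
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (!roundTrips(cp)) {
                return cp;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.printf("0x%04X%n", firstLossy()); // 0xD800
    }
}
```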

So maybe a better approach than providing a specialised UTF-8 generator could be to (optionally) filter out all code points that have no defined character in Unicode, e.g. like this:

SourceDSL.strings()
        .basicMultilingualPlaneAlphabet()
        .ofLengthBetween(0, 1024)
        .acceptOnlyValid("utf-8")
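
As a rough idea of the check such a filter might perform: acceptOnlyValid is only a proposal here, and the helper below is a hypothetical stand-in written against the JDK alone. It asks a fresh CharsetEncoder whether the whole string can be encoded in the named charset, which is one possible reading of "valid".

```java
import java.nio.charset.Charset;

class ValidFor {
    // Hypothetical predicate: true if every char sequence in s is encodable in the charset.
    // A new encoder is created per call because CharsetEncoder is stateful.
    static boolean isEncodable(String s, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isEncodable("abc", "utf-8"));          // true
        System.out.println(isEncodable("a\uD800b", "utf-8"));     // false, unpaired surrogate
        System.out.println(isEncodable("\uD83D\uDE00", "utf-8")); // true, proper surrogate pair
    }
}
```

Strings that fail the predicate would be discarded instead of rewritten, so the generated values stay exactly what the alphabet produced.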

quicktheories / QuickTheories

String DSL should support valid UTF-8 #54