krestenkrab / triq

Trifork QuickCheck
http://krestenkrab.github.com/triq/
Apache License 2.0

unicode_binary(1) rarely returns 1- or 2-byte binaries #49

Closed (essen closed this issue 9 years ago)

essen commented 9 years ago

Just putting it out there for a later improvement.

When running triq_dom:sample(triq_dom:unicode_binary(1)), one gets almost exclusively 4-byte binaries, with a few 3-byte binaries.

Example:

106> triq_dom:sample(triq_dom:unicode_binary(1)).
[<<242,153,166,186>>,
 <<243,157,170,177>>,
 <<240,179,171,147>>,
 <<240,184,166,169>>,
 <<231,152,144>>,
 <<243,136,133,145>>,
 <<242,181,161,160>>,
 <<242,130,177,181>>,
 <<240,151,145,137>>,
 <<240,157,162,145>>,
 <<241,182,145,142>>]

This is not as random as it should be. The first byte almost always falls in the same narrow range (the lead byte of a 3- or 4-byte sequence). People who generate a Unicode binary often do so to check that their code can parse or validate UTF-8. If the generator only produces 3- or 4-byte values, the 1- and 2-byte cases never get exercised.

essen commented 9 years ago

Seems like this is just sample returning 3- and 4-byte binaries because of the sample size. Sorry for the noise!
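
For anyone landing here later, a rough sketch of why a small sample can look this skewed. It only uses the standard unicode module and plain arithmetic on the UTF-8 length ranges. The four code points are arbitrary examples, and whether triq actually draws code points uniformly is an assumption, not something checked against the triq source.

```erlang
%% UTF-8 length for one arbitrary code point from each length class.
1> [byte_size(unicode:characters_to_binary([C])) || C <- [16#41, 16#E9, 16#20AC, 16#1F600]].
[1,2,3,4]
%% Number of valid code points per UTF-8 length (the 3-byte range excludes
%% the 16#800 surrogates D800..DFFF).
2> [{1, 16#80}, {2, 16#800 - 16#80}, {3, 16#10000 - 16#800 - 16#800}, {4, 16#110000 - 16#10000}].
[{1,128},{2,1920},{3,61440},{4,1048576}]
```

If code points were drawn uniformly from the whole range (again, an assumption), about 94% of them would encode to 4 bytes, about 5.5% to 3 bytes, and under 0.2% to 1 or 2 bytes, so a sample of 11 values would rarely contain a short one.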