Open cswiii opened 9 years ago
From gen_utf8:
Generate codepoints. The valid range of UTF-8 codepoints is 0x0-0x10FFFF, minus the following: 0xC0-0xC1, 0xF5-0xFF and 0xD800-0xDFFF. These 2061 invalid codepoints (2 + 11 + 2048) comprise 0.2% of 0x0-0x10FFFF. Thus, it should be OK to just check for invalid codepoints and generate new ones if need be.
I think adding an optional tuple parameter to gen_utf8 would be the best implementation. Then we could either remove the CJK and Cyrillic functions or shrink them down to just pass the correct tuple to gen_utf8.
So really, we should be able to pass all desired blocks in as a Python list, and then either treat them as a single range to rule them all, or simply choose a random character out of each block within the list.
Creating a list that contains all the characters in a given character set and pulling values out is not very streamy. We can find a way to generate a bunch of random-ish characters without creating a list containing tens/hundreds/whatever of thousands of characters and plucking characters out from it.
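A minimal sketch of the streaming idea, assuming Python 3 and a single contiguous range given as inclusive start/end codepoints. The name stream_random_chars and its parameters are hypothetical, not something that exists in the module today. It draws one codepoint at a time, so memory use does not grow with the size of the range:

```python
import itertools
import random


def stream_random_chars(start, end, rng=random):
    """Yield random characters from start..end (inclusive), indefinitely.

    Hypothetical helper: no list of candidate characters is built; each
    character comes from a freshly drawn codepoint.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    while True:
        yield chr(rng.randint(start, end))


# Take ten characters from the CJK Unified Ideographs block (U+4E00-U+9FFF).
sample = "".join(itertools.islice(stream_random_chars(0x4E00, 0x9FFF), 10))
```

The caller decides how many characters to consume (here via itertools.islice), which keeps the generator itself free of any length bookkeeping.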
We already have gen_cjk() and, per pull #63, might have gen_cyrillic().
If we wanted to support other scripts in the future (Tamil, Telugu, etc.), we can see how this would get cumbersome and duplicative very quickly.
It might be good to have a generic function that takes any specific range and plugs it in, and then wrap that with a function specific to the Unicode block you want to test.
e.g., instead of
...put this into a generate_unicode_range() function that can have codepoint values passed to it, and then use that inside a function for any desired Unicode block:
gen_bengali()
gen_hebrew()
gen_hiragana()
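One way the proposal might look, as a sketch only. The signature of generate_unicode_range() (inclusive start/end plus a length) is an assumption, since the issue only names the function. The block boundaries are the standard Unicode ones for Bengali, Hebrew and Hiragana; note that these blocks contain some unassigned codepoints, which this sketch does not filter out.

```python
import random


def generate_unicode_range(start, end, length=10):
    """Return `length` random characters with codepoints in start..end.

    Assumed signature: both bounds are inclusive integer codepoints.
    """
    if length < 0 or start > end:
        raise ValueError("invalid length or range")
    return "".join(chr(random.randint(start, end)) for _ in range(length))


# Thin wrappers: each one only supplies its block's boundaries.
def gen_bengali(length=10):
    return generate_unicode_range(0x0980, 0x09FF, length)


def gen_hebrew(length=10):
    return generate_unicode_range(0x0590, 0x05FF, length)


def gen_hiragana(length=10):
    return generate_unicode_range(0x3040, 0x309F, length)
```

Adding another script then costs one two-line wrapper rather than a copy of the generation logic.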
Now, there is a sticky wicket in all this. Some character sets span multiple, non-contiguous blocks. More details here:
http://en.wikipedia.org/wiki/Unicode_block
So really, we should be able to pass all desired blocks in as a Python list, and then either treat them as a single range to rule them all, or simply choose a random character out of each block within the list.
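Both options can be sketched without building a list of characters. The function names below are hypothetical, and blocks are assumed to be (start, end) tuples with inclusive bounds. Cyrillic serves as the example because it is spread over several blocks; the four listed here are standard Unicode blocks, though not necessarily every Cyrillic-related one.

```python
import random

CYRILLIC_BLOCKS = [
    (0x0400, 0x04FF),  # Cyrillic
    (0x0500, 0x052F),  # Cyrillic Supplement
    (0x2DE0, 0x2DFF),  # Cyrillic Extended-A
    (0xA640, 0xA69F),  # Cyrillic Extended-B
]


def pick_from_combined_range(blocks):
    """Treat the blocks as one virtual range: every codepoint is equally likely."""
    total = sum(end - start + 1 for start, end in blocks)
    offset = random.randrange(total)
    # Walk the blocks until the offset falls inside one of them.
    for start, end in blocks:
        size = end - start + 1
        if offset < size:
            return chr(start + offset)
        offset -= size
    raise AssertionError("offset exceeded the combined size of the blocks")


def pick_block_then_char(blocks):
    """Pick a block uniformly first, then a codepoint within that block."""
    start, end = random.choice(blocks)
    return chr(random.randint(start, end))
```

The two differ in distribution: the combined range favors large blocks in proportion to their size, while picking a block first gives small blocks the same weight as large ones.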