generic method for all different types of unicode string gens

cswiii commented 9 years ago

We already have gen_cjk() and, per pull #63, might have gen_cyrillic().

If we wanted to, in the future, support other scripts (Tamil, Telugu, etc.), we can see where this would get very cumbersome and duplicative, very quickly.

It might be good to have some generic function that takes any specific range and plugs it in, and then wrap that with a function specific to the unicode block you want to test.

e.g., instead of

     codepoints = [random.randint(0x4E00, 0x9FCC) for _ in range(length)]
     try:
         # (undefined-variable) pylint:disable=E0602
         output = u''.join(unichr(codepoint) for codepoint in codepoints)
     except NameError:
         output = u''.join(chr(codepoint) for codepoint in codepoints)
     return _make_unicode(output)

...put this into a generate_unicode_range() function that can have codepoint values passed to it, and then use that inside a function for any desired unicode block...

gen_bengali(), gen_hebrew(), gen_hiragana()
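A minimal sketch of that split, assuming Python 3 only (so no unichr fallback and no _make_unicode call). The name generate_unicode_range() comes from the suggestion above; its signature and the wrapper bodies are hypothetical. The CJK range is the one from the snippet above, and the Hebrew and Hiragana ranges are the standard Unicode block boundaries.

```python
import random


def generate_unicode_range(length, start, end):
    """Return `length` random characters with codepoints in [start, end].

    Hypothetical helper: the signature is an assumption, not existing API.
    """
    return ''.join(chr(random.randint(start, end)) for _ in range(length))


def gen_cjk(length=10):
    # Same range as the existing snippet: 0x4E00-0x9FCC.
    return generate_unicode_range(length, 0x4E00, 0x9FCC)


def gen_hebrew(length=10):
    # Hebrew block: U+0590-U+05FF.
    return generate_unicode_range(length, 0x0590, 0x05FF)


def gen_hiragana(length=10):
    # Hiragana block: U+3040-U+309F.
    return generate_unicode_range(length, 0x3040, 0x309F)
```

Each wrapper is then a one-liner, e.g. gen_hebrew(20). Note that a block can contain unassigned codepoints, so some of the generated characters may have no glyph.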

Now, there is a sticky wicket in all this. Some character sets span multiple, non-contiguous blocks. More details here:

http://en.wikipedia.org/wiki/Unicode_block

So really, we should be able to pass all desired blocks in as a Python list, and then either build a single range to rule them all, or simply choose a random character out of each block within the list.
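One way the "list of blocks" idea could look, as a rough sketch rather than a proposal for the final API: pick a block for each character, weighted by block size so every codepoint is equally likely, then pick a codepoint inside it. The function name gen_from_blocks() and the KATAKANA_BLOCKS constant are hypothetical. Katakana is used as the example because it spans two non-contiguous blocks (Katakana and Katakana Phonetic Extensions). This assumes Python 3.6+ for random.choices().

```python
import random

# Inclusive (start, end) pairs. Katakana lives in two separate blocks.
KATAKANA_BLOCKS = [(0x30A0, 0x30FF), (0x31F0, 0x31FF)]


def gen_from_blocks(length, blocks):
    """Return `length` random characters drawn from a list of blocks.

    Hypothetical helper. Blocks are weighted by size so that larger
    blocks are not under-represented.
    """
    weights = [end - start + 1 for start, end in blocks]
    chosen = random.choices(blocks, weights=weights, k=length)
    return ''.join(chr(random.randint(start, end)) for start, end in chosen)
```

A gen_katakana() wrapper would then just call gen_from_blocks(length, KATAKANA_BLOCKS).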

Ichimonji10 commented 9 years ago

> Some character sets span multiple, non-contiguous blocks.

From gen_utf8:

Generate codepoints. The valid range of UTF-8 codepoints is 0x0-0x10FFFF, minus the following: 0xC0-0xC1, 0xF5-0xFF and 0xD800-0xDFFF. These 2061 invalid codepoints (2 + 11 + 2048) comprise 0.2% of 0x0-0x10FFFF. Thus, it should be OK to just check for invalid codepoints and generate new ones if need be.
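The "check and generate a new one" approach described there is plain rejection sampling. A small sketch, assuming Python 3; the excluded ranges are copied as-is from the quoted text, and the function names are hypothetical:

```python
import random

# Excluded ranges exactly as listed in the quoted text (inclusive).
INVALID_RANGES = ((0xC0, 0xC1), (0xF5, 0xFF), (0xD800, 0xDFFF))


def is_invalid(codepoint):
    """True if the codepoint falls in one of the excluded ranges."""
    return any(low <= codepoint <= high for low, high in INVALID_RANGES)


def random_valid_codepoint():
    """Draw codepoints until one lands outside the excluded ranges."""
    while True:
        codepoint = random.randint(0x0, 0x10FFFF)
        if not is_invalid(codepoint):
            return codepoint


def gen_utf8_sketch(length=10):
    # Hypothetical name; not the library's gen_utf8().
    return ''.join(chr(random_valid_codepoint()) for _ in range(length))
```

Since the excluded ranges cover 2061 of the 0x110000 possible values, a retry is needed for only about one draw in 540, so the loop almost always exits on the first pass.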

JacobCallahan commented 9 years ago

I think adding an optional tuple parameter to gen_utf8 would be the best implementation. Then we could either remove the CJK and Cyrillic functions or shrink them down to just pass the correct tuple to gen_utf8.
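Roughly what that could look like, as a sketch only. The parameter name codepoint_range and both function names are made up for illustration, and this is Python 3 only. Surrogates (0xD800-0xDFFF) are skipped because they cannot be encoded as UTF-8.

```python
import random


def gen_utf8_with_range(length=10, codepoint_range=(0x0, 0x10FFFF)):
    """Hypothetical gen_utf8 variant taking an inclusive (start, end) tuple."""
    start, end = codepoint_range
    chars = []
    while len(chars) < length:
        codepoint = random.randint(start, end)
        if 0xD800 <= codepoint <= 0xDFFF:
            # Surrogate: draw again. A range lying entirely inside the
            # surrogates would loop forever, so callers must avoid that.
            continue
        chars.append(chr(codepoint))
    return ''.join(chars)


def gen_cjk_via_range(length=10):
    # The CJK function shrinks to passing the right tuple.
    return gen_utf8_with_range(length, (0x4E00, 0x9FCC))
```

A single tuple only covers one contiguous range, so scripts that span several blocks would need a list of tuples instead.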

Ichimonji10 commented 9 years ago

> So really, we should be able to pass all desired blocks into a python list, and then either make a single range to rule them all, or simply the ability to choose a random character out of each block within the list.

Creating a list that contains all the characters in a given character set and pulling values out is not very streamy. We can find a way to generate a bunch of random-ish characters without creating a list containing tens/hundreds/whatever of thousands of characters and plucking characters out from it.
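For instance, one possible way to stay "streamy" is to treat the blocks as a single virtual range and map a random offset back to a codepoint, so memory use depends on the number of blocks rather than the number of characters in them. A sketch assuming Python 3, with hypothetical names:

```python
import itertools
import random


def iter_random_chars(blocks):
    """Lazily yield random characters from inclusive (start, end) blocks.

    Only the block sizes are stored; no list of characters is ever built.
    """
    sizes = [end - start + 1 for start, end in blocks]
    total = sum(sizes)
    while True:
        # Offset into the imaginary concatenation of all blocks.
        offset = random.randrange(total)
        for (start, _end), size in zip(blocks, sizes):
            if offset < size:
                yield chr(start + offset)
                break
            offset -= size


def take_chars(length, blocks):
    """Join the first `length` characters from the stream."""
    return ''.join(itertools.islice(iter_random_chars(blocks), length))
```

For example, take_chars(10, [(0x0400, 0x04FF), (0x2DE0, 0x2DFF)]) draws from the Cyrillic and Cyrillic Extended-A blocks without materializing either one.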

omaciel / fauxfactory

generic method for all different types of unicode string gens #65