riverrun / genxword

Crossword generator written in Python.
GNU General Public License v3.0

use new ComplexString class to store Unicode scripts #8

Closed: mapmeld closed 8 years ago

mapmeld commented 8 years ago

Hi David,

I was writing a Burmese crossword puzzle in JS and found your generator.

This pull request adds support for more Unicode scripts and could probably help with the Thai version, too. Instead of storing the word as a string, I use a new class (ComplexString) that is divided into "blocks" instead of chars. So the Burmese word for rice, ထမင်း, would be three blocks (ထ + မ + င်း) even though it is five chars. Burmese also has a special kind of stacking character, (င် + ္ + ဂ) = င်္ဂ, and this counts as one block, too. Bengali and Devanagari have something similar.
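As a rough illustration of the block idea (not the actual ComplexString code from this pull request), the sketch below groups each base character with the combining marks that follow it, and treats a virama-style joiner as gluing the next consonant into the same block. It relies on Unicode general categories from Python's `unicodedata` module rather than the regex in the PR. `split_blocks` and `JOINERS` are hypothetical names, and the joiner set is an assumption that only covers Burmese, Devanagari, and Bengali.

```python
import unicodedata

# Assumed set of "stacking" joiners (viramas); not taken from the PR.
JOINERS = {
    "\u1039",  # MYANMAR SIGN VIRAMA
    "\u094d",  # DEVANAGARI SIGN VIRAMA
    "\u09cd",  # BENGALI SIGN VIRAMA
}


def split_blocks(word):
    """Split a word into display blocks instead of code points."""
    blocks = []
    for ch in word:
        # Combining marks have a general category starting with "M".
        is_mark = unicodedata.category(ch).startswith("M")
        # A letter right after a joiner belongs to the same stacked block.
        follows_joiner = bool(blocks) and blocks[-1][-1] in JOINERS
        if blocks and (is_mark or follows_joiner):
            blocks[-1] += ch
        else:
            blocks.append(ch)
    return blocks


rice = "\u1011\u1019\u1004\u103a\u1038"  # the Burmese word for rice
print(len(rice), len(split_blocks(rice)))
```

For the rice example this reports five code points but three blocks, and the stacked sequence (U+1004, U+103A, U+1039, U+1002) comes back as a single block. A crossword grid would then place one block per cell.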

This RegEx, which I've been working on, should accept accents from Latin, Devanagari, Burmese/Myanmar, Tamil, and Bengali scripts.

I wasn't able to test PDF / PNG output because that didn't work on my machine to begin with. But the command line printout looks good.

-- Nick

riverrun commented 8 years ago

This looks great. I want to look into it further, and then I'll get back to you. Thanks for your help.

riverrun commented 8 years ago

I've merged it, but I'm probably going to make a few changes to the implementation of ComplexString. I have a question about the combo_characters - do they usually come after the letter they modify (like the accents)? Thanks for your help.

mapmeld commented 8 years ago

I'm not sure what the linguistic term for them is, but they combine two characters (and their accents) into one:

In Burmese they're stacked vertically: က ္ က = က္က and the middle character ္ is an invisible hint to combine the two... I call that a combo_character

In Devanagari they are merged horizontally by a different character: त ् व = त्व
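To see where the joining character sits in the underlying string, one can list the code points with Python's standard `unicodedata` module. Unicode names both of these joiners "virama", and in each example it is stored between the two consonants it joins. `describe` is a hypothetical helper used only for this illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The Burmese and Devanagari examples above, written as escapes.
for sample in ("\u1000\u1039\u1000", "\u0924\u094d\u0935"):
    for code, name in describe(sample):
        print(code, name)
    print()
```

The middle entry in each group is the joiner (`MYANMAR SIGN VIRAMA` and `DEVANAGARI SIGN VIRAMA`), so a splitter can treat "letter, virama, letter" as one unit.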

riverrun commented 8 years ago

Thanks for the explanation and the examples. I think I understand it now.