machinalis / yalign

A sentence aligner for comparable corpora

Key error in Alignment #9

Open sanjanasri opened 7 years ago

sanjanasri commented 7 years ago

Hi,

I have successfully created the model for another language pair, Tamil and English. But when I try to run the alignment with `python yalign-align -a en -b ta en-ta en.txt ta.txt > aligned.txt`, I get a KeyError:

    Traceback (most recent call last):
      File "yalign-align", line 64, in <module>
        document_b = read_document(args['<document_b>'], lang_b)
      File "yalign-align", line 44, in read_document
        return text_to_document(text, language)
      File "/home/sanjana/Documents/Python_pgms/yalign/yalign/input_conversion.py", line 65, in text_to_document
        splitter = _sentence_splitters[language]
      File "/home/sanjana/Documents/Python_pgms/yalign/yalign/utils.py", line 82, in __missing__
        x = self.default_factory(key)
      File "/home/sanjana/Documents/Python_pgms/yalign/yalign/input_conversion.py", line 51, in <lambda>
        _sentence_splitters = Memoized(lambda lang: nltkload("tokenizers/punkt/%s.pickle" % CODES_TO_LANGUAGE[lang]))
    KeyError: 'ta'

It would be great to get an earnest reply.

PS: NLTK does not support the Tamil language.

rafacarrascosa commented 7 years ago

Tamil is currently not a supported language for nltk and therefore Yalign fails to load the sentence splitter for Tamil.

I would recommend that you hack the _sentence_splitters function in yalign/input_conversion.py to implement a custom sentence splitting algorithm for Tamil. It does not need to be anything fancy: if you preprocess the input to Yalign, it could be as simple as text.split('\n') (i.e., one sentence per line).
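A minimal sketch of what such a line-based splitter could look like. It assumes that Yalign calls `.tokenize(text)` on whatever `_sentence_splitters[language]` returns, as NLTK Punkt tokenizers allow; that has not been checked against the Yalign source. `LineSplitter`, `LINE_SPLIT_LANGUAGES`, and `load_splitter` are hypothetical names used only for illustration.

```python
class LineSplitter(object):
    """Hypothetical splitter: every non-empty line is one sentence."""

    def tokenize(self, text):
        # Mirror the .tokenize(text) method of NLTK Punkt tokenizers
        # (assumed to be what Yalign calls on the splitter).
        return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical set of language codes that have no NLTK Punkt model.
LINE_SPLIT_LANGUAGES = {"ta"}


def load_splitter(lang, load_punkt):
    """Return a sentence splitter for the language code `lang`.

    `load_punkt` stands in for the existing NLTK-based loader; it is
    passed in so this sketch runs without NLTK installed.
    """
    if lang in LINE_SPLIT_LANGUAGES:
        return LineSplitter()
    return load_punkt(lang)
```

Because the per-language lookup is kept, other languages would still go through the NLTK loader, and only the codes listed in `LINE_SPLIT_LANGUAGES` would fall back to one sentence per line.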

sanjanasri commented 7 years ago

Thank you. I did something like this in yalign/input_conversion.py: `_sentence_splitters = text.split("\n")`. It works for other languages, but returns an empty file for Tamil.

If I run the command `python /home/yalign/scripts/yalign-align -a ta -b en ta-en 2.txt 1.txt > aligned.txt`, I get a KeyError:

      File "/home/sanjana/Documents/Python_pgms/yalign/scripts/yalign-align", line 63, in <module>
        document_a = read_document(args['<document_a>'], lang_a)
      File "/home/sanjana/Documents/Python_pgms/yalign/scripts/yalign-align", line 44, in read_document
        return text_to_document(text, language)
      File "build/bdist.linux-x86_64/egg/yalign/input_conversion.py", line 65, in text_to_document
      File "build/bdist.linux-x86_64/egg/yalign/utils.py", line 82, in __missing__
      File "build/bdist.linux-x86_64/egg/yalign/input_conversion.py", line 51, in <lambda>
    KeyError: 'ta'

So I used en instead of ta. I don't get an error, but the output file is empty. I do not know where I am going wrong. Please help.

Thanks and regards,

Sanjanasri J.P

rafacarrascosa commented 7 years ago

I am sorry Sanjanasri, but I have too much work right now to walk you through debugging that output.

If you have some programming skills, my recommendation remains: hack that function. If you do not, perhaps someone from the community can give you a hand.

Regards,

Rafael

simontite-capita-ti commented 7 years ago

Sanjanasri, at line 31 (or thereabouts) of input_conversion.py there is a statement:

    CODES_TO_LANGUAGE = {
        "cs": "czech",
        "da": "danish",
        "de": "german",
        "el": "greek",
        "en": "english",
        "es": "spanish",
        "et": "estonian",
        "fi": "finnish",
        "fr": "french",
        "it": "italian",
        "nb": "norwegian",
        "pl": "polish",
        "pt": "portuguese",
        "nl": "dutch",
        "sv": "swedish",
        "tr": "turkish",
    }

Suggest you add "ta": "tamil" to that. You'll probably find more problems after that, but it should stop the key error at least.
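The failing lookup can be reproduced without NLTK. The sketch below reconstructs the memoizing dictionary from the traceback alone (`__missing__` calling `default_factory(key)`), so this `Memoized` class is an assumption and not the actual code in yalign/utils.py. `resource_path` is a hypothetical helper that only builds the resource name instead of loading it, and the mapping is abbreviated to one entry.

```python
# Abbreviated copy of the mapping quoted above.
CODES_TO_LANGUAGE = {"en": "english"}


class Memoized(dict):
    """Reconstruction from the traceback, not Yalign's real class."""

    def __init__(self, default_factory):
        super(Memoized, self).__init__()
        self.default_factory = default_factory

    def __missing__(self, key):
        # Called by dict on a missing key; compute once and cache.
        x = self.default_factory(key)
        self[key] = x
        return x


def resource_path(lang):
    # Hypothetical helper: builds the Punkt resource name only.
    # An unknown language code raises KeyError on this line.
    return "tokenizers/punkt/%s.pickle" % CODES_TO_LANGUAGE[lang]


paths = Memoized(resource_path)

try:
    paths["ta"]
except KeyError as error:
    missing = error.args[0]  # the code absent from CODES_TO_LANGUAGE

# With the suggested entry added, the dictionary lookup succeeds.
CODES_TO_LANGUAGE["ta"] = "tamil"
tamil_path = paths["ta"]
```

With the entry added, the KeyError goes away, but Yalign would then presumably ask NLTK for `tokenizers/punkt/tamil.pickle`. Since NLTK does not support Tamil, as noted earlier in the thread, that load would be expected to fail next, which is likely one of the further problems mentioned above.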