Open sanjanasri opened 7 years ago
Tamil is currently not a supported language for nltk and therefore Yalign fails to load the sentence splitter for Tamil.
I would recommend hacking the _sentence_splitters function in yalign/input_conversion.py to implement a custom sentence-splitting algorithm for Tamil.
It does not need to be anything fancy; if you preprocess the input to Yalign, it could be as simple as text.split('\n')
(i.e., one sentence per line).
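To make the suggestion concrete, here is a minimal sketch of such a newline-based splitter. The function name and the registration line are hypothetical (the real shape of _sentence_splitters in yalign/input_conversion.py may differ); it only assumes the input has been preprocessed to one sentence per line, as suggested above:

```python
# Minimal sketch of a newline-based sentence splitter for Tamil.
# Assumes the input text was preprocessed to one sentence per line.

def split_tamil_sentences(text):
    """Split preprocessed text into sentences, one per line."""
    return [line.strip() for line in text.split("\n") if line.strip()]

# Hypothetical registration: if _sentence_splitters maps language
# codes to splitter callables, Tamil could be wired in like this:
# _sentence_splitters["ta"] = split_tamil_sentences
```

Blank lines are dropped so trailing newlines do not produce empty "sentences".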
Thank you. I did something like "_sentence_splitters = text.split("\n")" in yalign/input_conversion.py. It works for other languages, but returns an empty file for Tamil.
If I use the command python /home/yalign/scripts/yalign-align -a ta -b en ta-en 2.txt 1.txt > aligned.txt I get a KeyError:

  File "/home/sanjana/Documents/Python_pgms/yalign/scripts/yalign-align", line 63, in <module>
    document_a = read_document(args[' '], lang_a)
  File "/home/sanjana/Documents/Python_pgms/yalign/scripts/yalign-align", line 44, in read_document
    return text_to_document(text, language)
  File "build/bdist.linux-x86_64/egg/yalign/input_conversion.py", line 65, in text_to_document
  File "build/bdist.linux-x86_64/egg/yalign/utils.py", line 82, in __missing__
  File "build/bdist.linux-x86_64/egg/yalign/input_conversion.py", line 51, in <lambda>
KeyError: 'ta'

So I used en instead of ta. Then I don't get an error, but an empty file. I don't know where I am going wrong. Please help.
I am sorry Sanjanasri, but I have too much work right now to walk you through debugging that output.
If you have some programming skills my recommendation remains: hack that function. If you do not, perhaps someone from the community can give you a hand.
Regards,
Rafael
Sanjanasri, at line 31 (or thereabouts) of input_conversion.py is a statement:
CODES_TO_LANGUAGE = {
    "cs": "czech",
    "da": "danish",
    "de": "german",
    "el": "greek",
    "en": "english",
    "es": "spanish",
    "et": "estonian",
    "fi": "finnish",
    "fr": "french",
    "it": "italian",
    "nb": "norwegian",
    "pl": "polish",
    "pt": "portuguese",
    "nl": "dutch",
    "sv": "swedish",
    "tr": "turkish",
}
Suggest you add "ta": "tamil" to that. You'll probably find more problems after that, but it should at least stop the KeyError.
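One of the "more problems" to expect: per the traceback, the mapped name is fed into an NLTK punkt pickle path, and NLTK ships no Tamil punkt model, so the load would still fail with a LookupError. A hedged sketch of a workaround (the helper name is hypothetical, not part of yalign) is to fall back to newline splitting when no punkt model is available:

```python
# Hypothetical helper (not in yalign): choose a sentence splitter.
# Even with "ta": "tamil" added to CODES_TO_LANGUAGE, NLTK has no
# Tamil punkt model, so loading "tokenizers/punkt/tamil.pickle"
# raises LookupError. This sketch falls back to newline splitting.

def get_sentence_splitter(lang, codes_to_language):
    """Return a punkt tokenizer if available, else a newline splitter."""
    try:
        import nltk.data  # may be absent in some environments
        name = codes_to_language[lang]
        tokenizer = nltk.data.load("tokenizers/punkt/%s.pickle" % name)
        return tokenizer.tokenize
    except (ImportError, KeyError, LookupError):
        # Fallback: assume preprocessed text, one sentence per line.
        return lambda text: [l.strip() for l in text.split("\n") if l.strip()]
```

This keeps the existing punkt behaviour for supported languages while letting unsupported codes like "ta" degrade gracefully instead of raising.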
Hi,
Traceback (most recent call last):
  File "yalign-align", line 64, in <module>
    document_b = read_document(args['<document_b>'], lang_b)
  File "yalign-align", line 44, in read_document
    return text_to_document(text, language)
  File "/home/sanjana/Documents/Python_pgms/yalign/yalign/input_conversion.py", line 65, in text_to_document
    splitter = _sentence_splitters[language]
  File "/home/sanjana/Documents/Python_pgms/yalign/yalign/utils.py", line 82, in __missing__
    x = self.default_factory(key)
  File "/home/sanjana/Documents/Python_pgms/yalign/yalign/input_conversion.py", line 51, in <lambda>
    _sentence_splitters = Memoized(lambda lang: nltkload("tokenizers/punkt/%s.pickle" % CODES_TO_LANGUAGE[lang]))
KeyError: 'ta'
It would be great to get an earnest reply. PS: nltk does not support the Tamil language.