ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License
657 stars, 195 forks

A model trained for a non-English language converts the single lowercase "i" into uppercase "I" #61

Open preniqivjosa opened 4 years ago

preniqivjosa commented 4 years ago

Hi, I am using the punctuator2 library to train a model for Albanian, an Indo-European language with a Latin-derived alphabet.

My corpus consists of 206,000 articles from an Albanian magazine, so it is large enough to train the model. I have trained the model successfully and am satisfied with the results. However, when I test the model on arbitrary text, it converts every standalone lowercase "i" into uppercase "I". In Albanian, a standalone "i" within a sentence is a conjunction and should be written in lowercase. This made me think the model is somehow using something pre-trained or hardcoded for English that I am not aware of.

I checked the code (data.py, models.py and main.py) but could not find anything hardcoded for this, apart from the "We.pcl" file referenced in the code, which does not exist on my path since I do not use it. Do you have any suggestion or idea why this is happening?

ottokart commented 4 years ago

Hi,

Are you using the convert_to_readable.py or demo_play_with_model.py script? These two convert only the first letter of the first word in each sentence to uppercase ("Title" case, or .title() in Python).
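To illustrate the difference, here is a minimal sketch of sentence-initial capitalization. It is not the code from convert_to_readable.py or demo_play_with_model.py; the function name `capitalize_sentence_starts` and the sample sentence are hypothetical, and it assumes plain text in which sentences end with ".", "?" or "!". For contrast, it also shows that calling `str.title()` on a whole string uppercases every word, which is one possible way a custom test script could turn a standalone "i" into "I".

```python
SENTENCE_END = ".?!"  # assumed sentence-final punctuation marks


def capitalize_sentence_starts(text):
    """Uppercase only the first letter of the first word of each sentence."""
    out = []
    capitalize_next = True  # the very first word starts a sentence
    for token in text.split():
        if capitalize_next:
            # Touch only the first character; leave the rest of the word as-is.
            token = token[:1].upper() + token[1:]
        out.append(token)
        # The next word starts a sentence if this one ends with ., ? or !
        capitalize_next = token[-1] in SENTENCE_END
    return " ".join(out)


sample = "ky është libri i ri. ai erdhi sot"

# Sentence-initial capitalization keeps the standalone "i" in lowercase.
print(capitalize_sentence_starts(sample))

# str.title() applied to the whole string capitalizes every word, including "i".
print(sample.title())
```

If a custom script shows the uppercase "I" inside sentences, checking where it applies `.title()`, `.capitalize()` or `.upper()` is a reasonable first step.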

preniqivjosa commented 4 years ago

Hi @ottokart, thank you for the reply! I was using a different script that I had written to test the model; the problem does not occur when I use demo_play_with_model.py.