libindic / indic-trans

The project aims to add a state-of-the-art transliteration module for cross-transliteration among all Indian languages, including English.
GNU Affero General Public License v3.0

Get better results after human validation #36

Open simonefrancia opened 5 years ago

simonefrancia commented 5 years ago

Hi, we are using indic-trans to transliterate from Hindi to Roman/English. Your model gives good results in general, but there are still some errors, as some Hindi speakers have pointed out to us:

चैन should be chain, not chaiyn
कमल should be kamal, not camel
मिलने should be milne, not milane

Any advice on getting better results? For example, increasing the training set, or choosing between beamsearch and Viterbi decoding?

Thanks

irshadbhat commented 5 years ago

Beamsearch will definitely help in this case. It returns n outputs (by default n is 5), and in most cases the desired transliteration is either the first or the second.

>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', decode='beamsearch')
>>> trn.transform(u'चैन')
[u'chaiyn', u'chain', u'chann', u'chen', u'chan']
>>> trn.transform(u'कमल')
[u'camel', u'kamal', u'camal', u'kamel', u'comel']
>>> trn.transform(u'मिलने')
[u'milane', u'milne', u'miline', u'milene', u'mine']

As you can see, the expected result is the second output in all three cases.

Adding more training data might not help in this case. Since Roman is not the native script for Hindi, one can choose any spelling. For example, for the word बहुत the actual pronunciation is bahut, but most Hindi speakers (including me) prefer the spelling bohat. So bohut, bahut, and bohat are all correct to me. Calling the above transliterations erroneous therefore doesn't seem right; they are simply among the possible transliterations.

That said, the system can certainly fail in some cases. After all, it is a machine that learned its parameters from training data, not a human doing the transliteration, so expecting 100% accuracy is not reasonable.

simonefrancia commented 5 years ago

OK, thanks for the clear response. Do you think it's a good idea to consider a transliteration reliably good if the Viterbi output matches one of the n beamsearch outputs?

irshadbhat commented 5 years ago

I don't think so. The Viterbi output matches the first beamsearch output in almost all cases.
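For illustration, here is a quick way to check that claim on any word (a sketch of mine, not from the original discussion; it assumes Viterbi is the default decoder, as in the examples below):

from indictrans import Transliterator

# Viterbi is the default single-best decoder; beamsearch returns an n-best list.
trn_vit = Transliterator(source='hin', target='eng')
trn_beam = Transliterator(source='hin', target='eng', decode='beamsearch')

word = u'कमल'
print(trn_vit.transform(word))      # single best output
print(trn_beam.transform(word)[0])  # top beamsearch candidate, usually identical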

You can instead use back-transliteration to estimate the quality of a target transliteration, by comparing its back-transliteration with the source word.

>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', decode='beamsearch')
>>> trn_revr = Transliterator(source='eng', target='hin')  # back-transliteration
>>> trn.transform(u'मिलने')
[u'milane', u'milne', u'miline', u'milene', u'mine']
>>> print(trn_revr.transform('milane'))
मिलाने
>>> print(trn_revr.transform('milne'))
मिलने

As you can see, in the above example the second beamsearch output back-transliterates correctly to the original word while the first one does not, so in this case you can prefer the second output over the first.
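A minimal sketch of how that check could be automated (the helper pick_by_backtransliteration is my own name, not part of indictrans; it assumes the beamsearch list is ordered best-first, as in the examples above):

from indictrans import Transliterator

trn = Transliterator(source='hin', target='eng', decode='beamsearch')
trn_revr = Transliterator(source='eng', target='hin')  # back-transliteration

def pick_by_backtransliteration(word):
    # Hypothetical helper: prefer the first candidate whose
    # back-transliteration round-trips to the source word.
    candidates = trn.transform(word)  # n-best list, best-first
    for cand in candidates:
        if trn_revr.transform(cand) == word:
            return cand
    return candidates[0]  # nothing round-trips; keep the top candidate

print(pick_by_backtransliteration(u'मिलने'))  # expected: milne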

simonefrancia commented 5 years ago

Thank you very much! I mainly use the terminal commands and work with text at the sentence level. So, given a sentence, how can I choose the best transliteration for each tokenized word? And how can I tokenize a Hindi sentence?

irshadbhat commented 5 years ago

You can tokenize Hindi text using polyglot-tokenizer.
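For example, a minimal sketch of sentence-level processing (it assumes the Tokenizer interface from polyglot-tokenizer's README; the example sentence is mine, reusing words from this thread):

from polyglot_tokenizer import Tokenizer
from indictrans import Transliterator

tk = Tokenizer(lang='hi')  # 'hi' = Hindi
trn = Transliterator(source='hin', target='eng')

sentence = u'मुझे चैन मिलने लगा'
tokens = tk.tokenize(sentence)  # per-word tokens, ready for transliteration
print(' '.join(trn.transform(tok) for tok in tokens))

You can also transliterate each token with decode='beamsearch' and pick a candidate per token, for example with the back-transliteration check sketched above.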

simonefrancia commented 5 years ago

Great! I think I have everything I need to continue. Thanks

Regards

simonefrancia commented 5 years ago

Sorry, one last question: how can I plug polyglot-tokenizer into indictrans so that it works inside the binary command? Thanks

simonefrancia commented 5 years ago

Hi, I would like to learn more about this repo. I have a problem choosing the correct transliteration from an Indic script to Roman. I followed your suggestion, and here is a summary of my approach.

I have the word ಕವನ in Kannada, and if I ask for the top candidates with beamsearch (n=5), I get these results:

OUTPUT=kavan,cuvan,kavana,covan,kuvan

So how do I choose the "best" one? I do what you suggested: I back-transliterate every word in OUTPUT and check whether each back-transliteration matches the original input word, in this case ಕವನ. Doing so, kavan and cuvan are accepted in this example, while kavana, covan, and kuvan are discarded. Google's transliteration for ಕವನ is kavana, but the tool discards it. How can I change this behavior?

Thanks

irshadbhat commented 5 years ago

For the original question my suggestion was not to add more training data but to find a workaround, because the language pair under consideration was hin-eng: the current hin-eng model is trained on around 100k word pairs. The kan-eng model, by comparison, is trained on only 10k pairs, so for kan-eng adding more training data might help. You can go through my blog to learn how to train a new system.

simonefrancia commented 5 years ago

OK, so I think this is the link: http://irshadbhat.github.io/rom-ind/ Do you know where I can find a large corpus in order to train from scratch? I would also like to know which language pairs are considered reliable. Thanks

irshadbhat commented 5 years ago

If you read the blog, I have mentioned a couple of sources from which I collected/generated the training data. Apart from those, you can search online for additional data. The data I used for training is auto-extracted, not gold-annotated, so it is not 100% correct. If you create some data yourself (say another 10k word pairs), that will give you a much better model.

Regarding your second question, "which language pairs are considered reliable?": a model's reliability is highly relative; whether you consider the output good or bad depends on the downstream task. But since you asked, the best-performing models are hin-urd, hin-eng, and urd-eng, followed by ben-eng and hin-ben. The rest are less accurate than these (mainly because of less training data).

simonefrancia commented 5 years ago

Hi, we are facing a Tamil transliteration problem (tam-eng), and I would like to know which phonetic transliteration scheme was used for training: Azhagi or Jaffna, if I'm not mistaken. We are getting feedback from our translation checkers, and at the moment the results are not so good; we would like to know whether it could be a model problem, or whether our checkers refer to a phonetic system different from the one used for model training. Thanks