simonefrancia opened 5 years ago
`beamsearch` will definitely help in this case. Beamsearch returns *n* outputs (*n* is 5 by default), and in most cases the desired transliteration is either the first or the second.
```python
from indictrans import Transliterator

trn = Transliterator(source='hin', target='eng', decode='beamsearch')
trn.transform(u'चैन')
# [u'chaiyn', u'chain', u'chann', u'chen', u'chan']
trn.transform(u'कमल')
# [u'camel', u'kamal', u'camal', u'kamel', u'comel']
trn.transform(u'मिलने')
# [u'milane', u'milne', u'miline', u'milene', u'mine']
```
As you can see, the expected result is the second output in all three cases.
Adding more training data might not help here. Since Roman is not the original script for Hindi, one can choose any spelling: for example, for the word बहुत the actual pronunciation is *bahut*, but most Hindi speakers (including me) prefer the spelling *bohat*. So *bohut*, *bahut*, and *bohat* are all correct for me. Calling the above transliterations erroneous doesn't seem right; they are simply some of the possible transliterations.
That said, the system can fail in some cases. After all, it is a machine that learned some parameters from some training data, not a human doing the transliteration. Expecting 100% accuracy does not seem reasonable.
Ok, thanks for the clear response.
Do you think it's a good idea to consider a transliteration certainly good if the Viterbi output matches one of the *n* outputs of `beamsearch`?
I don't think so. The Viterbi output matches the first output of `beamsearch` in almost all cases.
You can use back-transliteration to estimate the quality of the target transliteration by comparing its back-transliteration with the source word.
```python
from indictrans import Transliterator

trn = Transliterator(source='hin', target='eng', decode='beamsearch')
trn_revr = Transliterator(source='eng', target='hin')  # back-transliteration
trn.transform(u'मिलने')
# [u'milane', u'milne', u'miline', u'milene', u'mine']
print(trn_revr.transform('milane'))
# मिलाने
print(trn_revr.transform('milne'))
# मिलने
```
As you can see, in the above example the second `beamsearch` output back-transliterates correctly to the original word while the first does not, so in this case you can prefer the second output over the first.
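The selection rule described above can be sketched in plain Python. This is a minimal sketch, not part of indictrans: the `BEAMSEARCH` and `BACK` dicts below are stand-ins for calls to `trn.transform` (with `decode='beamsearch'`) and `trn_revr.transform`; only the two back-transliterations actually shown in this thread are included.

```python
# Beamsearch candidates for the source word, as shown earlier in the thread.
BEAMSEARCH = {u'मिलने': [u'milane', u'milne', u'miline', u'milene', u'mine']}

# Back-transliterations; only the two values demonstrated above are known,
# so the remaining candidates are simply absent from the mapping.
BACK = {u'milane': u'मिलाने', u'milne': u'मिलने'}

def best_candidate(word, candidates, back_transform):
    """Return the first beamsearch candidate whose back-transliteration
    reproduces the source word; fall back to the top candidate."""
    for cand in candidates:
        if back_transform(cand) == word:
            return cand
    return candidates[0]

word = u'मिलने'
print(best_candidate(word, BEAMSEARCH[word], BACK.get))  # -> milne
```

In real use, `back_transform` would be `trn_revr.transform`; the fallback to the top candidate keeps the pipeline total when no back-transliteration matches.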
Thank you very much! Mainly I am using terminal commands, and I work with text at the sentence level. So, if I consider a sentence, how can I choose the best transliteration for every tokenized word? How can I tokenize a Hindi sentence?
You can tokenize Hindi text using polyglot-tokenizer.
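Putting the two answers together, a sentence-level pipeline is: tokenize, then pick the best transliteration per token via the back-transliteration check. The sketch below uses `str.split` as a stand-in for polyglot-tokenizer, and dicts as stand-ins for the indictrans `Transliterator` calls; the back-transliteration for `chain` is a hypothetical value added for illustration.

```python
# Stand-in beamsearch candidates (real code: trn.transform(token)).
BEAMSEARCH = {u'मिलने': [u'milane', u'milne'], u'चैन': [u'chaiyn', u'chain']}

# Stand-in back-transliterations (real code: trn_revr.transform(candidate)).
# milane/milne come from the thread; chain -> चैन is hypothetical.
BACK = {u'milane': u'मिलाने', u'milne': u'मिलने', u'chain': u'चैन'}

def transliterate_sentence(sentence):
    out = []
    for tok in sentence.split():            # real code: polyglot-tokenizer
        cands = BEAMSEARCH.get(tok, [tok])  # real code: trn.transform(tok)
        # Prefer the candidate whose back-transliteration matches the token.
        best = next((c for c in cands if BACK.get(c) == tok), cands[0])
        out.append(best)
    return u' '.join(out)

print(transliterate_sentence(u'चैन मिलने'))  # -> chain milne
```

A proper tokenizer matters for real Hindi text (punctuation, Devanagari danda), which is why polyglot-tokenizer is recommended over whitespace splitting.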
Great! I think I have everything I need to continue. Thanks.
Regards
Sorry, one last question: how can I include polyglot-tokenizer in indictrans so that it works inside the binary command? Thanks
Hi, I would like to learn more about this repo. I have some problems choosing the correct transliteration from an Indic script to its romanization. I followed your suggestion, and here I will summarize my approach.
I have the word ಕವನ in Kannada, and if I ask for the most likely candidates with beamsearch (n=5), I get these results:

OUTPUT = kavan, cuvan, kavana, covan, kuvan
So how do I choose the "best" one? I do what you suggested: I back-transliterate every word in OUTPUT and check whether each back-transliteration matches the initial input word, in this case ಕವನ.
This way, in this example, `kavan` and `cuvan` are accepted, but `kavana`, `covan`, and `kuvan` are discarded.
Google's transliteration for ಕವನ is *kavana*, but the tool discards it.
How can I modify this behavior of the tool?
Thanks
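One possible workaround (an editorial sketch, not a feature of indictrans) is to relax the exact-match check: accept a candidate whose back-transliteration is within a small edit distance of the source word instead of requiring equality. The back-transliteration values in the dict below are hypothetical stand-ins for real `trn_revr.transform` calls.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical back-transliterations for the five candidates above.
BACK = {
    u'kavan':  u'ಕವನ',     # exact match (accepted today)
    u'cuvan':  u'ಕವನ',     # exact match (accepted today)
    u'kavana': u'ಕವನಾ',    # one character off (hypothetical)
    u'covan':  u'ಕೋವಾನ',   # farther away (hypothetical)
    u'kuvan':  u'ಕುವಾನ',   # farther away (hypothetical)
}

def accepted(src, candidates, back, max_dist=1):
    """Keep candidates whose back-transliteration is within max_dist edits
    of the source word, instead of demanding an exact match."""
    return [c for c in candidates if edit_distance(back[c], src) <= max_dist]

cands = [u'kavan', u'cuvan', u'kavana', u'covan', u'kuvan']
print(accepted(u'ಕವನ', cands, BACK))  # -> ['kavan', 'cuvan', 'kavana']
```

With a tolerance of one edit, a candidate like *kavana*, whose back-transliteration differs from the source only by a final vowel sign, would survive the filter while more distant candidates still get discarded.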
Based on the original question, my suggestion was not to add more training data but rather to find a workaround; however, that was because the language pair under consideration was hin-eng. I suggested this for hin-eng because its current model is trained on around 100k pairs. In comparison, the kan-eng model is trained on only 10k pairs. So, for kan-eng, adding more training data might help. You can go through my blog to learn how to train a new system.
Ok, so I think this is the link: http://irshadbhat.github.io/rom-ind/. Do you know where I can find a large corpus to do the training from scratch? I would also like to know which language pairs are considered reliable. Thanks
If you read the blog, I have mentioned a couple of sources from which I collected/generated the training data. Apart from those, you can search online for additional data. The data I used for training was automatically extracted, not gold annotated, so it is not 100% correct. If you create some data yourself (say, another 10k word pairs), that will give you a much better model.
Regarding your second question, "which language pairs are considered reliable?": the reliability of a model is highly relative; whether you consider the output good or bad depends on the downstream task. But since you asked, the best-performing models are hin-urd, hin-eng, and urd-eng, followed by ben-eng and hin-ben. The rest are all less accurate than these (mainly because of less training data).
Hi, we are facing a Tamil transliteration problem (tam-eng), and I would like to know which phonetic transliteration scheme was used for training: Azhagi or Jaffna, if I'm not wrong.
We are getting feedback from our translation checkers, and at the moment the results are not so good; we would like to know whether it could be a model problem or whether our checkers refer to a phonetic system different from the one used for model training.
Thanks
Hi, we are using indic-trans to transliterate from Hindi to Roman/English. After applying your model, I get good results in general, but there are still some errors, as some Hindi speakers have pointed out to us.
Any advice on getting better results? For example, increasing the training set, or choosing between beamsearch and Viterbi?
Thanks