lkluo closed this issue 5 years ago.
Thanks @martinpopel. Not sure if I understand your suggestion correctly:
I am thinking of the following conversion:
The reason I am considering converting dates is to handle ambiguity. For instance, the number in the sentence "in 2018" is actually a year rather than a regular number, and its translation into another language differs from that of a plain number (which is simply copied over).
In fact, I have run an experiment replacing all numbers/dates with placeholders. In general, the BLEU score improves a lot for some sentences; however, it did worse on sentences where the named entities were not well recognised.
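The placeholder pre-processing described above could look roughly like this (a minimal sketch; the `_NUM_`/`_DATE_` tokens and the regex patterns are my assumptions, not the exact scheme used in the experiment):

```python
import re

# Hypothetical placeholder tokens; the thread does not say which were used.
NUM_PLACEHOLDER = "_NUM_"
DATE_PLACEHOLDER = "_DATE_"

# Very rough date pattern (4-digit years); a real system would use NER.
DATE_RE = re.compile(r"\b(1\d{3}|2\d{3})\b")
NUM_RE = re.compile(r"\b\d[\d,\.]*\b")

def mask(sentence):
    """Replace dates first, then remaining numbers, returning the masked
    sentence plus the original values needed to restore them later."""
    found = []

    def _date(m):
        found.append(("DATE", m.group(0)))
        return DATE_PLACEHOLDER

    def _num(m):
        found.append(("NUM", m.group(0)))
        return NUM_PLACEHOLDER

    masked = DATE_RE.sub(_date, sentence)
    masked = NUM_RE.sub(_num, masked)
    return masked, found

masked, found = mask("The company earned 200,000,000 in 2018.")
```

To restore after translation, the recorded `found` list would be substituted back in order, which is exactly where the recognition errors mentioned above can hurt.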
replacing all numbers/dates with placeholders, and in general, the BLEU score improves a lot
And did you reconstruct the numbers from the placeholders before measuring BLEU, or did you change the reference to use placeholders?
You can try using NER to detect numbers representing a year and replace them e.g. with
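For illustration, a crude context heuristic can stand in for a real NER system when deciding whether a 4-digit number is a year (the cue-word list and the `_YEAR_` token here are hypothetical, not from the thread):

```python
import re

# A lightweight stand-in for NER: treat a 4-digit number as a year
# only when it follows a cue word such as "in" or "since".
YEAR_RE = re.compile(
    r"\b(?:in|since|by|until|from)\s+(1[5-9]\d{2}|20\d{2})\b",
    re.IGNORECASE,
)

def mask_years(sentence):
    return YEAR_RE.sub(
        lambda m: m.group(0).replace(m.group(1), "_YEAR_"), sentence
    )

print(mask_years("The model was trained in 2018 on 2018 sentences."))
# only the contextual "in 2018" is masked
```

A trained NER model would of course generalise far better than a cue-word list, which is exactly the "good NER system" dependency discussed below.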
@martinpopel: The BLEU was measured before post-editing, and the reference was also pre-processed (i.e., using placeholders). I shall re-calculate BLEU after post-editing and use the original reference to make a fair comparison. I am aware that the Transformer can learn "2018" from its context; what I was thinking was to give the model a little prior information so that it could focus on learning other things, and also to shorten sentences, which saves training time, memory, etc. My next step is to leave dates as they originally are and do minimal conversion for numbers (it has to be done because most of my training data contains a lot of numbers, e.g., 200,000,000, as well as dates).
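One possible reading of "minimal conversion" for numbers is digit splitting, so that rare long numbers decompose into the same ten digit tokens (this is an assumption on my part, not necessarily what is intended here):

```python
def split_digits(sentence):
    """Space-separate every digit so a rare number like 200,000,000
    is built from common single-digit tokens instead of one rare token.
    A sketch of one possible "minimal conversion", not the author's scheme."""
    out = []
    for ch in sentence:
        if ch.isdigit():
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    # collapse the double spaces introduced at digit boundaries
    return " ".join("".join(out).split())

print(split_digits("200,000,000 in 2018"))
# → "2 0 0 , 0 0 0 , 0 0 0 in 2 0 1 8"
```

The conversion is trivially reversible, so it avoids the lossy-placeholder problem, at the cost of longer sequences for number-heavy sentences.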
One more interesting thing I found was that the subword tokenizer will convert any word (even unknown ones) into subpieces. Therefore, the translations never contain an unknown token (e.g.,
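This open-vocabulary behaviour can be illustrated with a toy greedy longest-match segmenter (just a sketch of the idea, not tensor2tensor's actual SubwordTextEncoder):

```python
def subword_segment(word, vocab):
    """Greedy longest-match segmentation. Any word can always be split,
    so no token is ever unknown: single characters fall back to
    themselves even when they are not in the vocabulary."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # single chars always allowed
                pieces.append(piece)
                i = j
                break
    return pieces

vocab = {"trans", "form", "er"}
print(subword_segment("transformer", vocab))
# → ['trans', 'form', 'er']
```

A completely unseen word simply falls apart into characters, which is why no `<UNK>` token ever appears in the output.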
This may not be directly related to tensor2tensor, but I am curious to what extent NER could improve translation quality in general. Here are some examples where NER could apply.
The benefit of doing this is to shorten sentences, thus yielding a simpler sentence structure. On the other hand, there are also downsides. For example, it relies on a good NER system. It may also cause trouble when post-processing those NEs after translation into another language; one issue is retaining the order of the NEs as in the source sentence.
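One common way to keep track of source order is to use indexed placeholders, which survive target-side reordering (a sketch; the `_NE_k_` tokens and the toy NE detector are hypothetical):

```python
def mask_indexed(tokens, is_ne):
    """Replace each NE token with _NE_k_ and remember the mapping,
    so the original NEs can be restored regardless of target word order."""
    mapping, out = {}, []
    for tok in tokens:
        if is_ne(tok):
            key = "_NE_%d_" % len(mapping)
            mapping[key] = tok
            out.append(key)
        else:
            out.append(tok)
    return out, mapping

def restore(translated_tokens, mapping):
    """Substitute the original NEs back into the translated token stream."""
    return [mapping.get(tok, tok) for tok in translated_tokens]

# Toy NE detector: capitalised tokens (a real system would use NER).
src, mapping = mask_indexed(["Alice", "met", "Bob"], lambda t: t[0].isupper())
# pretend the MT system reordered the placeholders in its output:
print(restore(["_NE_1_", "met", "_NE_0_"], mapping))
# → ['Bob', 'met', 'Alice']
```

Because each placeholder carries its index, the target sentence may permute them freely and restoration still recovers the right entities, sidestepping the ordering problem mentioned above.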
Any comments or suggestions?