tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Is it a good idea to apply NER for translation #1106

Closed lkluo closed 5 years ago

lkluo commented 6 years ago

This may not be directly related to tensor2tensor, but I am curious to what extent NER could improve translation quality in general. Here are some examples where NER could be applied.

  1. numbers, especially long ones, e.g., 100,000,000. To my understanding, such a number is split into a series of tokens, each being a single digit. If it could be replaced by a special token such as _number, the length would be reduced to one (see the sketch below).
  2. dates, for example 2 October 2018, would also become a single token if converted properly.

The benefit of doing this is to shorten sentences and thus yield a simpler sentence structure. On the other hand, there are downsides. For example, it relies on a good NER system. It may also cause trouble when post-processing the NEs after translation into another language, for instance preserving the order of the NEs as in the source sentence.
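A minimal sketch of this kind of pre-processing (the regex patterns, the _number_/_date_ token names, and the date format are illustrative assumptions, not an existing T2T feature):

```python
import re

# Hypothetical placeholder tokens; they would also have to be reserved in the
# subword vocabulary so they stay single units.
NUMBER_TOKEN = "_number_"
DATE_TOKEN = "_date_"

# Rough patterns: digits with optional thousands separators / decimals,
# and dates of the form "2 October 2018".
_NUMBER_RE = re.compile(r"\b\d{1,3}(?:,\d{3})+(?:\.\d+)?\b|\b\d{2,}(?:\.\d+)?\b")
_DATE_RE = re.compile(
    r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{4}\b")

def mask_entities(sentence):
    # Replace dates first (they contain numbers), then the remaining numbers.
    sentence = _DATE_RE.sub(DATE_TOKEN, sentence)
    return _NUMBER_RE.sub(NUMBER_TOKEN, sentence)

print(mask_entities("The budget of 100,000,000 was approved on 2 October 2018."))
# -> "The budget of _number_ was approved on _date_."
```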

Any comments, suggestions?

martinpopel commented 6 years ago
  1. You can try replacing numbers with a placeholder (e.g. keeping single-digit numbers, and replacing other numbers with pattern tags, e.g. 123.45 -> ###.##) as a preprocessing step (already before training the subword vocabulary); a rough sketch is given after this list. It may be a bit of work to extract the word alignment so you can replace the translated placeholders back with the numbers in post-processing, so that you also handle correctly the cases where the word order of two numbers (with the same number of digits) is swapped. Note also that for some language pairs you need to "translate" numbers as well, e.g. if the target language uses a decimal comma instead of a decimal dot. Note also that the default T2T solution with subwords works quite well and I haven't seen any errors related to translating numbers in Transformer output. Of course, localization is another question, e.g. if you want to convert miles to kilometers, etc.
  2. Dates need to be translated for most language pairs. Moreover, the numbers in dates are restricted (1-31). "2 October 2018" will surely be encoded as three subwords, which is OK. I think trying to convert dates into one token and translate them by rules will do more harm than good.
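A rough sketch of the digit-masking idea from point 1 together with an order-based restoration in post-processing (the function names and the left-to-right restoration heuristic are illustrative assumptions; real word alignment is more involved):

```python
import re

_MULTIDIGIT_RE = re.compile(r"\d{2,}")  # single-digit numbers are left untouched

def mask_digits(sentence):
    # Replace each digit of a multi-digit number with '#', keeping the pattern
    # (separators, decimal points and percent signs stay where they are).
    return _MULTIDIGIT_RE.sub(lambda m: "#" * len(m.group()), sentence)

def restore_digits(translation, source):
    # Naive post-processing: copy the source numbers back into the translated
    # placeholders in left-to-right order. This assumes the placeholder count
    # and order match the source, which swapped word order can break.
    numbers = iter(_MULTIDIGIT_RE.findall(source))
    return re.sub(r"#+", lambda m: next(numbers, m.group()), translation)

print(mask_digits("Prices rose by 123.45 (about 2%) to 100,000."))
# -> "Prices rose by ###.## (about 2%) to ###,###."
```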
lkluo commented 6 years ago

Thanks @martinpopel. Not sure if I understand your suggestion correctly:

  1. for single-digit numbers, keep them with no conversion, e.g., 1 -> 1, 9 -> 9
  2. for other numbers, replace every digit with a tag to preserve the pattern, e.g., 123.45 -> ###.##, 1% -> #%, 100,000 -> ###,###?

I am considering the following conversion instead:

  1. regular numbers with more than one digit are replaced by #, e.g., 20002 -> #
  2. multiple #s are replaced by a single #, e.g., ###.## -> #.#, to simplify the tokens (a combined sketch is below).
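If this collapsed variant is preferred, the two steps above reduce to a single substitution; a minimal sketch (regex only, no real NER):

```python
import re

def collapse_numbers(sentence):
    # Replace every multi-digit run with a single '#' (single digits are kept),
    # so 20002 -> "#" and 123.45 -> "#.#". Note that the digit count is lost,
    # so exact restoration needs the source sentence.
    return re.sub(r"\d{2,}", "#", sentence)

print(collapse_numbers("20002 items cost 123.45"))  # -> "# items cost #.#"
```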

The reason I am considering converting dates is to handle ambiguity: for instance, the number in the sentence "in 2018" is actually a year rather than a regular number, and its translation into another language differs from that of a regular number (which is simply copied over).

In fact, I have run an experiment replacing all numbers/dates with placeholders. In general, the BLEU score improves a lot for some sentences; however, it gets worse for sentences where the named entities are not well recognised.

martinpopel commented 6 years ago

> replacing all numbers/dates with placeholders, and in general, the BLEU score improves a lot

And did you reconstruct the numbers back from the placeholders before measuring BLEU, or did you change the reference to use placeholders?

You can try using NER to detect numbers representing a year and replace them with e.g. a special year placeholder. This could help to improve old-style PB-SMT, but I doubt you will get any improvements in modern NMT with this. In my experience, Transformer is clever enough to figure out that "in 2018" is a year and translates it correctly (e.g. adding the word "year" for target languages where it is required).
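For completeness, a toy stand-in for the year detection (a context-based regex rather than real NER; the _year_ token and the preposition list are assumptions):

```python
import re

YEAR_TOKEN = "_year_"  # hypothetical placeholder, reserved in the vocabulary

# Crude heuristic: a four-digit number in a plausible range, preceded by a
# preposition that typically introduces a year.
_YEAR_RE = re.compile(r"\b(in|since|until|by)\s+((?:1[5-9]|20)\d{2})\b", re.IGNORECASE)

def mask_years(sentence):
    return _YEAR_RE.sub(lambda m: m.group(1) + " " + YEAR_TOKEN, sentence)

print(mask_years("The law was adopted in 2018."))
# -> "The law was adopted in _year_."
```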

lkluo commented 6 years ago

@martinpopel: The BLEU was measured before post-editing, and the reference was also pre-processed (i.e., using placeholders). I shall re-calculate BLEU after post-editing and use the original reference to make a fair comparison. I am aware that Transformer can learn "2018" from its context; what I was thinking was to give the model a little bit of prior information so that it could focus on learning other things, and in addition to shorten sentences, which saves training time, memory, etc. My next step is to leave dates as they originally are and do a minimal conversion for numbers (it has to be done because most of my training data contains a lot of numbers, e.g., 200,000,000, as well as dates).
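One way to make the comparison fair, assuming a restoration function like restore_digits from the earlier sketch and the sacrebleu package (both are illustrative choices, not something T2T does for you):

```python
import sacrebleu  # any BLEU implementation works; sacrebleu is just an example

def fair_bleu(masked_outputs, sources, original_references, restore_fn):
    # Restore the placeholders in the system outputs from the source numbers
    # first, then score against the untouched references.
    restored = [restore_fn(out, src) for out, src in zip(masked_outputs, sources)]
    return sacrebleu.corpus_bleu(restored, [original_references]).score
```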

One more interesting thing I found is that the subword encoder converts any word (even an unknown one) into subpieces. Therefore, the translations never contain an unknown token (e.g., an UNK symbol). What I did was to detect unknown tokens before subword segmentation and convert them into a proper placeholder. I am not sure if T2T has an alternative way to deal with this?
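A minimal sketch of the pre-subword filtering described above (the character-alphabet heuristic and the _unk_ token name are assumptions about how one might do it, not T2T behaviour):

```python
UNK_TOKEN = "_unk_"  # hypothetical placeholder, reserved in the subword vocabulary

def build_alphabet(training_sentences):
    # Characters ever seen in the training data.
    return set("".join(training_sentences))

def replace_unknown(sentence, alphabet):
    # Map any token containing an unseen character to the placeholder before
    # the subword encoder sees it.
    tokens = sentence.split()
    return " ".join(tok if set(tok) <= alphabet else UNK_TOKEN for tok in tokens)

alphabet = build_alphabet(["the quick brown fox", "jumps over 12 lazy dogs"])
print(replace_unknown("the fox says 你好", alphabet))  # -> "the fox says _unk_"
```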