modernmt / modernmt

Neural Adaptive Machine Translation that adapts to context and learns from corrections.
http://www.modernmt.eu/
Apache License 2.0

What exactly is happening during the pre-processing phase? #359

Closed mzeidhassan closed 6 years ago

mzeidhassan commented 6 years ago

Hi @davidecaroselli ,

Can you please clarify what happens during the pre-processing phase? Are all punctuation marks (question marks, exclamation marks, periods, etc.) removed during this step, so that only clean data is used during training?

If this is the case, I would expect the two strings below to get the same translation:

  1. are there any special schemes for getting discounts?
  2. are there any special schemes for getting discounts

I tested it against your online engines and I got 2 different translations:

With question mark, I get: existe-t-il des régimes spéciaux pour obtenir des rabais?

Without it, I get: y a-t-il des régimes spéciaux pour obtenir des rabais

It seems both convey the same meaning, but they use different terms.

If you capitalize the first letter of the English string and strip the question mark, you will get: Existe-t-il des régimes spéciaux pour obtenir des rabais

If you make it all lower-case, you will get this:

y a-t-il des régimes spéciaux pour obtenir des rabais

Unfortunately, I don't know French, so I am not sure how different they are, but they are different. Maybe you will get the same results if you test it with Italian.

If you put the same English string into Google Translate, it doesn't matter whether there is a question mark or not.

Actually, if you add a question mark to the English, Google Translate will just capitalize the first French word, but the translation is the same.

Another example:

EN: can a family get covered in floater top up policy

If you test it in your EN-FR engine, you will see different results if you capitalize the first letter 'Can'.

With a lower-case 'c', you will get: peut être couvert par une famille dans la politique du haut de la flotte

With 'Can', you will get: Une famille peut-elle être couverte dans la politique de haut niveau

Any idea why?

Do you need to integrate an NLP library like spaCy into MMT to do some pre-processing before sending the source string to the decoder? I'm not sure, but I would like to understand what is causing this issue.
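For example, I was imagining a normalization pass like this before the decoder (just a sketch using spaCy to illustrate what I mean by pre-processing, not what MMT actually does):

```python
import spacy

# Hypothetical normalization pass before sending the source to the decoder
# (requires: pip install spacy && python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

def normalize(text):
    doc = nlp(text)
    # Lower-case every token and drop punctuation.
    return " ".join(tok.text.lower() for tok in doc if not tok.is_punct)

print(normalize("Are there any special schemes for getting discounts?"))
# -> "are there any special schemes for getting discounts"
```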

Thanks, Mohamed

davidecaroselli commented 6 years ago

Hi @mzeidhassan

In general, you can find all the pre-processing steps executed in the pipeline here: preprocessor-default.xml.

On punctuation: we do not strip any punctuation, and it would be a problem if we did. In many languages (including Italian, for example) the exact same sentence can be interrogative or affirmative depending on whether or not there is a '?' at the end.
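A tiny illustration of why (not MMT code, just a sketch): stripping punctuation would make the affirmative and the interrogative form of the same Italian sentence indistinguishable for the decoder.

```python
import string

def strip_punct(s):
    # Remove every punctuation character and trailing whitespace.
    return s.translate(str.maketrans("", "", string.punctuation)).strip()

affirmative = "Questo è un esempio."
interrogative = "Questo è un esempio?"

# After stripping punctuation, the two sentences become identical,
# so the decoder could never tell the question from the statement.
print(strip_punct(affirmative) == strip_punct(interrogative))  # True
```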

I asked a native French speaker to comment on the two forms "existe-t-il des" and "y a-t-il des": both are correct, and both are interrogative forms (in other words, both translate "are there"). The point here is that the second sentence, "are there any special schemes for getting discounts", is still interrogative even though it contains an error: the missing '?' at the end of the line.

Why do they use different terms? They are different sentences, and the second one is definitely rare (because it contains an error). The result is simply the inference of the neural network we use.

Even Google Translate, on other examples, changes the result depending on the question mark in the source sentence:

EN "Is this an example?" > IT "Questo è un esempio?" EN "Is this an example" > IT "È un esempio" EN "is this an example?" > IT "è questo un esempio?" EN "is this an example" > IT "è questo un esempio"

On lower-case/upper-case: we do not apply any transformation to the letter casing. So "Can" and "can" are two different words; neural networks are often very good at understanding that the meanings of the two words are very close. Sometimes, however, this can result in small probability changes that produce two different translations with the same meaning.
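To make the point concrete, this is roughly what happens at the vocabulary level when no case normalization is applied (a toy sketch, not MMT internals): each surface form gets its own entry, and the network has to learn on its own that the entries are related.

```python
# Toy vocabulary: without lower-casing, "Can" and "can" are two distinct
# entries, each with its own embedding that the network learns independently.
vocab = {}

def token_id(token):
    # Assign a new id the first time a surface form is seen.
    return vocab.setdefault(token, len(vocab))

print(token_id("can"), token_id("Can"))    # two different ids
print(token_id("can") == token_id("Can"))  # False
```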

I hope I've answered your questions, please let me know if I can help you further!

Cheers, Davide

mzeidhassan commented 6 years ago

Thanks @davidecaroselli for your answer.

I thought it was part of the tokenization and normalization process to at least lower-case all words. I think this explains why the difference occurs depending on whether the first letter is upper- or lower-case.

I need to do some more testing with a normalized dataset and see if there is a difference in terms of quality. I will report back here when I have some free time. I am closing this issue for now.

davidecaroselli commented 6 years ago

Because we don't have a re-caser model, we decided to keep the casing as it is and let the NN understand word similarities.

Thanks for your support as always and let us know if you find some interesting results from your tests.

Best, Davide

mzeidhassan commented 6 years ago

Thanks @davidecaroselli! Could you use the Moses re-caser (https://github.com/moses-smt/mosesdecoder/tree/master/scripts/recaser)? It seems that truecaser.perl is being used in some other NMT systems. Please see here

https://marian-nmt.github.io/examples/mtm2017/intro/

and here http://homepages.inf.ed.ac.uk/rsennric/nmt-lab-eacl2017.pdf

It's also interesting what this research paper mentions about the recaser:

http://blog.systransoft.com/wp-content/uploads/2016/10/SystranNMTReport.pdf
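As far as I understand, a truecaser simply learns the most frequent casing of each word from the training corpus and applies it as a pre-processing step. A rough sketch of the idea (not the actual Moses scripts):

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    # Count the surface forms of each word, skipping sentence-initial tokens
    # whose capitalization is not informative.
    counts = defaultdict(Counter)
    for sent in sentences:
        for tok in sent.split()[1:]:
            counts[tok.lower()][tok] += 1
    # Keep the most frequent surface form per word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(sentence, model):
    # Replace each token with its learned "true" casing, if known.
    return " ".join(model.get(tok.lower(), tok) for tok in sentence.split())

model = train_truecaser(["I think Paris is nice", "We visited Paris in May"])
print(truecase("PARIS is nice", model))  # -> "Paris is nice"
```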