modernmt / modernmt

Neural Adaptive Machine Translation that adapts to context and learns from corrections.
http://www.modernmt.eu/
Apache License 2.0

written-out numbers are shown as ? #387

Closed mzeidhassan closed 6 years ago

mzeidhassan commented 6 years ago

Hi @davidecaroselli ,

I am testing an English>Japanese engine and I found out that all written-out numbers like 'one', 'two', 'three' are shown as question marks in translation.

Here are a couple of examples:

1) EN: The Sales Infolet page holds up to six infolets.
JA: 「営業インフォレット」ページでは、最大 ? つのインフォレットが保持されます。

2) EN: Navigator is represented by an icon with four parallel white lines on in the upper left corner of the Home Page.
JA: ナビゲータは、ホーム・ページの左上隅にある ? つの並行行を含むアイコンで表されます。

3) EN: Once enabled, you can delete existing analyses, and add new analyses for a total of six per page.
JA: 有効にすると、既存の分析を削除し、新規分析を ? ページ当たり合計 ? 個に追加できます。

Moreover, I added the corrected translations of these strings to the training data and iterated on the engine using these fixed translations. Although they are "100%" matches, the numbers are still not rendered correctly.

By the way, I am using version 2.3 of MMT.

Also, many digits are still converted to question marks as well. I believe this is caused by "NumericPlaceholderNormalizer.java". For some reason, you decided to replace all numbers with zeros and this could be causing this issue. Any idea?

BTW, are these issues fixed in the latest release (2.5)?

Thanks in advance for your support!

Mohamed

davidecaroselli commented 6 years ago

Hi @mzeidhassan

Short answer: this has been fixed in release v2.5, but requires a complete re-training.

Long answer: as you correctly found out, this was due to the numbers processor component. We thought that masking numbers and then replacing them after translation would lead to better quality thanks to reduced data sparsity. That was partially true, and especially true for small/medium training sets.

But we soon realized that in a real-case scenario, with a very large training set, this was not the case. Moreover, cases like yours ("one" translated as 1) were impossible for the numbers post-processor to guess.

Eventually we decided to remove this processing step and leave numbers as they are for the training. This change has been pushed into version 2.5, but for this reason, it requires a complete re-training.

Best, Davide
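To make the failure mode concrete, here is a minimal sketch of digit masking and restoration. It is not MMT's `NumericPlaceholderNormalizer` code: the function names, the positional restore strategy, and the `?` fallback are assumptions chosen to mirror the behaviour described in this thread.

```python
import re

DIGIT = re.compile(r"\d")

def mask_numbers(text):
    """Replace every digit with '0'; return the masked text and the original digits in order."""
    digits = DIGIT.findall(text)
    return DIGIT.sub("0", text), digits

def restore_numbers(masked_translation, source_digits):
    """Fill each digit slot in the translation with the next source digit.

    When the source has no digit left to offer, the slot becomes '?'
    (assumed fallback, matching the output reported in this issue).
    """
    remaining = iter(source_digits)
    return DIGIT.sub(lambda _m: next(remaining, "?"), masked_translation)

# Source with a digit: the slot can be filled.
masked, digits = mask_numbers("holds up to 6 infolets")
print(masked, digits)
print(restore_numbers("最大 0 つ", digits))

# Source with a written-out number: there is no digit to restore.
masked, digits = mask_numbers("holds up to six infolets")
print(restore_numbers("最大 0 つ", digits))
```

In the second case the source contains no digits, so if the model produces a masked `0` for "six", the post-processor has nothing to put back. This is why removing the masking step, rather than adding corrected translations, was needed to fix the problem.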

mzeidhassan commented 6 years ago

Hi @davidecaroselli ,

This is very good news. I am training an EN>JA engine now with 2.5 and I will post the results here.

On a different note, is there a way to skip a particular processing step? If yes, what processes can be skipped and how?

Thanks, Mohamed

davidecaroselli commented 6 years ago

Hi @mzeidhassan

In the files preprocessor-default.xml and postprocessor-default.xml you'll find all pre/post processing steps performed during training and translation.

Just edit those files accordingly to enable or disable processing steps. Please remember that you'll need to rebuild the src folder for the changes to take effect.
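If you prefer to script that edit, the sketch below removes every XML element that mentions a given step name. The layout of preprocessor-default.xml is not shown in this thread, so the `<pipeline>`/`<processor>` elements and the `example.` class names are purely hypothetical; check the real file before using anything like this.

```python
import xml.etree.ElementTree as ET

def drop_steps(xml_text, needle):
    """Remove every element whose text or attribute values contain `needle`.

    Returns the rewritten XML and the number of elements removed.
    Schema-agnostic on purpose, because the real file layout is an assumption.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            in_text = needle in (child.text or "")
            in_attrs = any(needle in value for value in child.attrib.values())
            if in_text or in_attrs:
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed

# Hypothetical configuration, for illustration only.
sample = """<pipeline>
  <processor>example.TokenizerStep</processor>
  <processor>example.NumericPlaceholderNormalizer</processor>
</pipeline>"""

updated, count = drop_steps(sample, "NumericPlaceholderNormalizer")
print(count)
print(updated)
```

Note that `xml.etree.ElementTree` drops XML comments when parsing, so for a hand-maintained file a manual edit may be the safer choice. Either way, the rebuild mentioned above is still required.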

mzeidhassan commented 6 years ago

Hi @davidecaroselli ,

Just to confirm that so far, numbers are shown correctly in the 2.5 engine 👍 This is great.

The URLs are not preserved though.

For example:

http://www.ktaraghi.blogspot.com/2014/09/what-is-glusterfs-filesystem-and-how.html

is becoming:

http://www.kdtaghihances.com/2014/014/hritech.com/2014/h&husthoo.html

davidecaroselli commented 6 years ago

Hi @mzeidhassan

Glad it worked! URLs are now kept together as a single unit; the problem here is that the engine decided to "translate" one into the other. If you want to be sure that the string is left untouched, you need to use the DoNotTranslate placeholder that will be introduced with the new release (issue #385).
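Until that feature is available, a common workaround is to swap URLs for opaque tokens before translation and put them back afterwards. The sketch below shows the general technique only; it is not the DoNotTranslate implementation from issue #385, and the `__URLn__` token format and function names are assumptions.

```python
import re

URL = re.compile(r"https?://\S+")
TOKEN = re.compile(r"__URL(\d+)__")

def protect_urls(text):
    """Replace each URL with a numbered token; return the new text and the saved URLs."""
    urls = []

    def stash(match):
        urls.append(match.group(0))
        return f"__URL{len(urls) - 1}__"

    return URL.sub(stash, text), urls

def restore_urls(text, urls):
    """Put the saved URLs back in place of their tokens."""
    return TOKEN.sub(lambda m: urls[int(m.group(1))], text)

source = "See http://www.ktaraghi.blogspot.com/2014/09/what-is-glusterfs-filesystem-and-how.html for details"
protected, saved = protect_urls(source)
print(protected)   # See __URL0__ for details

# ... translate `protected` with the engine here ...

print(restore_urls(protected, saved) == source)   # True
```

This only helps if the engine passes the token through unchanged, which is not guaranteed for an arbitrary model. The simple pattern also treats trailing punctuation as part of the URL, so it may need tightening for real text.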

davidecaroselli commented 6 years ago

Hi @mzeidhassan

Since the main issue, "written-out numbers are shown as ?", is solved, I'm closing this issue. Regarding URL handling, as I suggested, look at issue #385, because URLs are a great example of how you should use "DoNotTranslate" tokens.

As always, if you have any other questions, please don't hesitate to re-open this issue.

Best, Davide