microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

Question about data processing in Unsupervised NMT #171

Open ElliottYan opened 3 years ago

ElliottYan commented 3 years ago

Hi, thanks for sharing your code.

I'm currently trying to reproduce your results on unsupervised NMT. I noticed you mentioned filtering out tokenized sentences with more than 175 tokens. However, I couldn't find any code in your data processing script get-data-nmt.sh that does this.
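For reference, a length filter like the one described could be a single awk step in the data pipeline. This is my own sketch of what such a step might look like, not code from get-data-nmt.sh (the file names here are illustrative):

```shell
# Hypothetical length filter: keep only sentences with <= 175
# whitespace-separated tokens. Demo input: one short line and
# one 200-token line; only the short line should survive.
printf '%s\n' "a short sentence" "$(yes w | head -200 | tr '\n' ' ')" > demo.tok
awk 'NF <= 175' demo.tok > demo.tok.filtered
wc -l < demo.tok.filtered
```

If a filter like this is applied before BPE, the 175-token limit would be on word tokens rather than sub-tokens, which may explain the different thresholds.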

Can you confirm that the data script is up-to-date?

Also, I'm using the pretraining script you provided in other issues. I found that the loader in your code removes long sequences, with the limit set to 100 sub-tokens by default. Is this where sequences longer than 175 tokens get filtered out?
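To make sure I understand the loader behavior I'm describing: I assume it does something roughly equivalent to the sketch below, dropping any example whose sub-token count exceeds a `max_len` limit (the function name and signature here are hypothetical, not the actual MASS code):

```python
def remove_long_sequences(sentences, max_len=100):
    """Keep only sentences with at most max_len sub-tokens.

    Hypothetical sketch of loader-side filtering; sentences are
    assumed to be whitespace-joined BPE sub-tokens.
    """
    return [s for s in sentences if len(s.split()) <= max_len]

data = ["a short example", " ".join(["tok@@"] * 150)]
print(remove_long_sequences(data))  # only the short example survives
```

If this is the only filtering applied, the effective limit during pretraining would be 100 sub-tokens, not 175 tokens, which is what prompted my question.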

Looking forward to your reply. Thanks!