Hi, thanks for sharing your code.

I'm currently trying to reproduce your unsupervised NMT results. You mentioned filtering out tokenized sentences with more than 175 tokens, but I couldn't find any code in your data processing script get-data-nmt.sh that does this.
Can you confirm that the data script is up to date?

Also, I'm using the pretraining script you provided in some issues. I noticed that the data loader in your code removes long sequences, with the limit set to 100 sub-tokens by default. Is this where sequences longer than 175 tokens are filtered out?
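For reference, this is the kind of length filter I expected to find in get-data-nmt.sh — a hypothetical sketch, assuming plain whitespace-tokenized files (the file names and the MAX_LEN variable are my own, not from the repo):

```shell
# Hypothetical length filter: drop any line whose whitespace-separated
# token count exceeds MAX_LEN. File names here are illustrative only.
MAX_LEN=175

# Build a small sample: one short line, and one line with 200 tokens.
printf 'short sentence\n%s\n' "$(yes tok | head -n 200 | tr '\n' ' ')" > sample.tok

# awk's NF is the number of fields (tokens) on the current line;
# a bare pattern with no action prints the line when the pattern is true.
awk -v max="$MAX_LEN" 'NF <= max' sample.tok > sample.tok.filtered
```

After this, sample.tok.filtered keeps only the short line; the 200-token line is dropped.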
Looking forward to your reply. Thanks!