Hi @qute012 ,
That notebook is mainly for checking the test sets. The stats of the training set are reported from the json file directly, same as the test sets (i.e., before preprocessing). So the numbers might differ from the ones actually used in training (due to length constraints, etc.).
Best, Rui
Thank you for the reply 😊
But the stats in this notebook don't take document length into account. I'm not sure whether the length limit is based on tokens or characters, but if documents longer than 512 are dropped, that would discard more than half of the data. Do you also drop documents from the test sets?
No, I didn't do any extra processing on the test data; it is ground truth and used as-is. I don't know exactly how much data is dropped during training, but it shouldn't be a large proportion.
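If you want to check for yourself, something like the sketch below would estimate the fraction of documents over a token limit. This is not code from the repo, just a rough check on the raw json; the field names (`title`, `abstract`) and whitespace tokenization are assumptions, so adjust them to the actual data format.

```python
# Minimal sketch (not from the repo): estimate how many documents in a
# JSON-lines keyphrase file exceed a whitespace-token length limit.
# Field names ("title", "abstract") are assumed; adjust to the actual json.
import json

def fraction_over_limit(jsonl_path, max_tokens=512):
    total, over = 0, 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            text = ex.get("title", "") + " " + ex.get("abstract", "")
            total += 1
            if len(text.split()) > max_tokens:
                over += 1
    return over / total if total else 0.0

# e.g. print(fraction_over_limit("kp20k_train.json", max_tokens=512))
```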
Hi~ @memray.
I'm confused about something in the preprocessing. Because of the max document length, the statistics in this notebook differ from the statistics actually seen in the training and evaluation phases. Is it intended that the dataset statistics before preprocessing differ from those after preprocessing? If so, the real post-preprocessing statistics should also be reported.
This function drops documents longer than the max length: https://github.com/memray/OpenNMT-kpg-release/blob/21d0a0081d3b7d9d740dd9f2b49e0f540297e278/preprocess_kp.py#L59
And the inputter implementation: https://github.com/memray/OpenNMT-kpg-release/blob/21d0a0081d3b7d9d740dd9f2b49e0f540297e278/onmt/inputters/inputter.py#L600
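For reference, the effect of that filter is roughly equivalent to a predicate like the one below. This is just a paraphrase of what a length filter does, not the actual implementation; the parameter names echo OpenNMT's `src_seq_length` / `tgt_seq_length` options, but the function itself is illustrative.

```python
# Rough sketch of a preprocessing-time length filter (illustrative only;
# see the linked preprocess_kp.py / inputter.py for the real code).
def keep_example(src_tokens, tgt_tokens, src_seq_length=512, tgt_seq_length=None):
    """Return True if the example survives the length constraints."""
    if len(src_tokens) > src_seq_length:
        return False
    if tgt_seq_length is not None and len(tgt_tokens) > tgt_seq_length:
        return False
    return True

# e.g. kept = [ex for ex in examples
#              if keep_example(ex["src"], ex["tgt"], src_seq_length=512)]
```

So documents over the limit are silently removed from the training data, which is exactly why the notebook's pre-preprocessing stats and the effective training-set stats can diverge.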