memray / OpenNMT-kpg-release

Keyphrase Generation
MIT License

Real train & eval dataset statistics #40

Closed dobby-seo closed 3 years ago

dobby-seo commented 3 years ago

Hi~ @memray.

I'm confused about something in the preprocessing. The statistics reported here differ from the data actually used in the training and evaluation phases, because of the maximum document length. Is it expected that the dataset statistics before preprocessing differ from those after preprocessing? If so, it would be helpful to report the real dataset statistics after preprocessing.

This function drops documents longer than the max length: https://github.com/memray/OpenNMT-kpg-release/blob/21d0a0081d3b7d9d740dd9f2b49e0f540297e278/preprocess_kp.py#L59

The inputter implementation: https://github.com/memray/OpenNMT-kpg-release/blob/21d0a0081d3b7d9d740dd9f2b49e0f540297e278/onmt/inputters/inputter.py#L600
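For reference, here is a minimal sketch (not the repo's actual preprocessing code) of how one could re-compute the source-length statistics after applying the same kind of length filter. The file path, the `title`/`abstract` field names, and whitespace tokenization are assumptions; they should be adjusted to the actual jsonl schema and the tokenizer used in preprocessing.

```python
import json

# Sketch: re-compute "real" training-set statistics after a max-length filter.
# The path, the "title"/"abstract" field names, and whitespace tokenization
# are assumptions, not the repo's actual preprocessing logic.
MAX_SRC_LEN = 512  # assumed token limit, mirroring the length constraint discussed above

def src_token_len(example):
    text = (example.get("title", "") + " " + example.get("abstract", "")).strip()
    return len(text.split())

kept, dropped, kept_tokens = 0, 0, 0
with open("kp20k_train.json", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        example = json.loads(line)
        n = src_token_len(example)
        if n > MAX_SRC_LEN:
            dropped += 1
        else:
            kept += 1
            kept_tokens += n

print(f"kept={kept} dropped={dropped} "
      f"avg_src_len_after_filter={kept_tokens / max(kept, 1):.1f}")
```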

memray commented 3 years ago

Hi @qute012 ,

That notebook is mainly for checking the test sets. The stats of the training set are reported directly from the json file, the same as for the test sets (i.e., before preprocessing). So the numbers might differ from the ones actually used in training (due to the length constraint, etc.).

Best, Rui

dobby-seo commented 3 years ago

Thank you for the reply 😊

But the stats in that notebook don't take document length into account. I'm also not sure whether the length limit is based on tokens or characters; if I drop documents longer than 512, more than half of them are discarded. Does that mean you don't drop any documents from the test sets?
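To make the tokens-vs-characters question concrete, here is a rough sketch that counts how many documents exceed 512 under either interpretation. Again, the field names, the whitespace tokenizer, and the file path are assumptions and should be swapped for the real schema and tokenizer.

```python
import json

# Sketch: compare token-based vs. character-based lengths against a 512 limit.
# "title"/"abstract" fields, whitespace tokenization, and the file path are assumptions.
LIMIT = 512
over_tokens = over_chars = total = 0
with open("kp20k_valid.json", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        example = json.loads(line)
        text = (example.get("title", "") + " " + example.get("abstract", "")).strip()
        total += 1
        over_tokens += len(text.split()) > LIMIT
        over_chars += len(text) > LIMIT

print(f"{over_tokens}/{total} documents exceed {LIMIT} tokens; "
      f"{over_chars}/{total} exceed {LIMIT} characters")
```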

memray commented 3 years ago

No, I didn't do any extra processing on the test data; it is ground truth and is used as-is. I don't know how much data is actually dropped during training, but it shouldn't be a large proportion.