mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

BERT eval set contains 60 empty articles #473

Open matthew-frank opened 3 years ago

matthew-frank commented 3 years ago

PR https://github.com/mlcommons/training/pull/435 contains a script, cleanup_scripts/separate_test_set.py, that randomly extracts articles from the training set for use as an evaluation set. A total of 10000 articles are extracted from the training set into the eval set. Unfortunately, there's a bug in seperate_test_set.py that causes 60 of the 10000 extracted articles to be empty.

The seperate_test_set.py script was later adopted in PR https://github.com/mlcommons/training/pull/470, as the file input_preprocessing/seperate_test_set.py, so it needs to be fixed there as well.

The problem is that on line 75 (in both PR 435 and PR 470) the boundaries between articles are found using Python's split('\n\n'). Because the file ends with the separator, this produces an empty entry at the end of the resulting list. When a random entry is then selected on line 79, there's roughly a 0.5% chance of selecting that last (empty) entry.
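A minimal reproduction of the split behavior described above (the text here is illustrative, not the actual Wikipedia dump the benchmark uses):

```python
# A file whose articles are separated by blank lines, and which ends
# with a trailing blank-line separator (as the benchmark's input does).
text = "article one\n\narticle two\n\n"

# str.split does not drop trailing separators, so the result ends
# with an empty string.
articles = text.split("\n\n")
print(articles)  # → ['article one', 'article two', '']
```

Selecting a random index over this list therefore occasionally lands on the empty final element.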

Two ways to fix the problem would be: (a) call pop() on the list produced on line 75, or (b) change num_articles to (num_articles - 1) on line 79, so that the last (empty) entry can't be selected.
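A quick sketch of both fixes, again using illustrative text and assuming the random selection works along the lines of random.randrange (the exact call in the script may differ):

```python
import random

text = "article one\n\narticle two\n\n"

# Fix (a): pop the trailing empty entry right after splitting.
articles_a = text.split("\n\n")
articles_a.pop()  # drops the final "" left by the trailing "\n\n"
assert "" not in articles_a

# Fix (b): keep the list as-is, but exclude the last index when sampling.
articles_b = text.split("\n\n")
num_articles = len(articles_b)
idx = random.randrange(num_articles - 1)  # can never pick the empty last entry
assert articles_b[idx] != ""
```

Either change guarantees no empty article can be drawn into the eval set.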

johntran-nv commented 1 year ago

@sgpyc what do you think?

itayhubara commented 1 year ago

The fix is straightforward, but recreating the eval set would require (1) updating Google Drive and (2) checking that the RCPs are not affected. Since this benchmark is pretty old, I think we should keep it as is and add this as a known bug to the documentation.

matthew-frank commented 1 year ago

Yes, unfortunately this was left unfixed in the churn before the v1.0 submission 1.5 years ago, so it is what it is, and it would not be productive to change the benchmark at this point. I'll submit a PR adding a short note to the bottom of the benchmark's README.md.