mlcommons / training

Reference implementations of MLPerf™ training benchmarks

https://mlcommons.org/en/groups/training

Apache License 2.0

Language model dataset preparation #641

Status: Closed (mahmoodn closed this issue 1 year ago)

mahmoodn commented 1 year ago

Hi, for the language model benchmark, the README file states:

Each of part-00xxx-of-00500 and eval.txt contains one sentence of an article in one line and different articles separated by blank line.

When I check the generated files, I see:

$ wc -l results/part-00000-of-00500
198379 results/part-00000-of-00500
$ wc -l results/eval.txt 
267566 results/eval.txt

I am confused by these line counts. Is this the correct output?
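
To check the files against the format the README describes (one sentence per line, articles separated by a blank line), a small script along these lines could help. This is only a sketch: the function name `count_articles_and_sentences` and the sample text are made up for illustration and are not part of the MLPerf reference code.

```python
from typing import Iterable, Tuple


def count_articles_and_sentences(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (articles, sentences) for text that has one sentence per
    line and a blank line between articles.

    Hypothetical helper for illustration only.
    """
    articles = 0
    sentences = 0
    in_article = False
    for line in lines:
        if line.strip():
            # A non-blank line is one sentence.
            sentences += 1
            if not in_article:
                # First sentence after a blank line starts a new article.
                articles += 1
                in_article = True
        else:
            # A blank line ends the current article.
            in_article = False
    return articles, sentences


# Tiny made-up sample in the format the README describes.
sample = [
    "First sentence of article one.\n",
    "Second sentence of article one.\n",
    "\n",
    "Only sentence of article two.\n",
]
print(count_articles_and_sentences(sample))  # (2, 3)
```

To run it on a real file, the same function can be given an open file handle, for example `open("results/part-00000-of-00500", encoding="utf-8")`. Note that `wc -l` counts every newline, including the blank separator lines, so its total should be roughly the number of sentences plus the number of blank lines rather than the number of sentences alone.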