k2-fsa / libriheavy

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
Apache License 2.0

Some questions about the datasets #8

Open fengshi-cherish opened 6 months ago

fengshi-cherish commented 6 months ago

What's the difference between test-clean and test-clean large (and the same question for test-other)?

pkufool commented 6 months ago

No difference, the large ones are just larger. We guarantee that the test subsets share no books or speakers with the training set, so we can't put them into the training set. We don't want to waste this part of the data, so we release it too, in case someone wants to test their models on a larger test set.
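
For anyone who wants to check the no-overlap property on their own copy of the data, here is a minimal sketch. It assumes the manifests can be read as JSON-lines files and that each record carries speaker and book identifiers under the keys `speaker` and `book`. Those key names and the helper names `read_jsonl` and `shared_values` are hypothetical, so adjust them to whatever the released manifests actually contain.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def shared_values(a: Iterable[dict], b: Iterable[dict], key: str) -> set:
    """Return the values of `key` that occur in both record collections."""
    values_a = {rec[key] for rec in a if key in rec}
    values_b = {rec[key] for rec in b if key in rec}
    return values_a & values_b


# Toy records standing in for manifest entries (hypothetical field names).
train = [{"speaker": "s1", "book": "b1"}, {"speaker": "s2", "book": "b2"}]
test = [{"speaker": "s3", "book": "b3"}]

# An empty set means the two collections share no speakers.
print(shared_values(train, test, "speaker"))
```

With real files, the toy lists could be replaced by `read_jsonl(...)` calls on the training and test manifests. An empty result for both the speaker key and the book key would be consistent with the guarantee described above.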

fengshi-cherish commented 6 months ago

So I only need to download all the JSON files in run.sh instead of using run_pipeline.sh? Does large.tar in run_pipeline.sh include both large.json (from run.sh) and test_clean_large.json? And does test_clean have no overlap with test_clean_large?