
tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial

Apache License 2.0


Data set partitioning #337

Closed: yapingzhao closed this issue 6 years ago

yapingzhao commented 6 years ago

Hi, I have a question about data set partitioning. If the parallel corpus has a total size of 160,000, how large should the training set, validation set and test set be? And if the parallel corpus has a total size of 1.2 million, how large should the training set, validation set and test set be?

Looking forward to your advice or answers. Best regards,

yapingzhao

luozhouyang commented 6 years ago

Typical train, dev and test split ratios are 0.7, 0.15 and 0.15.
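To make those ratios concrete for the two corpus sizes in the question, here is a minimal sketch in plain Python. The helper names `split_sizes` and `split_pairs` are made up for illustration and are not part of the nmt codebase. The sketch uses integer percentages rather than floats to avoid rounding surprises, and it assumes a simple random shuffle is acceptable for your data.

```python
import random


def split_sizes(total, train_pct=70, dev_pct=15):
    """Return (train, dev, test) sizes; the test set takes whatever remains."""
    train = total * train_pct // 100
    dev = total * dev_pct // 100
    test = total - train - dev
    return train, dev, test


def split_pairs(pairs, train_pct=70, dev_pct=15, seed=0):
    """Shuffle a list of (source, target) pairs and cut it into three parts."""
    shuffled = list(pairs)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train, n_dev, _ = split_sizes(len(shuffled), train_pct, dev_pct)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    for total in (160_000, 1_200_000):
        print(total, split_sizes(total))
```

With a 70/15/15 split, a corpus of 160,000 gives 112,000 / 24,000 / 24,000, and a corpus of 1.2 million gives 840,000 / 180,000 / 180,000. Source and target sentences are kept together as pairs so that shuffling cannot break their alignment.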