Split a whole data set into train dev and test

tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial

Apache License 2.0


Split a whole data set into train dev and test #142

Open mingfengwuye opened 7 years ago

mingfengwuye commented 7 years ago

If I have a whole data set, is there a tool that could split it into three parts: train, dev, and test? Thanks

ghost commented 7 years ago

You can always do it manually. Alternatively, you can use a package like sklearn to split your data into train, test, and evaluation (or dev) sets.
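For a translation corpus the main thing to watch is that source and target sentences stay aligned, so the pairs have to be shuffled together rather than file by file. Here is a minimal sketch of the manual route using only the Python standard library. The function name `split_parallel`, its parameters, and the toy sentences are made up for illustration and are not part of this repository; calling sklearn's `train_test_split` twice on the paired data should give a similar result.

```python
import random


def split_parallel(src_lines, tgt_lines, dev_size, test_size, seed=0):
    """Shuffle aligned sentence pairs and split them into train/dev/test.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if dev_size + test_size >= len(src_lines):
        raise ValueError("dev_size + test_size must be smaller than the data set")

    pairs = list(zip(src_lines, tgt_lines))
    # Shuffle the pairs, not the two lists separately, so every source
    # sentence stays attached to its translation.
    random.Random(seed).shuffle(pairs)

    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:]
    return train, dev, test


# Toy data standing in for a real corpus.
src = [f"source sentence {i}" for i in range(10)]
tgt = [f"target sentence {i}" for i in range(10)]

train, dev, test = split_parallel(src, tgt, dev_size=2, test_size=2)
print(len(train), len(dev), len(test))  # 6 2 2
```

Fixing the seed keeps the split reproducible, which matters if you want to compare models trained at different times on the same dev and test sets.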

kmario23 commented 7 years ago

@mingfengwuye I rely heavily on the command line and bash to do this. I know it's a cumbersome process. Maybe you can write a shell script to do it more quickly. Anyway, to use newer techniques like Subword-NMT, you have to preprocess the data accordingly, which I think can be done with a simple custom shell script.
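For anyone who would rather not script this in bash, the same file-level split can be sketched in Python. This is a stand-in for the shell approach, not the script described above: it keeps the corpus order and takes dev and test from the end of the files, much as a `head`/`tail` pipeline would. The function name `split_files`, the `corpus.src`/`corpus.tgt` file names, and the `train`/`dev`/`test` output prefixes are all assumptions; the prefix-plus-language-suffix naming is meant to resemble the layout this tutorial's data files appear to use.

```python
import tempfile
from pathlib import Path


def split_files(src_path, tgt_path, out_dir, dev_size, test_size):
    """Write train/dev/test files for both languages, keeping corpus order.

    Hypothetical helper: dev and test are taken from the end of the corpus.
    """
    src_path, tgt_path, out_dir = Path(src_path), Path(tgt_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(src_path, encoding="utf-8") as f:
        src_lines = f.readlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.readlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")

    n_train = len(src_lines) - dev_size - test_size
    if n_train <= 0:
        raise ValueError("dev_size + test_size must be smaller than the data set")

    bounds = {
        "train": (0, n_train),
        "dev": (n_train, n_train + dev_size),
        "test": (n_train + dev_size, len(src_lines)),
    }
    for name, (start, end) in bounds.items():
        # Use the same line range for both languages so pairs stay aligned.
        for lines, suffix in ((src_lines, src_path.suffix), (tgt_lines, tgt_path.suffix)):
            with open(out_dir / f"{name}{suffix}", "w", encoding="utf-8") as f:
                f.writelines(lines[start:end])
    return {name: end - start for name, (start, end) in bounds.items()}


# Toy corpus written to a temporary directory, just to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "corpus.src").write_text(
        "".join(f"source {i}\n" for i in range(10)), encoding="utf-8")
    (tmp / "corpus.tgt").write_text(
        "".join(f"target {i}\n" for i in range(10)), encoding="utf-8")
    counts = split_files(tmp / "corpus.src", tmp / "corpus.tgt", tmp / "out",
                         dev_size=2, test_size=2)
    print(counts)  # {'train': 6, 'dev': 2, 'test': 2}
```

Because this version does not shuffle, it only gives a fair split if the corpus is not ordered by topic, length, or source; otherwise shuffle the pairs first.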

oahziur commented 7 years ago

@mingfengwuye After you split your data set with either sklearn or bash utils, you can reference this script to see how tokenization and subword segmentation are handled with existing tools.

mingfengwuye commented 7 years ago

Thank you all!

ghost commented 7 years ago

You should close the thread!