Open mingfengwuye opened 7 years ago
You can always do it manually!
Alternatively, you can use packages like sklearn
to split your data into train, test, and evaluation (or dev) sets.
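For a three-way split with sklearn, a common trick is to call `train_test_split` twice. A minimal sketch (the 80/10/10 ratios and the toy `data` list are illustrative, not from the thread):

```python
# Two successive calls to train_test_split yield train/dev/test.
from sklearn.model_selection import train_test_split

data = list(range(100))  # stand-in for your sentences/examples

# First carve off 20% as a holdout, then split that holdout in half.
train, rest = train_test_split(data, test_size=0.2, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)

print(len(train), len(dev), len(test))  # 80 10 10
```

Fixing `random_state` makes the split reproducible, which matters if you re-run preprocessing later.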
@mingfengwuye I heavily use the command line and bash to do this. I know it's a cumbersome process; maybe you can write a shell script to do it more quickly. Anyway, for using newer techniques like Subword-NMT, you have to preprocess the data accordingly, which I think can be done with a simple custom shell script.
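A shell script like the one mentioned can be done with plain coreutils: shuffle once, then take slices by line count. A sketch, assuming a line-per-example file (the file names, the `seq`-generated stand-in corpus, and the 80/10/10 ratios are illustrative):

```shell
set -e
seq 1 100 > corpus.txt                 # stand-in corpus: one example per line

shuf corpus.txt > shuffled.txt         # shuffle once so all splits are disjoint

total=$(wc -l < shuffled.txt)
n_test=$((total / 10))                 # 10% test
n_dev=$((total / 10))                  # 10% dev
n_train=$((total - n_test - n_dev))    # remainder (~80%) for training

head -n "$n_train" shuffled.txt > train.txt
tail -n "$((n_dev + n_test))" shuffled.txt | head -n "$n_dev" > dev.txt
tail -n "$n_test" shuffled.txt > test.txt

wc -l train.txt dev.txt test.txt
```

Note that for parallel (source/target) corpora you would need to shuffle both files with the same permutation, e.g. by pasting them together first.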
@mingfengwuye After you split your data set with either sklearn
or bash utils, you can reference this script to see how tokenization and subword processing are done with existing tools.
Thank you all!
You should close the thread!
If I have a whole data set, is there a tool that could split it into three parts: train, dev, and test? Thanks