salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

How can I get the trainset, validationset and holdoutset that modelselectors use internally? #400

Closed liuxiaodong008008 closed 4 years ago

tovbinm commented 5 years ago

I don't believe we expose those datasets, but you can recreate them by applying the split and validationPrepare methods of a splitter/datacutter instance to your training dataset.

https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/stages/impl/tuning/Splitter.scala

liuxiaodong008008 commented 5 years ago

split returns (dataTrain, dataTest). validationPrepare does data balancing or dropping based on the labels before dataTrain is split into trainset and validationset. So how do I split dataTrain into trainset and validationset?

leahmcguire commented 5 years ago

If you set the seed on the splitter, the dataTest returned will be the holdout dataset and dataTrain will be the training and validation data. Separating the training and validation data for examination is not really possible, since that separation depends on the validation method used (e.g. cross validation or training split) and is only done internally within the model selector call.
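
To make the distinction concrete, here is a minimal sketch in plain Python. It is not TransmogrifAI or Spark code, and every name in it (holdout_split, k_fold, train_validation_split) is hypothetical, chosen only for illustration. It assumes a simple shuffle-based split and shows two things: a fixed seed makes the holdout slice reproducible, and the way the remaining data is divided into training and validation parts depends on the validation method.

```python
import random


def holdout_split(rows, reserve_fraction, seed):
    """Shuffle with a fixed seed, then reserve a slice as holdout.

    Hypothetical helper: returns (data_train, data_test).
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * reserve_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def k_fold(rows, k):
    """Yield one (train, validation) pair per fold (cross validation)."""
    for i in range(k):
        validation = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, validation


def train_validation_split(rows, train_ratio):
    """Produce a single (train, validation) pair."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


rows = list(range(20))

# Same seed, same holdout: the split can be recreated outside the model selector.
data_train, data_test = holdout_split(rows, reserve_fraction=0.2, seed=42)
again_train, again_test = holdout_split(rows, reserve_fraction=0.2, seed=42)

# Cross validation yields several train/validation pairs from data_train ...
folds = list(k_fold(data_train, k=4))

# ... while a single split yields exactly one pair.
single_train, single_validation = train_validation_split(data_train, train_ratio=0.75)
```

In this sketch, cross validation produces four different train/validation pairs from the same data_train, whereas the single split produces one, so there is no single "validation set" to hand back unless the validation method is fixed as well.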