salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning.
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

The effect of random seeds on results? #559

Closed shenzgang closed 5 months ago

shenzgang commented 3 years ago

When I run the Titanic tests, I get very different estimates each time. Does the random seed really have that much of an impact?

tovbinm commented 3 years ago

Yes, indeed. To get predictable behavior, set the random seed in your tests. Where you set it depends on how your tests are structured. For example: https://github.com/salesforce/TransmogrifAI/blob/master/helloworld/src/main/scala/com/salesforce/hw/titanic/OpTitanic.scala#L50
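To illustrate why a fixed seed makes results repeatable, here is a minimal sketch in plain Scala. It does not use TransmogrifAI at all: `SeededSplit` and `split` are hypothetical names, and the shuffle-then-split logic only stands in for whatever randomized steps (data splitting, cross-validation folds, model initialization) a real workflow performs.

```scala
import scala.util.Random

object SeededSplit {
  /** Shuffle `rows` with a generator built from `seed`, then cut the
    * shuffled rows into a training part and a test part. */
  def split[A](rows: Vector[A], trainFraction: Double, seed: Long): (Vector[A], Vector[A]) = {
    val rng = new Random(seed)          // same seed => same sequence of random numbers
    val shuffled = rng.shuffle(rows)    // the only random step in this sketch
    val nTrain = math.round(rows.size * trainFraction).toInt
    shuffled.splitAt(nTrain)
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 20).toVector
    val (trainA, testA) = split(rows, 0.8, seed = 42L)
    val (trainB, testB) = split(rows, 0.8, seed = 42L)
    // Two runs with the same seed produce identical splits.
    println(s"same seed, same split: ${trainA == trainB && testA == testB}")
  }
}
```

Without a fixed seed, each run draws a different split, so any model trained on it can report noticeably different metrics, especially on a small data set.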

shenzgang commented 3 years ago

Thanks for your reply! I also have a question about how to use the trained model to predict on unlabeled test data. Are there any examples of using a model for prediction?

tovbinm commented 3 years ago

You can save a trained model, then load it later, set a new scoring reader / a new input dataset, and finally compute scores by invoking score().

You can also use transmogrifai-local for online serving of your model (e.g., over an HTTP API).

shenzgang commented 3 years ago

The data set used for model training has a label column, but the test data does not. When I call score(), an exception is thrown.

leahmcguire commented 3 years ago

You will need to add an empty label column to the test data.
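One way to add such a column, sketched with the plain Spark DataFrame API and assuming the test data is loaded as a DataFrame with Spark on the classpath. `EmptyLabel`, `withEmptyLabel`, and the column name passed in are hypothetical, and the `DoubleType` cast is an assumption: the column type should match whatever type the label had in the training data.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DoubleType

object EmptyLabel {
  /** Return `df` with a null-valued column named `labelCol` appended.
    * If the column already exists, `df` is returned unchanged. */
  def withEmptyLabel(df: DataFrame, labelCol: String): DataFrame =
    if (df.columns.contains(labelCol)) df
    else df.withColumn(labelCol, lit(null).cast(DoubleType)) // assumed label type: double
}
```

For example, `EmptyLabel.withEmptyLabel(testDf, "survived")` would give a hypothetical `testDf` the same columns as the training data, with every label value left null, before it is passed to the model for scoring.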