Open RajarshiBhadra opened 6 years ago
@RajarshiBhadra Yes, that's a problem that can show up when training on large data sets. For our own use cases, the data sets aren't large enough to hit it, so we're not actively working on this, but we'd love to hear your opinion.
I tried adding a column that assigns each row to one of n partitions using a random number generator. Then, instead of using union, I filter on that column so that the nth group becomes the test data and the remaining (n-1) groups become the training data. It ran pretty fast. Let me know your thoughts about it.
@RajarshiBhadra do you have sample code?
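The fold-column approach described in the comment above could be sketched roughly as follows. This is a plain-Python illustration, not the code that was actually used and not Spark code; the helper names `assign_folds` and `split_by_fold`, the seed, and the toy data are all hypothetical. In PySpark the same idea would presumably be a `withColumn` call using `rand()` to build the fold column, followed by one `filter` per fold.

```python
import random


def assign_folds(rows, n_folds, seed=42):
    """Tag every row with a random fold id in [0, n_folds). Hypothetical helper."""
    rng = random.Random(seed)
    return [(rng.randrange(n_folds), row) for row in rows]


def split_by_fold(tagged_rows, test_fold):
    """Filter on the fold id: matching rows are test data, the rest are training data."""
    train = [row for fold, row in tagged_rows if fold != test_fold]
    test = [row for fold, row in tagged_rows if fold == test_fold]
    return train, test


# Toy data standing in for a large data set (assumption for illustration only).
rows = list(range(20))
n_folds = 4

# The fold column is computed once; each split is then just a filter, with no unions.
tagged = assign_folds(rows, n_folds)
splits = [split_by_fold(tagged, k) for k in range(n_folds)]
```

Note that purely random assignment like this does not by itself keep the class ratios equal across folds. A stratified variant would need to assign fold ids within each label group.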
Since the stratifier uses unionAll heavily, do you think we might run into speed issues on a large volume of data if the sampled training data is then processed further inside the cross validator?