interviewstreet / spark-stratifier

Stratified Cross Validator for Spark

Speed issue with unionAll #1

Open RajarshiBhadra opened 6 years ago

RajarshiBhadra commented 6 years ago

Since the stratifier uses unionAll heavily, do you think we might run into speed issues on large volumes of data if the sampled training data is processed further inside the cross validator?

justinsuen commented 6 years ago

@RajarshiBhadra Yes, that's actually one of the problems that comes up when training on large data sets. Our own use cases don't involve data sets large enough to hit it, so we're not actively working on this, but we'd love to hear your opinion.

RajarshiBhadra commented 6 years ago

I tried adding a column that splits the data into n partitions using a random number generator, and then, instead of union, used filtering to make the nth group the test data while the remaining (n-1) groups are the training data. It worked pretty fast. Let me know your thoughts about it.
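For illustration only (this is not the commenter's code), here is a minimal plain-Python sketch of the fold-column idea: tag every row once with a fold id, then obtain each train/test split by filtering on that id rather than unioning sampled pieces. The names `assign_folds` and `split` are hypothetical, and assigning folds separately within each label (to keep the splits stratified) is an assumption, since the comment does not say how the random column was generated. In PySpark the fold column would presumably be built with something like `pyspark.sql.functions.rand` and the split done with `DataFrame.filter`.

```python
import random
from collections import defaultdict


def assign_folds(labels, n_folds, seed=42):
    """Return one fold id in [0, n_folds) per row, balanced within each label.

    Hypothetical helper: rows of each label are shuffled and dealt out
    round-robin, so every fold receives a near-equal share of every class.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [0] * len(labels)
    for indices in by_label.values():
        rng.shuffle(indices)  # random order within the class
        for position, idx in enumerate(indices):
            folds[idx] = position % n_folds  # deal out round-robin
    return folds


def split(rows, folds, test_fold):
    """Filter rows into (train, test) using the precomputed fold column."""
    train = [row for row, fold in zip(rows, folds) if fold != test_fold]
    test = [row for row, fold in zip(rows, folds) if fold == test_fold]
    return train, test


if __name__ == "__main__":
    labels = [0] * 90 + [1] * 10          # imbalanced toy data
    rows = list(enumerate(labels))        # (row_id, label) pairs
    folds = assign_folds(labels, n_folds=5)

    for k in range(5):
        train, test = split(rows, folds, k)
        positives = sum(label for _, label in test)
        print(k, len(train), len(test), positives)
```

With this toy data every test fold holds 20 rows (18 of class 0 and 2 of class 1) and the matching training set holds the other 80. The fold column is computed once, and each of the n splits is then a single filter over the same data, which is the property the comment credits for the speed-up.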

guzzijones commented 4 years ago

@RajarshiBhadra do you have sample code?