Open zkaoudi opened 7 years ago
I think we need a bit more clarification in this issue: Do we need a shuffle or a shuffle+read operator?
Also, "reading sequentially" is rather ambiguous: each platform may or may not operate sequentially, or in any well-defined order, in the first place. Moreover, Rheem's channels currently have no notion of order. Finally, our conversion operators are not guaranteed to be order-aware as of now.
It's shuffle and read. Yes, "reading sequentially" is indeed tricky. The order does not really matter; what matters most is being able to read the entire dataset after a certain number of iterations. The meaning of "sequentially", e.g., for Spark, can be reading each partition sequentially based on its partition id and then reading the tuples of that partition sequentially.
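To illustrate, a deterministic "sequential" read over Spark-style partitions could look like the following sketch (plain Java with hypothetical names, not Spark or Rheem API): visit partitions in partition-id order, and within each partition read the tuples in their stored order.

```java
import java.util.*;

// Hypothetical sketch: a well-defined "sequential" read over partitioned data,
// i.e., partitions in partition-id order, tuples in stored order within each.
public class SequentialPartitionRead {
    public static <T> List<T> readSequentially(Map<Integer, List<T>> partitions) {
        List<T> result = new ArrayList<>();
        // Sort partition ids so the traversal order is well defined.
        List<Integer> ids = new ArrayList<>(partitions.keySet());
        Collections.sort(ids);
        for (int id : ids) {
            result.addAll(partitions.get(id)); // tuples read in stored order
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> partitions = new HashMap<>();
        partitions.put(1, Arrays.asList("c", "d"));
        partitions.put(0, Arrays.asList("a", "b"));
        System.out.println(readSequentially(partitions)); // [a, b, c, d]
    }
}
```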
And just using `dataQuanta.mapPartitions(...)` does not do the job when the order actually does not matter? And if so, why is that operation insufficient?
We don't want to read the entire dataset at once, but parts of it, iteration by iteration. For example, assume we have 100 tuples and a sample size of 10; then we want, over 10 iterations, to have accessed all the data by sampling them each time. A common way to do so is to randomly "order" the data once and then read them sequentially in batches of 10 per iteration.
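The scheme above can be sketched as follows (plain Java, hypothetical names; not Rheem API): shuffle the data once, then hand out disjoint fixed-size batches, so that after 100/10 = 10 iterations every tuple has been accessed exactly once.

```java
import java.util.*;

// Illustrative sketch: shuffle once, then serve disjoint sequential batches.
public class ShuffleOnceSampler {
    private final List<Integer> shuffled;
    private int offset = 0;

    public ShuffleOnceSampler(List<Integer> data, long seed) {
        this.shuffled = new ArrayList<>(data);
        Collections.shuffle(this.shuffled, new Random(seed)); // random "ordering", done once
    }

    // Return the next disjoint batch; empty once the dataset is exhausted.
    public List<Integer> nextBatch(int size) {
        int end = Math.min(offset + size, shuffled.size());
        List<Integer> batch = new ArrayList<>(shuffled.subList(offset, end));
        offset = end;
        return batch;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(i);
        ShuffleOnceSampler sampler = new ShuffleOnceSampler(data, 42L);
        Set<Integer> seen = new HashSet<>();
        for (int it = 0; it < 10; it++) {
            seen.addAll(sampler.nextBatch(10)); // batches are pairwise disjoint
        }
        System.out.println(seen.size()); // 100: all data accessed after 10 iterations
    }
}
```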
Aha, now I understand your point.
To me, this sounds like a rather specific problem. At the same time, having some kind of stateful iterations, where you read only a share of the data quanta per iteration, has considerable implications for Rheem's general model and, I guess, adds quite some complexity.
Therefore, it's worth a discussion on what level this issue should be addressed: (1) in the application code, (2) with dedicated execution operators, or (3) in Rheem's core execution model.
Generally, I am not in favor of (3), because (i) complicated execution semantics might not be supported by many platforms; (ii) it would make Rheem harder to extend; and (iii) it is a lot of coding effort. From that perspective, (1) is the preferable option.
I was thinking more about (2), i.e., simply adding new execution operators to provide this behaviour. We currently have something similar, e.g., in the SparkShufflePartitionSampleOperator.
I was also thinking of option (2), as it is not as complex as it looks and frees developers from implementing this themselves in their apps: a nice feature to have :)
Summing this up, this issue proposes a functionality akin to this snippet:

```scala
planBuilder.load(...)
  .samplingRepeat(
    initialModel, // represented as DataQuanta
    { (sample, model) =>
      // loop code...
      (updatedModel, convergenceCriterion)
    },
    convergenceDataQuanta => convergenceDataQuanta, // termination criterion on convergenceCriterion
    0.01 // sampling factor
  )
```
where `model` refers to some ML model to be learned and `sample` refers to a sample of the input data that is disjoint from all previous samples (unless the whole dataset has already been sampled). Is that right?
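For illustration, the intended semantics of such a `samplingRepeat` could be sketched as the following driver loop (plain Java with hypothetical names, not the actual Rheem API): repeatedly feed a disjoint sample plus the current model into the loop body until the convergence criterion it reports is satisfied.

```java
import java.util.*;
import java.util.function.*;

// Hypothetical sketch of samplingRepeat semantics: disjoint samples are fed to
// the loop body until the returned convergence criterion passes the test.
public class SamplingRepeatSketch {
    public static <T, M> M samplingRepeat(
            List<T> data,
            int sampleSize,
            M initialModel,
            BiFunction<List<T>, M, Map.Entry<M, Double>> loopBody, // -> (updatedModel, criterion)
            Predicate<Double> converged,
            long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed)); // shuffle once up front
        M model = initialModel;
        int offset = 0;
        while (true) {
            if (offset >= shuffled.size()) offset = 0; // dataset exhausted: start over
            List<T> sample = shuffled.subList(offset, Math.min(offset + sampleSize, shuffled.size()));
            offset += sampleSize;
            Map.Entry<M, Double> result = loopBody.apply(sample, model);
            model = result.getKey();
            if (converged.test(result.getValue())) return model;
        }
    }

    public static void main(String[] args) {
        // Toy usage: the "model" just counts how many tuples have been sampled.
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(i);
        Integer count = samplingRepeat(
            data, 10, 0,
            (sample, model) -> {
                int updated = model + sample.size();
                return new AbstractMap.SimpleEntry<>(updated, (double) updated);
            },
            criterion -> criterion >= 100.0, // stop once all 100 tuples were sampled
            42L);
        System.out.println(count); // 100
    }
}
```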
Very often in ML, people who use sampling want to do the following: shuffle the entire data and then start reading them sequentially. Once all data has been read, shuffle again and read sequentially, and so on. We should add such an operator.
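Such an operator could be sketched as follows (the class name `ShuffleThenSequentialSampler` is made up for illustration, not a Rheem class): it shuffles the data, serves sequential batches, and reshuffles whenever the dataset has been fully consumed.

```java
import java.util.*;

// Hypothetical operator sketch: shuffle, read sequentially in batches, and
// reshuffle once the dataset is exhausted (i.e., at each new "epoch").
public class ShuffleThenSequentialSampler<T> {
    private final List<T> data;
    private final Random random;
    private int offset = 0;

    public ShuffleThenSequentialSampler(List<T> data, long seed) {
        this.data = new ArrayList<>(data);
        this.random = new Random(seed);
        Collections.shuffle(this.data, this.random); // initial shuffle
    }

    public List<T> nextSample(int size) {
        if (offset >= data.size()) {              // all data read:
            Collections.shuffle(data, random);    // shuffle again...
            offset = 0;                           // ...and read sequentially anew
        }
        int end = Math.min(offset + size, data.size());
        List<T> sample = new ArrayList<>(data.subList(offset, end));
        offset = end;
        return sample;
    }

    public static void main(String[] args) {
        List<Integer> tuples = Arrays.asList(1, 2, 3, 4);
        ShuffleThenSequentialSampler<Integer> op = new ShuffleThenSequentialSampler<>(tuples, 7L);
        Set<Integer> firstEpoch = new HashSet<>();
        firstEpoch.addAll(op.nextSample(2));
        firstEpoch.addAll(op.nextSample(2));
        System.out.println(firstEpoch.size()); // 4: one pass covers all tuples
    }
}
```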