rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io

Global-shuffling sample method missing #9

Open zkaoudi opened 7 years ago

zkaoudi commented 7 years ago

Very often in ML, people who use sampling want to do the following: shuffle the entire dataset and then read it sequentially. Once all the data has been read, shuffle again, read sequentially, and so on. We should add such an operator.
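To make the requested access pattern concrete, here is a minimal, platform-agnostic Scala sketch (the ShuffledBatchReader class and its in-memory implementation are illustrative only, not an existing Rheem operator):

import scala.util.Random

// Illustrative only: models the requested access pattern in memory.
// Shuffle once, hand out consecutive batches, reshuffle when exhausted.
class ShuffledBatchReader[T](data: IndexedSeq[T], batchSize: Int, rng: Random = new Random()) {
  private var order: IndexedSeq[T] = rng.shuffle(data)
  private var cursor: Int = 0

  def nextBatch(): IndexedSeq[T] = {
    if (cursor >= order.length) {   // all data has been read once
      order = rng.shuffle(data)     // shuffle again ...
      cursor = 0                    // ... and start reading from the beginning
    }
    val batch = order.slice(cursor, cursor + batchSize)
    cursor += batchSize
    batch
  }
}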

sekruse commented 7 years ago

I think we need a bit more clarification in this issue: Do we need a shuffle or a shuffle+read operator?

Also, "reading sequentially" is rather ambiguous: Each platform might or might not operate sequentially or in a well-defined order in the first place. Also, Rheem's channels currently do not have a notion of order. Finally, our conversion operators are not guaranteed to be order-aware as of now.

zkaoudi commented 7 years ago

It's shuffle and read. Yes, "reading sequentially" is indeed tricky. The order does not really matter; what is most important is to be able to have read the entire dataset after a certain number of iterations. The meaning of "sequentially", e.g., for Spark, can be reading the partitions sequentially in order of their partition IDs and then reading the tuples within each partition sequentially.
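For illustration only, a hedged Spark-level sketch of that idea (plain RDD API, not Rheem operators; numBatches, currentBatch, and batchForIteration are made-up names for the sketch): shuffle the data once into a fixed number of partitions, then consume exactly one partition per iteration by its index.

import org.apache.spark.rdd.RDD

// Illustrative Spark-level sketch: read only the partition selected for this iteration.
def batchForIteration[T: scala.reflect.ClassTag](shuffled: RDD[T], iteration: Int, numBatches: Int): RDD[T] = {
  val currentBatch = iteration % numBatches  // which partition to read this iteration
  shuffled.mapPartitionsWithIndex { (partitionId, tuples) =>
    if (partitionId == currentBatch) tuples else Iterator.empty
  }
}

// val shuffled = inputRdd.repartition(numBatches) // shuffles the data across numBatches partitions
// val sample   = batchForIteration(shuffled, iteration, numBatches)

Note that repartition spreads tuples roughly evenly but does not produce a true random permutation within partitions; it is only meant to convey the access pattern.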

sekruse commented 7 years ago

And just using dataQuanta.mapPartitions(...) does not do the job when the order actually does not matter? If not, why is that operation insufficient?

zkaoudi commented 7 years ago

We don't want to read the entire dataset at once, but rather parts of it, iteration by iteration. For example, assume we have 100 tuples and a sample size of 10; then we want to have accessed all the data within 10 iterations by drawing one sample per iteration. A common way to do so is to randomly "order" the data once and then read it sequentially in batches of 10 per iteration.
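As a quick sanity check of that arithmetic in plain Scala (the numbers are taken from the example above): shuffling the 100 tuples once and slicing consecutive batches of 10 gives disjoint samples whose union covers the whole dataset after 10 iterations.

import scala.util.Random

val data = (1 to 100).toVector        // 100 tuples
val sampleSize = 10
val shuffled = Random.shuffle(data)   // random "ordering", done once

// Batch i reads positions [i * sampleSize, (i + 1) * sampleSize) of the shuffled order.
val batches = (0 until data.size / sampleSize).map { i =>
  shuffled.slice(i * sampleSize, (i + 1) * sampleSize)
}

assert(batches.flatten.toSet == data.toSet)  // after 10 iterations all data was accessed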

sekruse commented 7 years ago

Aha, now I understand your point.

To me this sounds like a rather specific problem. At the same time, having some kind of stateful iteration in which you read only a share of the data quanta has considerable implications for Rheem's general model and, I suspect, adds quite some complexity.

Therefore, it's worth discussing at what level this issue should be addressed:

  1. Application level. One could, e.g., assign data quanta some partition ID and filter the data quanta with the appropriate partition ID in each iteration. An outer loop (not yet supported) might then re-assign partition IDs once this inner loop is done.
  2. Operator level. We might think of adding a complex operator to realize this behavior. Another possibility would be to add stateful iteration operators or iteration-aware operators and data structures.
  3. Core level. We could make Rheem's execution model more complex to support such tasks.

Generally, I am not in favor of (3) because (i) complicated execution semantics might not be supported by many platforms; (ii) it will be harder to extend Rheem; and (iii) it is a lot of coding effort. From that perspective, (1) is the preferable option.
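For illustration, a rough sketch of option (1) in plain Scala, outside of any Rheem operators (inputData, numBatches, and sampleForIteration are made-up names for the sketch): tag each data quantum once with a random batch ID and filter on the current ID in every iteration of the inner loop. Note that random tagging yields only approximately equal batch sizes.

import scala.util.Random

// Illustrative only: application-level realization of option (1).
// Assign every data quantum a random batch ID once, then filter per iteration.
val inputData: Seq[String] = (1 to 100).map(i => s"quantum-$i")
val numBatches = 10
val tagged: Seq[(Int, String)] = inputData.map(q => (Random.nextInt(numBatches), q))

def sampleForIteration(iteration: Int): Seq[String] = {
  val currentBatch = iteration % numBatches  // batch ID selected by this iteration
  tagged.collect { case (id, q) if id == currentBatch => q }
}

// An outer loop could re-assign the random IDs once all batches have been consumed.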

zkaoudi commented 7 years ago

I was thinking more about (2), i.e., simply adding new execution operators to provide this behaviour. We currently have something similar, e.g., in the SparkShufflePartitionSampleOperator.

JorgeQuiane commented 7 years ago

I was also thinking of option (2), as it is not as complex as it looks and frees developers from having to implement this in their apps: a nice feature to have :)

sekruse commented 7 years ago

Summing this up, this issue proposes functionality akin to the following snippet:

planBuilder.load(...)
  .samplingRepeat(
    initialModel, // represented as DataQuanta
    { (sample, model) =>
      // loop code...
      (updatedModel, convergenceCriterion)
    },
    convergenceDataQuanta => convergenceDataQuanta, // termination criterion on convergenceCriterion
    0.01 // sampling factor
  )

where model refers to some ML model to be learned and sample refers to a sample of the input data that is disjoint from all previous samples (unless the whole dataset has already been sampled). Is that right?