twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.49k stars 705 forks source link

Pivoting/Bucketing Sink[T] / Source[T] #654

Open johnynek opened 10 years ago

johnynek commented 10 years ago

I want this:

PivotStore[T,T1,T2](kbij: Injection[T, (T1, T2)],
  pathFn: Injection[T1, Path],
  encoder: Injection[T2, Array[Byte]]) {

  def allBuckets: Iterable[T1]
  def allKeys: TypedSource[T]
  def forPrefixes(ks: Set[T1]): TypedSource[T]
  def toSink: TypedSink[T]
  def where(pred: (T1) => Boolean) = forPrefixes(allBuckets.filter(pred).toSet)
}

The types tell the story: use TemplateTap to bucket T into (T1 prefixed on T2). Write T1 in the Path, write T2 in the on-disk partition.

Could be used where a Key has a time, and we can trim out some of it. Could be used to bucket by time and (user/client/engagement) type. Would be a gigantic perf win to avoid reading all the data on disk.

So huge....

johnynek commented 10 years ago

@azymnis Don't you want to implement this? Could you imagine how cool this would be to unify our approach to DateRange + key-based HDFS bucketing?