twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.48k stars 703 forks source link

[Proposal] Support more sinks/sources in scalding-spark #1988

Open daniel-sudz opened 2 years ago

daniel-sudz commented 2 years ago

Is your feature request related to a problem? Please describe. Currently scalding-spark only supports a few scalding sources/sinks. I am interested in adding support for some more common use cases.

The current support is:

  /**
   * This has a mappings for some built in scalding sinks currently only WritableSequenceFile and TextLine are
   * supported
   *
   * users can add their own implementations and compose Resolvers using orElse
   */
  val Default: Resolver[Output, SparkSink] =
    new Resolver[Output, SparkSink] {
      def apply[A](i: Output[A]): Option[SparkSink[A]] =
        i match {
          case ws @ WritableSequenceFile(path, fields, sinkMode) =>
            Some(writableSequenceFile(path, ws.keyType, ws.valueType).asInstanceOf[SparkSink[A]])
          case tl: TextLine =>
            Some(textLine(tl.localPaths.head).asInstanceOf[SparkSink[A]])
          case _ =>
            None
        }
    }
}
johnynek commented 2 years ago

I think this is a good goal, but trying to make the current common inputs independent of cascading will be pretty hard...

Instead, I think you could probably make it work by admitting cascading to the classpath, but excluding Hadoop and just not triggering hadoop in the runtime, since you are only exercising equals and isInstanceOf to do these kinds of matches.