twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Configurable FileSource#pathIsGood Behavior #1570

Open jnievelt opened 8 years ago

jnievelt commented 8 years ago

We're looking to introduce some pathIsGood behavior like the following:

  1. Check the existing class's pathIsGood. If it fails, honor that failure
  2. If the class's pathIsGood passes, also make sure that each path either a) has a _SUCCESS file or b) does not have a _temporary directory

We'd like this behavior to apply to all our sources, especially as we'd like to make some Hadoop changes which lengthen the time between the first and final file completion. Given the variety of ways to create sources (several types, and some Source builders) the simplest thing would be to simply put the logic into FileSource.

Of course, if we always make these checks for every FileSource, there's a good chance that it will break someone (who uses/keeps _temporary directories, etc.). But maybe we could wrap this behavior in some new configuration option to be safe.

I'd like to hear what others think about this problem and about this approach (FileSource logic wrapped by a configuration option). Maybe there's some alternative way (that doesn't, e.g., require creating a parallel set of Source types/builders) to achieve this?

isnotinvain commented 8 years ago

Just for context, at least at Twitter we don't have a base source (something like TwitterSource?) that all our sources extend from. This means if we want to make a change that effects all our sources, we have to make a trait that we mix into each and every source (eg, SuccessFileSource). I would guess that other companies might be in the same position.

So it might be an option to add configuration to FileSource -- maybe not for twitter-specific business logic but for registering an abstract PathIsGood handler or something along those lines.

A similar question would be, how can a company make all their sources default to being SuccessFile sources?

johnynek commented 8 years ago

the source design is definitely the worst part of scalding.

To me, I see this as not really fundamentally fixing the problem.

Actually a Source only has non-abstract method: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Source.scala#L201

createTap

We could imagine adding a method like:

object TypedSource {
  def build[T: TupleConverter](fn: Mode => Tap[_, _, _]): TypedSource[T] with Mappable[T] = build[T](fn, {_ => Success(())})

  def build[T: TupleConverter](fn: Mode => Tap[_, _, _], validateFn: Mode => Try[Unit]): TypedSource[T] with Mappable[T] = new Source with Mappable[T] {
    override def validate(m: Mode): Unit = validateFn(m).get
    override def createTap(a: AccessMode)(implicit m: Mode) = a match {
      case Read => fn(m)
      case Write => new InvalidSourceTap(Nil)
  }
}

and similarly for TypedSink. Then we could just build normal combinators on these things. You could have:

case class SourceBuilder[T](fn: Mode => Tap[_, _, _], validate: Mode => Try[Unit], conv: TupleConverter[T])

and possibly just write combinators on SourceBuilder (to add validation, etc...)

Let's also note the work @isnotinvain did: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/package.scala#L39

to build composable PathFilters, this seems related. I'd love to see a new (but backwards compatible) way of building sources and sinks. It is certainly the ugliest part of scalding.

isnotinvain commented 8 years ago

Yeah, I would definitely love to come up with a new way of modeling / configuring typed sources in scalding. I think that's worth a discussion. In this particular case, there's also the question of "how do I change global defaults in my code base?"

Eg, even if we make very modular + compose-able sources, we still may need a hook for saying "and by default, add success file checks to all our sources." We can do it through a base trait that all sources in a repo extend from, or via config options, or someting, but it'd be nice to have an API for global defaults of some kind. As is, in the twitter codebase we have many many sources and a solution that involves going to each one and mixing in a new mixin is not going to be great (especially because even after the initial migration, users can forget to do this for new sources)