twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks

[WIP] introduce new scalding-quotation module #1754

Open fwbrasil opened 6 years ago

fwbrasil commented 6 years ago

Problem

Scalding could leverage more information about the user's code to apply optimizations and provide a better description of the jobs.

Solution

Introduce the new module scalding-quotation that provides a mechanism to quote the method parameters through an implicit parameter. Example:

def flatMap[U](f: T => TraversableOnce[U])(implicit q: Quoted): TypedPipe[U]

The quoted implicit is materialized by a macro that has access to the parameters used to invoke the method. This first version has three features:

  1. It provides the class file and line number information. It has the complete path to the source file, so we could use it to render links to the source code.
  2. It extracts the text of the method call, so users have more information on a transformation without having to open the source file.
  3. It detects projections and, if the source has projection support, it can apply them automatically.
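
To make the three features concrete, here is a minimal sketch of the kind of data a quotation could carry. The names Position, Quoted.text and Quoted.projections are assumptions made for illustration, not necessarily the module's actual API, and the value is written by hand rather than materialized by the macro.

```scala
object QuotedSketch {
  // Hypothetical shape: where the call happened (feature 1).
  final case class Position(file: String, line: Int)

  // Hypothetical shape: position, call text (feature 2) and detected projections (feature 3).
  final case class Quoted(
    position: Position,
    text: Option[String],
    projections: Set[String]
  )

  // What a macro might produce for `people.map(p => p.name)` written on line 38.
  // Built by hand here; the path is made up for the example.
  val example: Quoted = Quoted(
    Position("/repo/src/main/scala/SomeSourceFile.scala", 38),
    Some("map(p => p.name)"),
    Set("name")
  )
}
```

A source with projection support could read example.projections and load only the name field.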

Quoted functions

The macro has access only to the code under compilation. For instance, given this code:

def phone(p: Contact) = p.phone
personTypedPipe.map(p => phone(p.contact))

The macro doesn't know the tree of phone, so it must assume that all fields are used and thus the projection is p.contact instead of p.contact.phone. To address this limitation, the user can quote functions:

val phone = Quoted.function {
  (p: Contact) => p.phone
}
personTypedPipe.map(p => phone(p.contact))

Even though the macro doesn't have the projections of phone statically, it falls back to a runtime tree that checks if the function is quoted and determines the proper projection (p.contact.phone in this case).
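
The runtime fallback can be pictured roughly as follows. This is a simplified sketch, not the PR's implementation: QuotedFunction and projections are hypothetical names, and the real module works on trees rather than on plain strings.

```scala
object QuotedFunctionSketch {
  // Hypothetical wrapper: a function that also records which fields of its argument it reads.
  final class QuotedFunction[A, B](f: A => B, val fields: Set[String]) extends (A => B) {
    def apply(a: A): B = f(a)
  }

  // Runtime check: if the function is quoted, refine the projection with its fields;
  // otherwise assume every field under `prefix` is used.
  def projections(prefix: String, f: Any): Set[String] = f match {
    case q: QuotedFunction[_, _] => q.fields.map(field => s"$prefix.$field")
    case _                       => Set(prefix)
  }

  final case class Contact(phone: String, email: String)

  val quotedPhone = new QuotedFunction[Contact, String](_.phone, Set("phone"))
  val plainPhone: Contact => String = _.phone
}
```

With these definitions, projections("p.contact", quotedPhone) yields p.contact.phone, while the unquoted plainPhone yields only p.contact.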

Quoted propagation

The quotation must be done by the methods that are called by the user code. Any internal transformation done by scalding should use the initial quoted from the user call. For instance, this TypedPipe method:

def distinct(implicit ord: Ordering[_ >: T], m: Quoted): TypedPipe[T] =
    asKeys(ord.asInstanceOf[Ordering[T]], m).sum.keys

calls other quoted methods but it doesn't materialize new Quoted instances for them, reusing the one coming from the user call.
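
The propagation relies on ordinary implicit resolution: an implicit parameter is itself in implicit scope inside the method body. The toy Pipe below is a stand-in for illustration only (it is not scalding's TypedPipe, and collectEven is a made-up method) and shows an internal composition reusing the user's Quoted.

```scala
object PropagationSketch {
  // Stand-in for the real Quoted; it only carries the user's call text here.
  final case class Quoted(text: String)

  // Toy pipe that records which Quoted was attached to each step.
  final case class Pipe[T](values: List[T], steps: List[Quoted]) {
    def map[U](f: T => U)(implicit q: Quoted): Pipe[U] =
      Pipe(values.map(f), steps :+ q)

    def filter(f: T => Boolean)(implicit q: Quoted): Pipe[T] =
      Pipe(values.filter(f), steps :+ q)

    // Internal composition: `q` is in implicit scope, so both inner calls reuse it
    // and no new Quoted is created.
    def collectEven(f: T => Int)(implicit q: Quoted): Pipe[Int] =
      map(f).filter(_ % 2 == 0)
  }

  def run(): Pipe[Int] = {
    // In the real module the macro would materialize this at the user's call site.
    implicit val userCall: Quoted = Quoted("collectEven(_.length)")
    Pipe(List("ab", "abc"), Nil).collectEven(_.length)
  }
}
```

Both recorded steps carry the same Quoted, the one from the user-facing call.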

As a safety measure, the macro rejects scalding sources that try to materialize a Quoted since that's something that shouldn't happen in scalding internally. The compilation fails with the following error if a quotation is missing and it's a scalding source file:

The quotation must happen at the level of the user-facing API. Add an implicit q: Quoted to the enclosing method. If that's not possible and the transformation doesn't introduce projections, use Quoted.internal.

As the error message says, it's possible to use Quoted.internal as a workaround for internal transformations that don't produce projections:

def internalSum(t: TypedPipe[Int]) = {
    implicit val q: Quoted = Quoted.internal
    t.sum
}

The macro uses a simple whitelist to allow Quoted materialization for source file paths that have specific substrings ("test", "example", "tutorial" for now). It's a weak rule, but it seems good enough.
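
A sketch of such a substring rule, under stated assumptions: mayMaterialize is a hypothetical name, the match is assumed to be case-sensitive, and how the macro decides that a file belongs to scalding is not described here, so it is passed in as a flag.

```scala
object WhitelistSketch {
  // Substrings named in the description above.
  val allowed: List[String] = List("test", "example", "tutorial")

  // User code may always materialize a Quoted; scalding sources only if the path is whitelisted.
  def mayMaterialize(sourcePath: String, isScaldingSource: Boolean): Boolean =
    !isScaldingSource || allowed.exists(s => sourcePath.contains(s))
}
```

The weakness is easy to see with this sketch: a scalding source named Latest.scala would pass, because "Latest" contains "test".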

Enabling the automatic projection

The projection is done through an optimization phase that is disabled by default and can be enabled via the scalding.experimental.automatic_projection_pushdown flag. I added the experimental prefix so we can promote the feature later once it's stable.

Note that the automatic projection happens at an earlier phase than the manual projections configuration, so it won't have an effect if users are already using the manual projection pushdown.

New typed pipe description

The previous mechanism that provided descriptions based on stack traces will now output the information from Quoted instead. Example description:

SomeSourceFile:38 map(p => p.name)

Interfaces like DrScalding that render the descriptions will need rework. The description could be long since the user's code can have an arbitrary length.
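
For reference, a description in the format shown above could be assembled like this. It is only a sketch: describe is a hypothetical helper, and dropping the directory and the .scala extension is an assumption inferred from the example line.

```scala
object DescriptionSketch {
  // Builds "<file name without path or extension>:<line> <call text>".
  def describe(sourcePath: String, line: Int, callText: String): String = {
    val fileName = sourcePath.split('/').last.stripSuffix(".scala")
    s"$fileName:$line $callText"
  }
}
```

Since callText is the user's code verbatim, the result grows with the size of the quoted expression.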

Backwards compatibility

All user-facing methods need to take the extra implicit Quoted. This makes the changeset binary-incompatible, but it's mostly source-compatible: user code fails to compile only where a method call passes the implicit parameters explicitly. Example:

// TypedPipe method
def distinct(implicit ord: Ordering[_ >: T], m: Quoted): TypedPipe[T]

// will compile
aTypedPipe.distinct

// won't compile anymore
aTypedPipe.distinct(someOrdering)

// user will need to specify the quoted parameter
aTypedPipe.distinct(someOrdering, implicitly[Quoted]) 

Notes

fwbrasil commented 6 years ago

@johnynek I think it could be three PRs:

  1. add the new scalding-quotation module
  2. add the Quoted implicits
  3. introduce automatic projection (optimization rule and TypedSource changes)

@ttim @dieu @benpence wdyt?

isnotinvain commented 6 years ago

IDK if it makes sense to copy, but scalatest calls something similar a Position and many of its methods take an implicit Position. It seems the projection use case doesn't really fit in something called Position. I just worry that users are going to have no idea what a Quoted is, and wonder if we can come up with something clearer. Is there any value in separating positional info from introspection info (e.g. field accesses)? If it was called MethodIntrospection it might be clearer what it's for, though it's not a very nice name.

ttim commented 6 years ago

I have CallSiteInformation in mind, but I also think it's not the best name.

ttim commented 6 years ago

@fwbrasil I think it's a good idea to split it the way you suggest.

fwbrasil commented 6 years ago

@isnotinvain @ttim Quotation is a principled concept like Monoid, Monad, etc. I think it's just a question of having documentation

fwbrasil commented 6 years ago

@johnynek @ttim @dieu @isnotinvain I've just created the first PR: https://github.com/twitter/scalding/pull/1755
