googledatalab / pydatalab

Google Datalab Library
Apache License 2.0

Add python module reference in transformation section #644

Open rajivpb opened 6 years ago

rajivpb commented 6 years ago

Feedback from @lakshmanok (and this was also in our general longer-term roadmap)

The transformation section currently refers to a SQL query, but eventually we would like it to also be able to refer to a Python module. The scenario is as follows:

A user creates a bq pipeline that populates a table on a nightly basis, and then wants to produce daily reports or visualizations via an arbitrary plotting library, using logic defined in a previous notebook cell. It would be great if the pipeline config could somehow refer to that logic and make it happen. This is a little tricky, since the pipeline's DAG becomes more complicated.
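For context, the transformation section of a %%bq pipeline cell today points at a SQL query defined in a previous cell, roughly like this (field names are from memory and may not match the exact cell_body schema):

```
%%bq pipeline -n nightly_rollup_pipeline
input:
  table: my-project.logs.events
transformation:
  query: nightly_rollup   # a %%bq query defined in an earlier cell
output:
  table: my-project.reports.daily_summary
schedule:
  interval: '@daily'
```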

nikhilk commented 6 years ago

You could add an optional extra field to the transformation section with a Python class reference. The catch is that you'd also need functionality for authoring a code module in the notebook and packaging it for deployment.
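One possible shape for that extra field (the `python` key and the module path are hypothetical, not an existing option in the schema):

```yaml
transformation:
  query: nightly_rollup
  # hypothetical: a notebook-authored module.function, packaged and
  # shipped alongside the pipeline definition
  python: reports.make_daily_report
```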

rajivpb commented 6 years ago

This would actually be a transformation after execution of the current DAG. Given that, how would a Python class reference in the transformation section work? When Lak and I talked, we discussed a few possibilities:

  1. Have an additional transformation section after the 'output' section (in the cell_body). This would hold 'post-processing' logic (run after the bq-related steps have completed), and could include references to a previously defined Python module that implements the user's visualization logic. Of course, we'd also need to make the ordering of transformation sections in the YAML significant, which would require further nesting (and more complexity / cognitive load).

  2. (and this goes back to conversations we've had earlier about the interface) Enable users to define individual tasks in cells, and have them stitch these together via a pipeline cell (not %%bq pipeline, but just %%pipeline); see the sketch after this list.
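A rough sketch of what option 2 might look like (the %%pipeline magic, the task types, and the up_stream key are all hypothetical):

```
%%pipeline -n daily_report_pipeline
tasks:
  rollup:
    type: bq.execute
    query: nightly_rollup          # defined in a %%bq query cell
  visualize:
    type: python
    function: make_daily_report    # defined in a regular Python cell
    up_stream: [rollup]
```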

nikhilk commented 6 years ago

I was envisioning a transformation on the query result to produce the pipeline output. I don't think we should be adding a transformation to the output of the pipeline, because then by definition, the output is no longer the final output of the pipeline.

A function that takes a DataFrame in and returns a DataFrame out would be interesting. Conceptually this is equivalent to the JavaScript UDFs that BigQuery already accepts, but it would bring Python into the mix.
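A minimal sketch of what such a function could look like, assuming the query result is handed over as a pandas DataFrame (the function name and column names are made up for illustration):

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical Python 'UDF': receives the query result and returns
    the DataFrame that becomes the pipeline's output."""
    out = df.copy()
    # e.g. derive a per-user metric from the nightly rollup
    out['revenue_per_user'] = out['revenue'] / out['users']
    return out
```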

All this said, I think this would need more thought -- where does this Python code run? The Airflow worker isn't the best place to do actual work.

Adding unbounded flexibility, of course, turns this into a general pipeline use case. It would be worth thinking about whether there is anything meaningful we can take care of for the user here, versus the user simply writing their own Python code to define an arbitrary pipeline.

rajivpb commented 6 years ago

Understood, and it makes sense to me. Correct: where and how the Python code executes would need to be designed. CC: @Di-Ku @lakshmanok