tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.github.io/tfx/
Apache License 2.0

add notebook to visualize output of tfx pipeline steps for taxi pipeline #153

Closed Svendegroote91 closed 5 years ago

Svendegroote91 commented 5 years ago

Adding a notebook that uses the output files of the TFX pipeline steps in the taxi_pipeline example would help a lot in understanding what each step actually produces and how to visualize it.

zhitaoli commented 5 years ago

@rcrowe-google I think you said you want to tackle this? Please reassign if not.

rcrowe-google commented 5 years ago

Do the notebooks in the developer tutorial (https://www.tensorflow.org/tfx/tutorials/tfx/workshop) serve this need, or do we need more?

Svendegroote91 commented 5 years ago

Thanks, that already helps when using Airflow. However, I am currently running the Kubeflow pipeline example, and in that case the files are stored in a GCS bucket and the ML metadata lives in the MySQL instance of the KFP cluster. It would help to have a notebook that hooks into this information and performs the same kind of steps as in the workshop. It does not necessarily need to be a separate notebook; a README section on Kubeflow with the equivalents of

import os

import tfx_utils  # helper module used in the TFX taxi workshop notebooks

def _make_default_sqlite_uri(pipeline_name):
    # The Airflow example keeps ML Metadata in a local SQLite file per pipeline.
    return os.path.join(os.environ['HOME'], 'airflow/tfx/metadata', pipeline_name, 'metadata.db')

def get_metadata_store(pipeline_name):
    # Open a read-only view of the pipeline's metadata store.
    return tfx_utils.TFXReadonlyMetadataStore.from_sqlite_db(_make_default_sqlite_uri(pipeline_name))

would already make a big difference.
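For reference, the kind of thing I mean is roughly the sketch below, which talks to the KFP MySQL instance through the ml-metadata client directly. The host, port, database name, credentials and the helper name are placeholders for illustration only; the real values depend on how the cluster is set up.

import os

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

def get_kubeflow_metadata_store():
    # Placeholder connection settings -- adjust for your KFP deployment.
    connection_config = metadata_store_pb2.ConnectionConfig()
    connection_config.mysql.host = os.environ.get('MLMD_MYSQL_HOST', 'localhost')
    connection_config.mysql.port = 3306
    connection_config.mysql.database = 'metadb'  # placeholder database name
    connection_config.mysql.user = 'root'        # placeholder credentials
    connection_config.mysql.password = ''
    return metadata_store.MetadataStore(connection_config)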

Update: I came across this notebook in the pipelines repo, which seems like a fair starting point for connecting to the ML Metadata store: https://github.com/kubeflow/pipelines/blob/master/samples/tfx-oss/TFX%20Example.ipynb

It might be good to cross-reference it, as it will help people discover their results when using the Kubeflow orchestrator instead of Airflow.

zhitaoli commented 5 years ago

@Svendegroote91, unfortunately the Kubeflow-based pipeline implementation does not yet emit metadata into the database the way the Airflow example does. @neuromage is tracking that work.

Once the artifacts are tracked in the metadata store, I think we can simply adapt the notebook example Robert pointed to so that it also works for Kubeflow.

neuromage commented 5 years ago

@Svendegroote91 Pipelines tracks the ML metadata output by the pipeline, but we lack tracking of execution types (and therefore lineage tracking). As a result, the notebook visualizations won't work well in KFP right now. However, we are working on reaching parity with Airflow for all metadata tracking, and we expect to have something working by the end of June.

neuromage commented 5 years ago

Just wanted to update this issue and point out that Kubeflow Pipelines now tracks both execution and artifact metadata, so the visualizations should just work. Closing this issue, but feel free to re-open if there are any outstanding problems.
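For anyone who wants to verify this on their own cluster, a minimal sanity check against the metadata store looks roughly like the sketch below. The connection values are placeholders; adjust them for your KFP deployment.

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Placeholder connection settings for the KFP metadata MySQL instance.
config = metadata_store_pb2.ConnectionConfig()
config.mysql.host = 'localhost'   # placeholder
config.mysql.port = 3306
config.mysql.database = 'metadb'  # placeholder
config.mysql.user = 'root'        # placeholder
store = metadata_store.MetadataStore(config)

# Both artifact and execution metadata should now be present.
print('artifact types:', [t.name for t in store.get_artifact_types()])
print('execution types:', [t.name for t in store.get_execution_types()])
print('num artifacts:', len(store.get_artifacts()))
print('num executions:', len(store.get_executions()))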