tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.github.io/tfx/
Apache License 2.0

Kubeflow 1.0RC4 metadata config fails #1287

Closed valeriano-manassero closed 4 years ago

valeriano-manassero commented 4 years ago

Kubernetes: 1.15 Kubeflow: 1.0RC4 TFX: 0.21.0

While testing taxi_pipeline_kubeflow_local.py I got:

Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 371, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 345, in main
    _get_metadata_connection_config(kubeflow_metadata_config))
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 93, in _get_metadata_connection_config
    kubeflow_metadata_config.grpc_config)
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 110, in _get_grpc_metadata_connection_config
    kubeflow_metadata_config.grpc_service_host)
TypeError: None has type NoneType, but expected one of: bytes, unicode

In earlier TFX versions I hit the issues described in https://github.com/tensorflow/tfx/issues/1002. Now TFX fetches the metadata config via gRPC, but it is not getting the values it expects (the new Kubeflow version may also be involved).
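The traceback suggests that the gRPC host was resolved to None before it was assigned to a string field. Below is a minimal sketch of that failure mode and a guard against it. It assumes each config field carries either a literal value or the name of an environment variable; the ConfigValue class, the resolve helper and the EXAMPLE_GRPC_HOST variable are hypothetical stand-ins, not TFX code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ConfigValue:
    """Hypothetical stand-in for a config field: a literal or an env var name."""
    value: Optional[str] = None
    environment_variable: Optional[str] = None


def resolve(name: str, field: ConfigValue,
            env: Mapping[str, str] = os.environ) -> str:
    """Return the configured string, or raise a readable error instead of returning None."""
    if field.value is not None:
        return field.value
    resolved = None
    if field.environment_variable:
        # env.get returns None when the variable is missing from the container.
        resolved = env.get(field.environment_variable)
    if resolved is None:
        raise ValueError(
            f"{name} is unset: no literal value, and environment variable "
            f"{field.environment_variable!r} is not defined in this container")
    return resolved


# A literal value always resolves; an undefined environment variable fails loudly.
host = resolve("grpc_service_host", ConfigValue(value="metadata-grpc-service"))
try:
    resolve("grpc_service_host",
            ConfigValue(environment_variable="EXAMPLE_GRPC_HOST"), env={})
except ValueError as err:
    print(err)
```

Failing early with the field name makes the missing setting easier to spot than the protobuf TypeError above.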

numerology commented 4 years ago

Unfortunately that's a known issue. The full-fledged Kubeflow deployment does not have the right MLMD config for the gRPC connection that TFX 0.21.0 uses. There are two solutions to this issue:

  1. Can you try a standalone KFP deployment with version >= 0.2.1? This is the only thing you need to run a TFX pipeline if you do not use Kubeflow notebooks, Katib and so on. You can find deploy instruction here

  2. We can work out a kubeflow_metadata_config that works with the full-fledged Kubeflow deployment; that might take 1 or 2 days.

valeriano-manassero commented 4 years ago

Hi @numerology, and thanks for the answer. Unfortunately Katib is a requirement for this test deployment, so I can't avoid it. At the moment I'm not sure I have enough time to dive deep into the code and open a PR. I will try if you don't have an implementation before then.

nielsgroen commented 4 years ago

Does this block the use of tfx with Kubeflow Pipelines only on local clusters, or also on GCP etc.?

Could you perhaps give an indication of the priority of this issue? It would certainly help with decisions going forward on the use of tfx with Kubeflow and with considering possible alternatives. Many thanks!

lipinski commented 4 years ago

To solve the issue, you should change the configuration:

from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Start from the default config and override the MySQL and gRPC endpoints.
metadata_config = kubeflow_dag_runner.get_default_kubeflow_metadata_config()
metadata_config.mysql_db_service_host.value = 'mysql.kubeflow'
metadata_config.mysql_db_service_port.value = '3306'
metadata_config.mysql_db_name.value = 'metadb'
metadata_config.mysql_db_user.value = 'root'
metadata_config.mysql_db_password.value = ''
metadata_config.grpc_config.grpc_service_host.value = 'metadata-grpc-service'
metadata_config.grpc_config.grpc_service_port.value = '8080'

# Pass the overridden metadata config to the Kubeflow runner.
runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config
)

valeriano-manassero commented 4 years ago

I can confirm this workaround works for Kubeflow 1.0 on premises. Thanks!

valeriano-manassero commented 4 years ago

After some testing, it looks like the gRPC config alone is enough; at least I didn't notice any issues with this:

from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Only the gRPC endpoint is overridden; everything else keeps its default.
metadata_config = kubeflow_dag_runner.get_default_kubeflow_metadata_config()
metadata_config.grpc_config.grpc_service_host.value = 'metadata-grpc-service'
metadata_config.grpc_config.grpc_service_port.value = '8080'

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config
)

AlexandrePieroux commented 4 years ago

I'd like to add that if your pod is running in a different namespace, you need to append the namespace of the grpc backend to the grpc host name:

metadata_config = kubeflow_dag_runner.get_default_kubeflow_metadata_config()
metadata_config.grpc_config.grpc_service_host.value = 'metadata-grpc-service.kubeflow'
metadata_config.grpc_config.grpc_service_port.value = '8080'

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config
)

For instance.
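The namespace rule above can be captured in a small helper. This is only a sketch: qualified_service_host is a hypothetical name, and it assumes the short Kubernetes DNS form service.namespace is enough to reach the backend from another namespace.

```python
from typing import Optional


def qualified_service_host(service: str, namespace: Optional[str] = None) -> str:
    """Append the Kubernetes namespace to a service name when one is given."""
    # Leave the name alone if no namespace is given or it is already qualified.
    if not namespace or service.endswith("." + namespace):
        return service
    return f"{service}.{namespace}"


print(qualified_service_host("metadata-grpc-service", "kubeflow"))
# prints "metadata-grpc-service.kubeflow"
```

The result can then be assigned to metadata_config.grpc_config.grpc_service_host.value as in the snippet above.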
