tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.github.io/tfx/
Apache License 2.0

TFX taxi template KubeflowDagRunner Transform throws "ModuleNotFoundError" #1696

Closed: yantriks-edi-bice closed this issue 4 years ago

yantriks-edi-bice commented 4 years ago

I built a TFX pipeline based on the taxi template. All components run okay when using the BeamDagRunner. When using the KubeflowDagRunner, all components except Transform run okay. I build the pipeline in a Kubeflow notebook, then either upload the packaged pipeline via the UI or run it directly via KFP.Client().create_run_from_pipeline_package.

The Kubeflow Pipelines UI shows that all components except Transform run okay, and I see the corresponding Dataflow jobs in the Dataflow UI. Transform fails with the error below, and I do not even see a corresponding Dataflow job in the Dataflow UI.

```
INFO:absl:Nonempty beam arg setup_file already includes dependency
WARNING:tensorflow:From /tfx-src/tfx/components/transform/executor.py:511: Schema (from tensorflow_transform.tf_metadata.dataset_schema) is deprecated and will be removed in a future version.
Instructions for updating:
Schema is a deprecated, use schema_utils.schema_from_feature_spec to create a Schema
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 382, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 375, in main
    execution_info = launcher.launch()
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 205, in launch
    execution_decision.exec_properties)
  File "/tfx-src/tfx/orchestration/launcher/in_process_component_launcher.py", line 67, in _run_executor
    executor.Do(input_dict, output_dict, exec_properties)
  File "/tfx-src/tfx/components/transform/executor.py", line 391, in Do
    self.Transform(label_inputs, label_outputs, status_file)
  File "/tfx-src/tfx/components/transform/executor.py", line 833, in Transform
    preprocessing_fn = self._GetPreprocessingFn(inputs, outputs)
  File "/tfx-src/tfx/components/transform/executor.py", line 775, in _GetPreprocessingFn
    preprocessing_fn_path_split[-1])
  File "/tfx-src/tfx/utils/import_utils.py", line 77, in import_func_from_module
    user_module = importlib.import_module(module_path)
  File "/opt/venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'preprocessing'
```

```python
BEAM_PIPELINE_ARGS = [
    '--project=' + GCP_PROJECT_ID,
    '--runner=DataflowRunner',
    '--experiments=shuffle_mode=auto',
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--setup_file=./setup.py',
    # '--requirements_file=./requirements.txt',
    '--region=' + GCP_REGION,
]
```

Here's the setup.py, in which I attempted to include the preprocessing module explicitly, to no avail:

```python
import setuptools

setuptools.setup(
    name='markdowns-tfx-pipeline',
    version='0.2',
    py_modules=['configs', 'features', 'hparams', 'model', 'pipeline', 'preprocessing'],
    install_requires=['tfx'],
    packages=setuptools.find_packages(),
)
```

numerology commented 4 years ago

Hi @yantriks-edi-bice. If I understand correctly, the cause is likely that in the container entrypoint running on KFP, the user's setup_file is overridden by the code here.

Perhaps we can try fixing this by detecting the collision before constructing the Beam pipeline args? WDYT @neuromage?
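The collision check being discussed could be sketched roughly as follows. This is a hypothetical helper, not the actual TFX entrypoint code; the `DEFAULT_SETUP_FILE` path and function name are assumptions. The idea is to append the TFX default `--setup_file` only when the user did not supply one, so a user-provided setup.py takes precedence.

```python
# Hypothetical sketch of the collision check (not actual TFX code).
# The default path below is an assumption for illustration.
DEFAULT_SETUP_FILE = '/tfx-src/setup.py'

def resolve_beam_args(user_args):
    """Append the default --setup_file only if the user did not set one."""
    has_user_setup = any(arg.startswith('--setup_file') for arg in user_args)
    if has_user_setup:
        # The user's setup.py wins; do not inject the TFX default.
        return list(user_args)
    return list(user_args) + ['--setup_file=' + DEFAULT_SETUP_FILE]

# Example: a user-supplied setup.py is kept as-is.
args = resolve_beam_args(['--runner=DataflowRunner', '--setup_file=./setup.py'])
```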

neuromage commented 4 years ago

Yes, that looks like a bug. @numerology, do you mind sending out a fix ensuring the user's setup.py takes precedence over the default for Beam?

yantriks-edi-bice commented 4 years ago

Yes @numerology, that does look like the cause. I could see from the logs that the TFX setup.py was being used: everything from TFX worked as expected, but anything from my code failed (I believe the Transform component is the first one that refers to anything from my code).

Now, the code you pointed to does have access to the provided arguments and could easily check whether the user supplied a setup.py file; however, it cannot simply drop the TFX setup.py, right? Isn't that one required as well, or would it work if the user declares tfx as a requirement in their own setup.py, as I do above?

numerology commented 4 years ago

> Yes, that looks like a bug. @numerology do you mind sending out a fix for ensuring user setup.py takes precedence over the default for Beam?

Sure

> Isn't that required as well, or would it work if user supplies tfx as a requirement in user setup.py file like I do above?

I believe you'll need to declare tfx as a requirement in your setup.py to make sure the TFX library works correctly on the Dataflow workers as well.

yantriks-edi-bice commented 4 years ago

I applied the patch to my TFX container entrypoint in my Kubeflow notebook but I'm not seeing any effect. In fact, I'm not sure my notebook-local tfx package is being invoked at all, since I added some extra logging and don't see it. The Dataflow workers may be using an image with tfx baked in, or installing a clean (unpatched) copy, but I expected the orchestration part to run locally in the Kubeflow notebook container. Is that right?

numerology commented 4 years ago

> I applied the patch to my TFX container entry point in my Kubeflow notebook but not seeing the effect.

Did you rebuild and push a container image, and use that in the KubeflowDagRunner? Thanks.

yantriks-edi-bice commented 4 years ago

No, I did not, but I now see how I can pass a custom container to KubeflowDagRunner. How is the default tfx container image put together, and what's the best way for me to build a patched one while waiting for this fix to be merged and an image released?

numerology commented 4 years ago

@yantriks-edi-bice

The Dockerfile of the default TFX container released by the TFX team can be found at [1]; it is built on a base image (containing some heavy, common dependencies) [2]. We have a helper script for building the image at [3].

Official TFX releases usually come at a biweekly cadence, but we publish nightly images at [4]; for example, `docker pull tensorflow/tfx:0.22.0.dev20200428` will give you the nightly image built on April 28.

[1] https://github.com/tensorflow/tfx/blob/master/tfx/tools/docker/Dockerfile
[2] https://github.com/tensorflow/tfx/blob/master/tfx/tools/docker/base/Dockerfile
[3] https://github.com/tensorflow/tfx/blob/master/tfx/tools/docker/build_docker_image.sh
[4] https://hub.docker.com/r/tensorflow/tfx/tags

yantriks-edi-bice commented 4 years ago

Thanks @numerology, I just came across that script [3] now as well. Will give it a try.


yantriks-edi-bice commented 4 years ago

@numerology I finally managed to build a patched image, but then noticed the fix had been merged into master and there have been a few images since the merge. So I modified my Kubeflow pipeline to point the runner at the tfx:latest image and re-ran, only to find the same error message during the Transform step.

```python
tfx_image = "tensorflow/tfx:latest"

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config,
    tfx_image=tfx_image
)

kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(
```

Here's the Transform step error again

```
INFO:absl:Running executor for Transform
INFO:absl:Nonempty beam arg setup_file already includes dependency
WARNING:tensorflow:From /tfx-src/tfx/components/transform/executor.py:511: Schema (from tensorflow_transform.tf_metadata.dataset_schema) is deprecated and will be removed in a future version.
Instructions for updating:
Schema is a deprecated, use schema_utils.schema_from_feature_spec to create a Schema
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 382, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 375, in main
    execution_info = launcher.launch()
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 205, in launch
    execution_decision.exec_properties)
  File "/tfx-src/tfx/orchestration/launcher/in_process_component_launcher.py", line 67, in _run_executor
    executor.Do(input_dict, output_dict, exec_properties)
  File "/tfx-src/tfx/components/transform/executor.py", line 391, in Do
    self.Transform(label_inputs, label_outputs, status_file)
  File "/tfx-src/tfx/components/transform/executor.py", line 833, in Transform
    preprocessing_fn = self._GetPreprocessingFn(inputs, outputs)
  File "/tfx-src/tfx/components/transform/executor.py", line 775, in _GetPreprocessingFn
    preprocessing_fn_path_split[-1])
  File "/tfx-src/tfx/utils/import_utils.py", line 77, in import_func_from_module
    user_module = importlib.import_module(module_path)
  File "/opt/venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'preprocessing'
```

And for comparison, here's the CsvExampleGen step output showing a successful Dataflow job with the same pipeline and setup.py Beam args. I believe something is wrong with the Transform component specifically, since all the other components work fine. It's possible they succeed because they don't refer to any of my modules, but at least according to the logs they do run `python setup.py sdist`, whereas Transform does not:

```
INFO:root:Executing command: ['/opt/venv/bin/python', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmp6dpegkxd']
INFO:root:Starting GCS upload to gs://kf-poc-edi/tmp/beamapp-root-0428183625-124221.1588098985.124689/workflow.tar.gz...
INFO:root:Completed GCS upload to gs://kf-poc-edi/tmp/beamapp-root-0428183625-124221.1588098985.124689/workflow.tar.gz in 0 seconds.
INFO:root:Downloading source distribution of the SDK from PyPi
INFO:root:Executing command: ['/opt/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmp6dpegkxd', 'apache-beam==2.17.0', '--no-deps', '--no-binary', ':all:']
```

numerology commented 4 years ago

@yantriks-edi-bice To isolate the problem, do you mind trying the following?

  1. Use the in-cluster direct runner instead of the Dataflow runner.
  2. Let us know whether you passed the module file as a URI to the Transform component, or used the preprocessing_fn.

yantriks-edi-bice commented 4 years ago

@numerology

Here's the complete KubeflowDagRunner.run call:

```python
tfx_image = "tensorflow/tfx:latest"

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config,
    tfx_image=tfx_image
)

kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(
# kubeflow_dag_runner.KubeflowDagRunner().run(
    pipeline.create_pipeline(
        pipeline_name=configs.PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_path=DATA_PATH,
        # TODO(step 7): (Optional) Uncomment below to use BigQueryExampleGen.
        # query=configs.BIG_QUERY_QUERY,
        preprocessing_fn=configs.PREPROCESSING_FN,
        trainer_fn=configs.TRAINER_FN,
        train_args=configs.TRAIN_ARGS,
        eval_args=configs.EVAL_ARGS,
        serving_model_dir=SERVING_MODEL_DIR,
        # TODO(step 7): (Optional) Uncomment below to use provide GCP related
        #               config for BigQuery.
        # beam_pipeline_args=configs.BIG_QUERY_BEAM_PIPELINE_ARGS,
        # TODO(step 8): (Optional) Uncomment below to use Dataflow.
        beam_pipeline_args=configs.BEAM_PIPELINE_ARGS,
        # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
        # ai_platform_training_args=configs.GCP_AI_PLATFORM_TRAINING_ARGS,
        # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
        # ai_platform_serving_args=configs.GCP_AI_PLATFORM_SERVING_ARGS,
    ))
```

and the relevant configs.py snippet:

```python
PREPROCESSING_FN = 'preprocessing.preprocessing_fn'
```

I'll try the in-cluster direct runner next. I guess I just comment out the beam_pipeline_args in the code above?

numerology commented 4 years ago

> I'll try the in-cluster direct runner next. I guess just comment out the beam_pipeline_args in the code above?

Yep, thanks

numerology commented 4 years ago

/assign @jiyongjung0

To provide input regarding the taxi template.

yantriks-edi-bice commented 4 years ago

Failed again with almost the same output (the line about the non-empty beam arg is no longer there):

```
INFO:absl:Running driver for Transform
INFO:absl:MetadataStore with gRPC connection initialized
INFO:absl:Adding KFP pod name markdowns-tfx-pipeline-zxqjb-3171395185 to execution
INFO:absl:Adding KFP pod name markdowns-tfx-pipeline-zxqjb-3171395185 to execution
INFO:absl:Running executor for Transform
WARNING:tensorflow:From /tfx-src/tfx/components/transform/executor.py:511: Schema (from tensorflow_transform.tf_metadata.dataset_schema) is deprecated and will be removed in a future version.
Instructions for updating:
Schema is a deprecated, use schema_utils.schema_from_feature_spec to create a Schema
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 360, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 353, in main
    execution_info = launcher.launch()
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 205, in launch
    execution_decision.exec_properties)
  File "/tfx-src/tfx/orchestration/launcher/in_process_component_launcher.py", line 67, in _run_executor
    executor.Do(input_dict, output_dict, exec_properties)
  File "/tfx-src/tfx/components/transform/executor.py", line 391, in Do
    self.Transform(label_inputs, label_outputs, status_file)
  File "/tfx-src/tfx/components/transform/executor.py", line 849, in Transform
    preprocessing_fn = self._GetPreprocessingFn(inputs, outputs)
  File "/tfx-src/tfx/components/transform/executor.py", line 791, in _GetPreprocessingFn
    preprocessing_fn_path_split[-1])
  File "/tfx-src/tfx/utils/import_utils.py", line 77, in import_func_from_module
    user_module = importlib.import_module(module_path)
  File "/opt/venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'preprocessing'
```

yantriks-edi-bice commented 4 years ago

FYI, this is how I run the built pipeline from the Kubeflow notebook:

```python
import kfp

run_result = kfp.Client().create_run_from_pipeline_package(
    configs.PIPELINE_NAME + '.tar.gz', arguments={})
```

numerology commented 4 years ago

Okay, so the problem is actually that Transform cannot correctly find the preprocessing function.

Took a quick look at the taxi template code (sorry, I was not familiar with that part). Given that Transform (and, if you get that far, Trainer) uses a local reference to the preprocessing function, you'll have to rebuild and push your image and use that image in the KubeflowDagRunner. In the template notebook I think this is done via skaffold, as mentioned in step 4 of https://github.com/tensorflow/tfx/blob/master/docs/tutorials/tfx/template.ipynb

One way to work around that, if you do not want to rebuild and push your image, is to put your preprocessing function into a Python file (which we call a module file) and upload it to a GCS path. Then pass the GCS URI to the module_file parameter of the Transform component. You'll likely need to do the same for Trainer.
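That workaround could look roughly like the following. This is a sketch only: the GCS path and the upstream component names (`example_gen`, `schema_gen`) are placeholders, and the exact channel names depend on your TFX version.

```python
# Sketch of the module_file workaround (paths and upstream component
# names are placeholders). Instead of passing
# preprocessing_fn='preprocessing.preprocessing_fn', point Transform at
# a module file hosted on GCS, so the executor image does not need the
# user code baked in.
from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file='gs://my-bucket/pipeline/preprocessing.py',  # hypothetical URI
)
```

With this wiring, the Transform executor downloads and imports the module file at runtime, which is why a custom image is not required for the preprocessing code itself.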

yantriks-edi-bice commented 4 years ago

Yes, I've seen step 4 before and have actually tried it, but I suppose the image was based on a tfx version that ignored the user-supplied setup.py file. I'll now try supplying tfx:latest to the skaffold Dockerfile before kicking off `tfx pipeline create --build-target-image={CUSTOM_TFX_IMAGE}`, which uses skaffold to build CUSTOM_TFX_IMAGE, which I suppose I will then have to supply to the KubeflowDagRunner config.

The kubeflow_dag_runner.py code contains the following comment, which no longer seems to be accurate (is this custom image discovery only possible via tfx create & run?):

```python
# This pipeline automatically injects the Kubeflow TFX image if the
# environment variable 'KUBEFLOW_TFX_IMAGE' is defined. Currently, the tfx
# cli tool exports the environment variable to pass to the pipelines.
# tfx_image = os.environ.get('KUBEFLOW_TFX_IMAGE', None)
```

numerology commented 4 years ago

> (is this custom image discovery only possible via tfx create&run?)

I think so, yes; it's only automatically injected in the taxi template experience. If you look at the basic KubeflowDagRunner, it does not have that functionality.

numerology commented 4 years ago

@jiyongjung0 can provide more details regarding the template experience.

yantriks-edi-bice commented 4 years ago

After building a custom TFX image via `tfx pipeline create` and launching with `tfx run`, the pipeline now fails at the very first component, which used to work okay:

```
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 360, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 351, in main
    telemetry_utils.LABEL_TFX_RUNNER: 'kfp',
AttributeError: module 'tfx.utils.telemetry_utils' has no attribute 'LABEL_TFX_RUNNER'
```

It's likely due to my existing Dockerfile, which does a pip install of requirements.txt at the end, and that file pins an older tfx version (0.21.4). I don't think there's a pip package yet that includes the beam-args fix. I also saw somewhere that requirements.txt shouldn't be used for Dataflow. I renamed the Dockerfile and reran `tfx pipeline create`; this time it generated a Dockerfile with `python setup.py install` as the last step. The custom TFX image build then failed with:

```
  File "/home/jovyan/.local/lib/python3.6/site-packages/tfx/tools/cli/handler/kubeflow_handler.py", line 291, in _build_pipeline_image
    skaffold_cmd=skaffold_cmd).build()
  File "/home/jovyan/.local/lib/python3.6/site-packages/tfx/tools/cli/container_builder/builder.py", line 95, in build
    image_sha = skaffold_cli.build(buildspec_filename=self._buildspec.filename)
  File "/home/jovyan/.local/lib/python3.6/site-packages/tfx/tools/cli/container_builder/skaffold_cli.py", line 52, in build
    buildspec_filename
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['skaffold', 'build', '-q', '--output={{json .}}', '-f', 'build.yaml']' returned non-zero exit status 1.
```

It's possible the requires section of setup.py needs to include many more packages than just tfx, so I'll try that and see where it gets me.

I feel caught in a roundabout of dependencies and specifications for the various stages of this process, from building custom images for the Kubeflow orchestration to the Dataflow workers. I appreciate the help, though!

numerology commented 4 years ago

> It's likely due to my existing Dockerfile which does pip install requirements.txt at the end and that file has an older tfx version 0.21.4.

Actually, we have a nightly build wheel which you can install by running `pip3 install --index-url https://test.pypi.org/simple/ tfx==0.22.0.dev20200504 --user`.

> subprocess.CalledProcessError: Command '['skaffold', 'build', '-q', '--output={{json .}}', '-f', 'build.yaml']' returned non-zero exit status 1.

For this specific error, I think it may be due to a missing skaffold binary in the current environment. Perhaps you'd like to check that as well.

> I feel caught in a roundabout of dependencies and specifications for various stages of this process, from building custom images for the Kubeflow orchestration to the Dataflow workers etc.

Indeed. Kubernetes relies on container images to package custom dependencies so that heterogeneous workloads can all be supported by the distributed workers. This raises the question (especially for TFX, because it's both a library and an SDK) of supporting portable dependency/environment resolution at runtime, which is a tricky engineering problem.

We're actively working on improving this area, and I think it will get much better soon.

yantriks-edi-bice commented 4 years ago

I installed the latest tfx wheel as you instructed (though it required `--extra-index-url https://pypi.org/simple/` in order to find all the dependencies).

skaffold is there (I had installed it) and is found okay:

```
Reading build spec from build.yaml
Use skaffold to build the container image.
/home/jovyan/.local/bin/skaffold
FATA[0126] build failed: building [gcr.io/saas-ml-dev/markdowns-tfx-pipeline]: cloud build failed: FAILURE
No container image is built.
Traceback (most recent call last):
```

Since I'm using Cloud Build in the skaffold build spec, I realized I could look at the build job for clues, and indeed it was a dependency conflict:

```
Installed /opt/venv/lib/python3.6/site-packages/markdowns_tfx_pipeline-0.2-py3.6.egg
Processing dependencies for markdowns-tfx-pipeline==0.2
error: PyYAML 5.3.1 is installed but pyyaml~=3.12 is required by {'kubernetes'}
The command '/bin/sh -c python3 setup.py install' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: step exited with non-zero status: 1
```

I repeated this process many times, upgrading some requirements and deleting others; there are quite a lot, since requirements.txt was frozen from the Kubeflow notebook. I eventually got the image, and then had tons of fun trying to figure out whether it contained all my pipeline code, and where and why it wasn't being found.

I figured out I had to override the entrypoint in order to shell into the container and inspect it. Once in the shell, it was quick work to add the right steps to the Dockerfile to make my pipeline code accessible from the executor code: I simply installed it via pip.

I resubmitted the pipeline; all the cached steps were bypassed okay, and it is now running the Transform step on Dataflow.

Thanks for all your help!
