Hi @yantriks-edi-bice If I understand it correctly, the reason is perhaps that in the container entrypoint running on KFP, the setup_file gets overridden by the code here.
Perhaps we can try fixing this by detecting the collision before constructing the beam pipeline args? WDYT @neuromage ?
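For illustration, a minimal sketch of such a collision check might look like the following (this is not the actual container entrypoint code, and the default path below is a placeholder):

```python
# Sketch only: append the default TFX setup.py to the Beam args unless the
# user already passed their own --setup_file. The path below is hypothetical.
DEFAULT_SETUP_FILE = '/tfx-src/setup.py'

def resolve_beam_args(user_beam_args):
  """Returns Beam args in which a user-supplied setup_file takes precedence."""
  if any(arg.startswith('--setup_file') for arg in user_beam_args):
    return list(user_beam_args)
  return list(user_beam_args) + ['--setup_file=' + DEFAULT_SETUP_FILE]

# Example: the user's setup.py is kept, the default is not injected.
print(resolve_beam_args(['--runner=DataflowRunner', '--setup_file=./setup.py']))
```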
Yes, that looks like a bug. @numerology do you mind sending out a fix for ensuring user setup.py takes precedence over the default for Beam?
Yes @numerology, that does look like the reason. I could see from the logs that the TFX setup.py was being used: all things TFX were working as expected, but anything from my code (and I believe the Transform component is the first one that refers to anything from my code) was failing.
Now, the code you pointed to does give you access to the provided arguments and could easily check whether the user supplied a setup.py file, but it cannot simply override the TFX setup.py one, right? Isn't that required as well, or would it work if the user supplies tfx as a requirement in their setup.py file, as I do above?
> Yes, that looks like a bug. @numerology do you mind sending out a fix for ensuring user setup.py takes precedence over the default for Beam?
Sure
> Isn't that required as well, or would it work if the user supplies tfx as a requirement in their setup.py file, as I do above?
I believe you'll need to declare tfx as a requirement in your setup.py to make sure the TFX library works correctly in the Dataflow workers as well.
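For example, a user setup.py along those lines could be as minimal as this sketch (the name and version are placeholders; the reporter's actual setup.py is shown at the bottom of this issue):

```python
import setuptools

setuptools.setup(
    name='my-pipeline',  # placeholder package name
    version='0.1',
    packages=setuptools.find_packages(),
    # Declaring tfx here makes the TFX library available on the Dataflow workers.
    install_requires=['tfx'],
)
```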
I applied the patch to my TFX container entry point in my Kubeflow notebook but not seeing the effect. In fact, I'm not sure my notebook-local tfx package is being called at all, as I added some other logging and am not seeing it. The Dataflow workers may be using an image with tfx in it, or installing a clean (not my patched) one, but I was expecting the orchestration part to be done locally on the Kubeflow notebook container - is that true?
> I applied the patch to my TFX container entry point in my Kubeflow notebook but not seeing the effect.
Did you rebuild and push a container image, and use that in the KubeflowDagRunner? Thanks
No, I did not but I do see now how I can pass a custom container to KubeflowDagRunner. How is the default tfx container image put together, and what's the best way for me to build a patched one while waiting for this to be merged and an image released?
@yantriks-edi-bice
The Dockerfile of the default TFX container released by the TFX team can be found at [1]; it is built on top of a base image (containing some heavy, common dependencies) [2]. We have a helper script for building the image at [3].
Official TFX releases usually come at a biweekly cadence, but we also publish nightly images at [4]; for example, `docker pull tensorflow/tfx:0.22.0.dev20200428` will give you the nightly image built on April 28.
[1] https://github.com/tensorflow/tfx/blob/master/tfx/tools/docker/Dockerfile
[2] https://github.com/tensorflow/tfx/blob/master/tfx/tools/docker/base/Dockerfile
[3] https://github.com/tensorflow/tfx/blob/master/tfx/tools/docker/build_docker_image.sh
[4] https://hub.docker.com/r/tensorflow/tfx/tags
Thanks @numerology, I just came across that script [3] now as well. Will give it a try.
@numerology I finally managed to build a patched image, but noticed the fix has been merged into master and there have been a few images since the merge. So I modified my Kubeflow pipeline to point the runner at the tfx:latest image and re-ran, only to find the same error message during the Transform step:
```python
tfx_image = "tensorflow/tfx:latest"
runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config,
    tfx_image=tfx_image
)
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(
```
Here's the Transform step error again:
```
INFO:absl:Running executor for Transform
INFO:absl:Nonempty beam arg setup_file already includes dependency
WARNING:tensorflow:From /tfx-src/tfx/components/transform/executor.py:511: Schema (from tensorflow_transform.tf_metadata.dataset_schema) is deprecated and will be removed in a future version.
Instructions for updating:
Schema is a deprecated, use schema_utils.schema_from_feature_spec to create a Schema
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 382, in
```
And for comparison, here's the CsvExampleGen step output showing a successful Dataflow job with the same pipeline and setup.py Beam args. I believe there's something wrong with the Transform component, as all the other components work fine. It's possible they succeed because they don't refer to any of my modules, but at least from the logs they do run "python setup.py sdist", whereas Transform does not:
```
INFO:root:Executing command: ['/opt/venv/bin/python', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmp6dpegkxd']
INFO:root:Starting GCS upload to gs://kf-poc-edi/tmp/beamapp-root-0428183625-124221.1588098985.124689/workflow.tar.gz...
INFO:root:Completed GCS upload to gs://kf-poc-edi/tmp/beamapp-root-0428183625-124221.1588098985.124689/workflow.tar.gz in 0 seconds.
INFO:root:Downloading source distribution of the SDK from PyPi
INFO:root:Executing command: ['/opt/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmp6dpegkxd', 'apache-beam==2.17.0', '--no-deps', '--no-binary', ':all:']
```
@yantriks-edi-bice so to isolate the problem, do you mind trying the following: how do you specify the preprocessing_fn?
@numerology Here's the complete KubeflowDagRunner.run call:
```python
tfx_image = "tensorflow/tfx:latest"
runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config,
    tfx_image=tfx_image
)
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(
# kubeflow_dag_runner.KubeflowDagRunner().run(
    pipeline.create_pipeline(
        pipeline_name=configs.PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_path=DATA_PATH,
        # TODO(step 7): (Optional) Uncomment below to use BigQueryExampleGen.
        # query=configs.BIG_QUERY_QUERY,
        preprocessing_fn=configs.PREPROCESSING_FN,
        trainer_fn=configs.TRAINER_FN,
        train_args=configs.TRAIN_ARGS,
        eval_args=configs.EVAL_ARGS,
        serving_model_dir=SERVING_MODEL_DIR,
        # TODO(step 7): (Optional) Uncomment below to use provide GCP related
        # config for BigQuery.
        # beam_pipeline_args=configs.BIG_QUERY_BEAM_PIPELINE_ARGS,
        # TODO(step 8): (Optional) Uncomment below to use Dataflow.
        beam_pipeline_args=configs.BEAM_PIPELINE_ARGS,
        # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
        # ai_platform_training_args=configs.GCP_AI_PLATFORM_TRAINING_ARGS,
        # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
        # ai_platform_serving_args=configs.GCP_AI_PLATFORM_SERVING_ARGS,
    ))
```
and the relevant configs.py snippet:

```python
PREPROCESSING_FN = 'preprocessing.preprocessing_fn'
```
I'll try the in-cluster direct runner next. I guess just comment out the beam_pipeline_args in the code above?
> I'll try the in-cluster direct runner next. I guess just comment out the beam_pipeline_args in the code above?
Yep, thanks
/assign @jiyongjung0
To provide inputs regarding the Taxi template.
Failed again with what seems like almost the same output (the line about the non-empty beam arg is no longer there):
```
INFO:absl:Running driver for Transform
INFO:absl:MetadataStore with gRPC connection initialized
INFO:absl:Adding KFP pod name markdowns-tfx-pipeline-zxqjb-3171395185 to execution
INFO:absl:Adding KFP pod name markdowns-tfx-pipeline-zxqjb-3171395185 to execution
INFO:absl:Running executor for Transform
WARNING:tensorflow:From /tfx-src/tfx/components/transform/executor.py:511: Schema (from tensorflow_transform.tf_metadata.dataset_schema) is deprecated and will be removed in a future version.
Instructions for updating:
Schema is a deprecated, use schema_utils.schema_from_feature_spec to create a Schema
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 360, in
```
FYI, this is how I run the built pipeline from the Kubeflow notebook:

```python
import kfp

run_result = kfp.Client().create_run_from_pipeline_package(
    configs.PIPELINE_NAME + '.tar.gz', arguments={})
```
Okay, so the problem is actually that Transform cannot correctly find the preprocessing function.
I took a quick look at the taxi template code (sorry, I was not familiar with that part). Given that Transform (and, if you proceed successfully, Trainer) uses a local reference to the preprocessing function, you'll have to rebuild and push your image and use that image in KubeflowDagRunner. I think in the template notebook this is done with Skaffold, as mentioned in step 4 of https://github.com/tensorflow/tfx/blob/master/docs/tutorials/tfx/template.ipynb
One way to work around that, if you do not want to rebuild and push your image, is to put your preprocessing function into a Python file (which we call a module file) and upload it to a GCS path. Then you can pass the GCS URI to the module_file parameter of the Transform component. I guess you'll need to do the same thing for Trainer as well.
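As a rough sketch of that workaround (the GCS URI is a placeholder, and example_gen/schema_gen stand in for the upstream components of a taxi-template pipeline):

```python
from tfx.components import Transform

# Point Transform at a module file on GCS rather than a preprocessing_fn that
# must be importable inside the container image.
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file='gs://my-bucket/pipeline/preprocessing.py',  # hypothetical GCS path
)
```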
Yes, I've seen step 4 before and have actually tried it, but I suppose the image was based on the tfx version which ignored the user-supplied setup.py file. I will now try supplying tfx:latest in the Skaffold Dockerfile before kicking off "tfx pipeline create --build-target-image={CUSTOM_TFX_IMAGE}", which uses Skaffold to build CUSTOM_TFX_IMAGE, which I suppose I will then have to supply to the KubeflowDagRunner config.
The kubeflow_dag_runner.py code says the following, which doesn't seem to be the case anymore (is this custom image discovery only possible via tfx create & run?):
```python
# This pipeline automatically injects the Kubeflow TFX image if the
# environment variable 'KUBEFLOW_TFX_IMAGE' is defined. Currently, the tfx
# cli tool exports the environment variable to pass to the pipelines.
# tfx_image = os.environ.get('KUBEFLOW_TFX_IMAGE', None)
```
> (is this custom image discovery only possible via tfx create & run?)
I think yes, it's only automatically injected in the taxi template experience. If you look at the basic KubeflowDagRunner, it does not have such a function.
@jiyongjung0 can provide more details regarding the template experience.
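If you are not going through the tfx CLI, one way to mimic that injection manually is a sketch like the one below (assuming the same runner setup as in the snippet above):

```python
import os
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Mimic the tfx CLI: read a custom image from the environment, falling back to
# the default TFX image when KUBEFLOW_TFX_IMAGE is not set.
tfx_image = os.environ.get('KUBEFLOW_TFX_IMAGE', None)

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=kubeflow_dag_runner.get_default_kubeflow_metadata_config(),
    tfx_image=tfx_image,
)
```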
After building a custom TFX image via tfx pipeline create and using tfx run, the pipeline now fails at the very first component, which used to work okay:
```
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 360, in
```
It's likely due to my existing Dockerfile, which does a pip install of requirements.txt at the end, and that file pins an older tfx version (0.21.4). I don't think there's a pip package yet which includes the beam args fix. I also saw somewhere that requirements.txt should not be used for Dataflow. I renamed the Dockerfile and reran tfx pipeline create, and this time it generated a Dockerfile with python setup.py install as the last step. The custom TFX image build failed with:
File "/home/jovyan/.local/lib/python3.6/site-packages/tfx/tools/cli/handler/kubeflow_handler.py", line 291, in _build_pipeline_image skaffold_cmd=skaffold_cmd).build() File "/home/jovyan/.local/lib/python3.6/site-packages/tfx/tools/cli/container_builder/builder.py", line 95, in build image_sha = skaffold_cli.build(buildspec_filename=self._buildspec.filename) File "/home/jovyan/.local/lib/python3.6/site-packages/tfx/tools/cli/container_builder/skaffold_cli.py", line 52, in build buildspec_filename File "/usr/lib/python3.6/subprocess.py", line 356, in check_output **kwargs).stdout File "/usr/lib/python3.6/subprocess.py", line 438, in run output=stdout, stderr=stderr) subprocess.CalledProcessError: Command '['skaffold', 'build', '-q', '--output={{json .}}', '-f', 'build.yaml']' returned non-zero exit status 1.
It's possible the requires part of setup.py needs to include a lot more packages than just tfx, so I'll do that and see where it gets me.
I feel caught in a roundabout of dependencies and specifications for various stages of this process, from building custom images for the Kubeflow orchestration to the Dataflow workers etc. Appreciate the help though!
> It's likely due to my existing Dockerfile, which does a pip install of requirements.txt at the end, and that file pins an older tfx version (0.21.4).
Actually, we have a nightly build wheel which you can install by running:

```
!pip3 install --index-url https://test.pypi.org/simple/ tfx==0.22.0.dev20200504 --user
```
> subprocess.CalledProcessError: Command '['skaffold', 'build', '-q', '--output={{json .}}', '-f', 'build.yaml']' returned non-zero exit status 1.
For this specific error, I think it's possibly due to Skaffold missing in the current environment. Perhaps you would like to check that as well.
> I feel caught in a roundabout of dependencies and specifications for various stages of this process, from building custom images for the Kubeflow orchestration to the Dataflow workers etc.
Indeed, Kubernetes relies on container images to package custom dependencies so that various heterogeneous workloads can all be supported by the distributed workers. This brings up the question (especially for TFX, because it's both a library and an SDK) of supporting portable dependency/environment resolution at runtime, which is a tricky engineering problem.
We're actively working on improving this department and I think it will be much better soon.
I installed the latest tfx wheel as you instructed (though it required --extra-index-url https://pypi.org/simple/ in order to find all the dependencies).
Skaffold is there (I had installed it) and is found okay:
```
Reading build spec from build.yaml
Use skaffold to build the container image.
/home/jovyan/.local/bin/skaffold
FATA[0126] build failed: building [gcr.io/saas-ml-dev/markdowns-tfx-pipeline]: cloud build failed: FAILURE
No container image is built.
Traceback (most recent call last):
```
Since I'm using Cloud Build in the skaffold build spec, I realized I could look at the build job for any clues, and yes it was a dependency conflict:
```
Installed /opt/venv/lib/python3.6/site-packages/markdowns_tfx_pipeline-0.2-py3.6.egg
Processing dependencies for markdowns-tfx-pipeline==0.2
error: PyYAML 5.3.1 is installed but pyyaml~=3.12 is required by {'kubernetes'}
The command '/bin/sh -c python3 setup.py install' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: step exited with non-zero status: 1
```
I repeated this process many times, upgrading some requirements and deleting others; there are quite a lot since the requirements.txt was frozen from the Kubeflow notebook. Eventually I got the image, and then had tons of fun trying to figure out whether it had all my pipeline code, and where and why it wasn't being found.
I figured out I had to override the entrypoint in order to be able to shell into the image and inspect it. Once in the shell, it was quick work to add the right steps to the Dockerfile to make my pipeline code accessible from the executor code: I simply installed it via pip.
Resubmitted the pipeline and all the cached steps got bypassed okay and it is now running the Transform step on DataFlow.
Thanks for all your help!
I built a TFX pipeline based on the taxi template. All components run okay when using the BeamDagRunner. When using the KubeflowDagRunner, all but Transform run okay. I build the pipeline in a Kubeflow notebook and either upload the packaged pipeline via the UI or run it directly via kfp.Client().create_run_from_pipeline_package.
The Kubeflow Pipelines UI shows that all components except Transform run okay, and I see the corresponding Dataflow jobs in the Dataflow UI. Transform fails with the error below, and I do not even see a corresponding Dataflow job in the Dataflow UI.
```
INFO:absl:Nonempty beam arg setup_file already includes dependency
WARNING:tensorflow:From /tfx-src/tfx/components/transform/executor.py:511: Schema (from tensorflow_transform.tf_metadata.dataset_schema) is deprecated and will be removed in a future version.
Instructions for updating:
Schema is a deprecated, use schema_utils.schema_from_feature_spec to create a Schema
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 382, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 375, in main
    execution_info = launcher.launch()
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 205, in launch
    execution_decision.exec_properties)
  File "/tfx-src/tfx/orchestration/launcher/in_process_component_launcher.py", line 67, in _run_executor
    executor.Do(input_dict, output_dict, exec_properties)
  File "/tfx-src/tfx/components/transform/executor.py", line 391, in Do
    self.Transform(label_inputs, label_outputs, status_file)
  File "/tfx-src/tfx/components/transform/executor.py", line 833, in Transform
    preprocessing_fn = self._GetPreprocessingFn(inputs, outputs)
  File "/tfx-src/tfx/components/transform/executor.py", line 775, in _GetPreprocessingFn
    preprocessing_fn_path_split[-1])
  File "/tfx-src/tfx/utils/import_utils.py", line 77, in import_func_from_module
    user_module = importlib.import_module(module_path)
  File "/opt/venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'preprocessing'
```

Here are the Beam pipeline args:

```python
BEAM_PIPELINE_ARGS = [
    '--project=' + GCP_PROJECT_ID,
    '--runner=DataflowRunner',
    '--experiments=shuffle_mode=auto',
]
```
Here's the setup.py where I attempted to include the preprocessing module explicitly, to no avail:
```python
import setuptools

setuptools.setup(
    name='markdowns-tfx-pipeline',
    version='0.2',
    py_modules=['configs', 'features', 'hparams', 'model', 'pipeline', 'preprocessing'],
    install_requires=['tfx'],
    packages=setuptools.find_packages(),
)
```