tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.org/tfx
Apache License 2.0

TFX 1.0.0-rc1 Issues #3849

Closed dhruvesh09 closed 2 years ago

dhruvesh09 commented 3 years ago

Please comment if you find any issues with TFX 1.0.0-rc1

Thanks

rthadur commented 3 years ago

cc @arghyaganguly

vaskozl commented 3 years ago

https://github.com/tensorflow/tfx/commit/baccb6bc9650aa8f69548e8ca68473e83606f10b breaks kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(tfx_pipeline) in a CI/CD setup where the CI is not configured with production bucket access, because component._resolve_pip_dependencies tries to create the pipeline root.

I personally include the module_file in the docker image and install any deps on top, so I think this behaviour should be optional and bucket access should not be required to build the pipeline.

Edit: This is already the case in 0.30.0

vaskozl commented 3 years ago

Something else I'm finding is that Transform now OOMs on relatively small datasets with the default statistics under DirectRunner (I have not tested other Beam runners yet).

I'm having to set disable_statistics=True.
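
For reference, a minimal sketch of that workaround on the Transform component; all arguments other than disable_statistics are placeholders for your own pipeline:

from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],   # placeholder upstream component
    schema=schema_gen.outputs['schema'],        # placeholder upstream component
    module_file='preprocessing.py',             # hypothetical module file
    disable_statistics=True,                    # skip pre/post-transform statistics to avoid the OOM
)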

chris-r-99 commented 3 years ago

Hi, we have a similar problem to what @vaskozl mentioned about the kubeflow_dag_runner. We store all the code (transform, trainer and custom components) in a Docker image to guarantee code versioning etc. I want to provide paths to code in my Docker container that don't exist in my build pipeline when running kubeflow_dag_runner.KubeflowDagRunner.

As a quick fix for the bucket access, you can just set pipeline_root = None during the build process and provide a different pipeline_root as a runtime parameter in Kubeflow. However, the paths provided to the Transform/Trainer need to exist when running kubeflow_dag_runner.KubeflowDagRunner, which means I can't provide paths that only exist in my Docker container. My quick fix right now is to change the paths provided in the YAML file after running the KubeflowDagRunner.

It would be nice to be able to turn off the functionality that packs them into a pip wheel for execution. If there is another way to do it, please let me know.
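
For readers hitting the same thing, an untested sketch of the quick fix described above (pipeline name, components list and image are hypothetical placeholders): build the pipeline with pipeline_root left unset so nothing touches the production bucket, then supply the real root as a Kubeflow runtime parameter when launching the run.

from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner

tfx_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',   # hypothetical
    pipeline_root=None,            # supplied later as a runtime parameter in the Kubeflow UI
    components=components,         # your list of TFX components
)

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    tfx_image='gcr.io/my-project/my-tfx-image',  # hypothetical image with code/module files baked in
)
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(tfx_pipeline)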

jiyongjung0 commented 3 years ago

@charlesccychen Could you take a look at the issue above?

ConverJens commented 3 years ago

I'm hitting a Beam error that I have not seen in either TFX 0.28.0 or 0.29.0 (I don't know about 0.30.0). It happens with Beam 2.29.0 and 2.30.0 under DirectRunner; I haven't tested any other runner.

I get:

E0611 13:13:32.522413537    1240 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"

Has anyone else seen this?

vaskozl commented 3 years ago

I'm getting the "ENHANCE_YOUR_CALM" error when using DirectRunner on 2.29.0 with multi_threading. (The error doesn't appear with in_memory since, as far as I know, that mode doesn't use gRPC communication.)
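
For context, the DirectRunner mode is selected via standard Beam pipeline options passed to the TFX pipeline. A minimal sketch of forcing in_memory as a workaround; pipeline name, root and components are placeholders:

from tfx.orchestration import pipeline

tfx_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',           # hypothetical
    pipeline_root='/tmp/pipeline_root',    # hypothetical
    components=components,                 # your list of TFX components
    beam_pipeline_args=[
        '--direct_running_mode=in_memory',  # per the reports above, multi_threading/multi_processing trigger the GOAWAY error
        '--direct_num_workers=1',
    ],
)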

ConverJens commented 3 years ago

@vaskozl You're right! From version 0.29.0 onwards this seems to be an issue with the Direct runner and multiprocessing. in_memory works, but is of course much slower.

Did you find a fix/work around for this?

1025KB commented 3 years ago

cc Kyle Weaver

vaskozl commented 3 years ago

@ConverJens

Sadly not. I've been running with FlinkRunner and DirectRunner in_memory exclusively.

ConverJens commented 3 years ago

@vaskozl That's a bummer! Thanks for letting me know.

axeltidemann commented 3 years ago

Upgrading from TFX 0.27.0, I see that the resulting YAML files are much larger. There is a --tfx_ir element (which I think is the TFX Intermediate Representation) in the YAML that specifies the entire pipeline for each component, i.e. the specification is repeated many times.

For comparison, here is the YAML for the TFX 0.27.0 pipeline, which totals 1.7K lines. The same pipeline's YAML in TFX 1.0.0-rc1 is a whopping 63K lines, which eventually makes the file too big, as shown by the following error message:

[Screenshot of the resulting error message]

Is there a way to disable the repeated entry of the tfx_ir?

RossKohler commented 3 years ago

@vaskozl @ConverJens I've seen this issue in my own pipeline and noticed it was caused by a dependency requiring TensorFlow 2.5.0. In my case, it was an unpinned tensorflow_text dependency. Ensuring the TensorFlow version wasn't being upgraded from TensorFlow 2.4.* (i.e. pinning tensorflow_text to 2.4.3) fixed the issue for me. It also only seemed to happen in my Evaluator and BulkInferrer components, though.

ConverJens commented 3 years ago

@axeltidemann This has been an issue since TFX switched to IR representation for me: https://github.com/tensorflow/tfx/issues/3459.

@1025KB I think this is quite a bad issue since it prevents the TFX taxi showcase pipeline from being run in Kubeflow, i.e. really just the boilerplate components.

ConverJens commented 3 years ago

@RossKohler Are you referring to the "ENHANCE_YOUR_CALM" issue? The latest TFX depends on TensorFlow 2.5... I will try to pin TF Text and see if that helps.

RossKohler commented 3 years ago

@ConverJens yes. TensorFlow 2.5 was the issue for me.

ConverJens commented 3 years ago

@RossKohler Just tried with TF 2.4.1 and TF Text 2.4.3 but the issue remains for me.

RossKohler commented 3 years ago

@ConverJens the only other suggestion I have would be to look out for any dependency issues while your image is installing requirements. Other than that, I'm out of ideas:(

ConverJens commented 3 years ago

@RossKohler I don't think dependency issues are the cause of this but thanks for the tip!

@ibzib Do you have any ideas about the cause of this?

deep-diver commented 3 years ago

Hi, I have tried out the TFX CLI for Vertex AI as below 👇

!tfx pipeline create  \
--pipeline-path=kubeflow_v2_runner.py \
--engine=vertex \
--build-image

Then it spat out a requests.exceptions.HTTPError: 404 Client Error. The full error message is below; vp11 is the name of the pipeline.

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.40/distribution/gcr.io/silken-psyche-312112/vp11/json

It looks like TFX somehow fails to push the image to GCR. What do you think?

deep-diver commented 3 years ago

I just found a bug when copying the taxi template. The included Dockerfile's base image is FROM tensorflow/tfx:1.0.0-rc1, but no such image tag exists. If I replace it with FROM tensorflow/tfx:1.0.0rc1 instead, it works.

I'd like to fix this issue in a PR; what should I do?

jiyongjung0 commented 3 years ago

@deep-diver Thank you for the report and suggestion!!

It seems like an old issue that existed for all rc releases. The tag name is generated in https://github.com/tensorflow/tfx/blob/70d1bb9ff09189316f010e165da742f84e8c73c3/tfx/tools/cli/container_builder/labels.py#L20 and the problem is that the version name pattern is not consistent for RCs. We might be able to add some special handling for RC version around https://github.com/tensorflow/tfx/blob/70d1bb9ff09189316f010e165da742f84e8c73c3/tfx/tools/cli/container_builder/dockerfile.py#L59-L62 .
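
To illustrate the kind of special handling mentioned, a minimal sketch (not the actual TFX CLI code) that maps a "-rcN" version suffix onto the "rcN" form used by the published Docker tags:

import re

def base_image_tag(tfx_version: str) -> str:
    # The published tensorflow/tfx images use the PEP 440 style '1.0.0rc1',
    # so drop the dash before a pre-release suffix if one is present.
    return re.sub(r'-(rc\d+)$', r'\1', tfx_version)

assert base_image_tag('1.0.0-rc1') == '1.0.0rc1'
assert base_image_tag('0.30.0') == '0.30.0'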

deep-diver commented 3 years ago

@jiyongjung0 Thanks for the quick response! So this issue only affects the RCs, and I don't need to open a PR for it?

Also, I have an additional question related to Vertex AI. On the legacy AI Platform, I was able to see the visual results after the StatisticsGen, ExampleValidator, and Evaluator components finished, but this doesn't seem to be the case on Vertex AI.

jiyongjung0 commented 3 years ago

Right. So it would be nice to fix, but not super critical. :)

For the visualization, I think that you can inspect output artifacts in the notebooks or colabs. We are exploring several approaches for smoother integration, but there is nothing to share at this moment yet.
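
As a concrete example of that kind of notebook inspection, a hedged sketch using TFDV; the GCS paths are placeholders for the artifact URIs you can copy from the Vertex AI run:

import tensorflow_data_validation as tfdv

# Statistics from StatisticsGen (binary FeatureStats proto).
stats = tfdv.load_stats_binary(
    'gs://my-bucket/pipeline_root/StatisticsGen/statistics/1/Split-eval/FeatureStats.pb')
tfdv.visualize_statistics(stats)

# Schema from SchemaGen (text proto).
schema = tfdv.load_schema_text(
    'gs://my-bucket/pipeline_root/SchemaGen/schema/1/schema.pbtxt')
tfdv.display_schema(schema)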

ibzib commented 3 years ago

@ConverJens regarding the ENHANCE_YOUR_CALM error (btw that is a funny error message):

in_memory works but is of course so much slower.

As soon as speed becomes an issue, I would recommend using Flink or Spark instead, since the direct runner isn't optimized for performance.

ConverJens commented 3 years ago

@ibzib It is a funny error message, the only issue is that I'm bad at enhancing my calm ;)

This issue did not happen with TFX 0.28.0 and Beam 2.28 or 2.29. When bumping to TFX 1.0.0rc1, it started to occur with both Beam 2.29 and 2.30.

Some of the processes crash and the pipeline fails to recover, hanging until I terminate it.

I've had the number of workers set to 16. I just started a test run with it set to 4 and it failed in the same way.

I've switched to the Flink runner now, and although it is much faster at computations, it comes with the additional operational cost of maintaining Flink as well as random hangs and errors on the Flink side. So for quick iteration during the development phase, it would be very beneficial to have the direct runner working.

This is the full error message:

E0617 08:33:25.201831822     398 chttp2_transport.cc:1117]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Exception in thread beam_control_read:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 117, in _read
    for data in self._input:
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 391, in __next__
    return self._next()
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 383, in _next
    request = self._look_for_request()
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 367, in _look_for_request
    _raise_rpc_error(self._state)
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 113, in _raise_rpc_error
    raise rpc_error
grpc.RpcError
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: ERROR timestamp {   seconds: 1623918805   nanos: 203286170 } message: "Python sdk harness failed: \nTraceback (most recent call last):\n  File \"/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 166, in main\n    sdk_harness.run()\n  File \"/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 259, in run\n    for work_request in self._control_stub.Control(get_responses()):\n  File \"/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py\", line 426, in __next__\n    return self._next()\n  File \"/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py\", line 826, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"Too many pings\"\n\tdebug_error_string = \"{\"created\":\"@1623918805.202476831\",\"description\":\"Error received from peer ipv4:127.0.0.1:38705\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1067,\"grpc_message\":\"Too many pings\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent call last):\n  File \"/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 166, in main\n    sdk_harness.run()\n  File \"/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 259, in run\n    for work_request in self._control_stub.Control(get_responses()):\n  File \"/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py\", line 426, in __next__\n    return self._next()\n  File \"/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py\", line 826, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"Too many pings\"\n\tdebug_error_string = \"{\"created\":\"@1623918805.202476831\",\"description\":\"Error received from peer ipv4:127.0.0.1:38705\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1067,\"grpc_message\":\"Too many pings\",\"grpc_status\":14}\"\n>\n" log_location: "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:169" thread: "MainThread" 
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 267, in <module>
    main(sys.argv)
  File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 166, in main
    sdk_harness.run()
  File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 259, in run
    for work_request in self._control_stub.Control(get_responses()):
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Too many pings"
    debug_error_string = "{"created":"@1623918805.202476831","description":"Error received from peer ipv4:127.0.0.1:38705","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Too many pings","grpc_status":14}"
>
ERROR:apache_beam.runners.worker.data_plane:Failed to read inputs in the data plane.
Traceback (most recent call last):
  File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
    for elements in elements_iterator:
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 391, in __next__
    return self._next()
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 383, in _next
    request = self._look_for_request()
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 367, in _look_for_request
    _raise_rpc_error(self._state)
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 113, in _raise_rpc_error
    raise rpc_error
grpc.RpcError
E0617 08:33:26.585522269     352 chttp2_transport.cc:1117]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
    target=lambda: self._read_inputs(elements_iterator),
  File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
    for elements in elements_iterator:
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 391, in __next__
    return self._next()
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 383, in _next
    request = self._look_for_request()
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 367, in _look_for_request
    _raise_rpc_error(self._state)
  File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 113, in _raise_rpc_error
    raise rpc_error
grpc.RpcError
Exception in thread run_worker:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/portability/local_job_service.py", line 216, in run
    'Worker subprocess exited with return code %s' % p.returncode)
RuntimeError: Worker subprocess exited with return code 1

axeltidemann commented 3 years ago

How can I read the post_transform_schema from the Transform component? TFDV has a load_stats_binary method, but unfortunately no load_schema_binary. For context, the SchemaGen component still outputs a schema.pbtxt file, but the Transform component outputs a Schema.pb binary file.

ConverJens commented 3 years ago

And on this note, why does SchemaGen produce pbtxt files while Transform uses pb files?

Also, in TFX io_utils there is a SchemaReader class which only reads pbtxt files but could trivially be extended to handle pb files as well.

ConverJens commented 3 years ago

@axeltidemann Regarding reading the new schema, it's quite easy with some code of your own:

from tensorflow_metadata.proto.v0 import schema_pb2
from tfx.utils import io_utils
from tfx.types import artifact_utils

# Resolve the single schema file inside the artifact's URI.
uri = io_utils.get_only_uri_in_dir(artifact_utils.get_single_uri([schema_artifact]))
if uri.endswith('.pbtxt'):
    # Text-format schema (SchemaGen output).
    schema = io_utils.SchemaReader().read(uri)
else:
    # Binary schema proto (Transform output).
    schema = schema_pb2.Schema()
    # get_file_system_or_s3_descriptor is my own file-opening helper;
    # tf.io.gfile.GFile(uri, 'rb') would work equally well here.
    with get_file_system_or_s3_descriptor(uri, 'rb') as f:
        schema.ParseFromString(f.read())
# get_producer/get_name_property and the schemas dict come from my surrounding class,
# shown here only for context.
schemas[self.get_producer(schema_artifact) + ' ' + self.get_name_property(schema_artifact)] = schema
axeltidemann commented 3 years ago

Thank you for sharing, @ConverJens. Ideally, the SchemaReader should be able to read both file types, no?

ConverJens commented 3 years ago

In my opinion, absolutely! Using logic similar to what I posted, i.e. checking the file extension, this would be a trivial change hidden inside the SchemaReader class.
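
As a sketch of what such an extension could look like (not the actual io_utils.SchemaReader implementation), dispatching on the file extension:

import tensorflow as tf
from google.protobuf import text_format
from tensorflow_metadata.proto.v0 import schema_pb2

def read_schema(uri: str) -> schema_pb2.Schema:
    schema = schema_pb2.Schema()
    with tf.io.gfile.GFile(uri, 'rb') as f:
        contents = f.read()
    if uri.endswith('.pbtxt'):
        # Text-format proto, as written by SchemaGen.
        text_format.Parse(contents.decode('utf-8'), schema)
    else:
        # Binary proto, as written by Transform.
        schema.ParseFromString(contents)
    return schema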

jiyongjung0 commented 3 years ago

@axeltidemann @ConverJens Thank you so much for reporting the inconsistency. We are trying to converge the output format to schema.pbtxt in https://github.com/tensorflow/tfx/pull/3940 and it will be included in the final 1.0.0 release.

ConverJens commented 3 years ago

@ibzib @vaskozl I have now started getting the "ENHANCE_YOUR_CALM" error in the beam worker while using the Flink runner as well:

2021-06-23 14:47:07.824331: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:07.824383: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.169492: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.169536: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.379076: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.379143: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.508430: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.508483: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
E0623 15:05:13.989472303    1726 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:10:06.575559109    1754 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:11:49.658671058    1697 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:24:10.444800716    1781 chttp2_transport.cc:1081]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"

I have scaled up my Flink deployment and I'm now using 8 task managers with 8 slots each and sdk_worker_parallelism of 4. This exact setting worked fine on a subset of my data but when trying to ingest all of it, this message reappears.

Since this is now starting to happen in all non-GCP runners, this is becoming quite a blocker for moving to newer TFX versions.
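
For reference, a minimal sketch of the Beam options involved in the setup described above; only sdk_worker_parallelism=4 and the 8x8 parallelism come from the comment, the rest are hypothetical values:

beam_pipeline_args = [
    '--runner=FlinkRunner',
    '--flink_master=flink-jobmanager:8081',   # hypothetical JobManager address
    '--parallelism=64',                       # 8 task managers x 8 slots each
    '--sdk_worker_parallelism=4',
    '--environment_type=EXTERNAL',            # assumption: external Beam worker pool
    '--environment_config=localhost:50000',   # hypothetical worker pool endpoint
]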

ConverJens commented 3 years ago

@vaskozl @ibzib I opened a separate issue for this: https://github.com/tensorflow/tfx/issues/3961

RossKohler commented 3 years ago

@ConverJens I updated from rc0 to rc1 and started receiving the same error in my BulkInferrer component. Downgrading back to rc0 resolved the issue. Obviously not a fix, but just a bit of additional information for this thread.

ConverJens commented 3 years ago

@RossKohler Still, very interesting! I will try and see if it works for me. Thanks for your input!

arghyaganguly commented 3 years ago

#4071

ConverJens commented 3 years ago

@ibzib @vaskozl The "ENHANCE_YOUR_CALM" error has disappeared for me as of Beam 2.31.0, both for DirectRunner and FlinkRunner.

venkat2469 commented 2 years ago

No active issues found on TFX 1.0.0-rc1. Will reopen if any issues are found in the future.
