cc @arghyaganguly
https://github.com/tensorflow/tfx/commit/baccb6bc9650aa8f69548e8ca68473e83606f10b breaks kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(tfx_pipeline) in a CI/CD application where the CI is not configured with production bucket access, because component._resolve_pip_dependencies tries to create the pipeline root.
I personally include the module_file in the docker image and install any deps on top, so I think this behaviour should be optional and bucket access should not be required to build the pipeline.
Edit: This is already the case in 0.30.0.
Something else I'm finding is that Transform now OOMs on relatively small datasets with the default statistics under DirectRunner (I have not tested other Beam runners yet). I'm having to set disable_statistics=True.
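For reference, a minimal sketch of that workaround; the upstream components and module_file path are placeholders:

from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],   # placeholder upstream component
    schema=schema_gen.outputs['schema'],        # placeholder upstream component
    module_file='path/to/preprocessing.py',     # placeholder
    disable_statistics=True,  # skip pre/post-transform statistics to avoid the OOM
)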
Hi, we have a similar problem to what @vaskozl mentioned about the kubeflow_dag_runner. We store all the code (transform, trainer and custom components) in a Docker image to guarantee code versioning etc. I want to provide paths to code in my Docker container that don't exist in my build pipeline when running kubeflow_dag_runner.KubeflowDagRunner.
As a quick fix for the bucket access, you can just set pipeline_root = None during the build process and provide a different pipeline_root as a runtime parameter in Kubeflow. However, the paths provided to the Transform/Trainer need to exist when running kubeflow_dag_runner.KubeflowDagRunner, which means I can't provide paths that only exist in my Docker container. My quick fix right now is to change the paths in the YAML file after running the KubeflowDagRunner. It would be nice to be able to turn off the functionality that packs them into a pip wheel for execution. If there is another way to do it, please let me know.
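A minimal sketch of that build-time workaround, with names as placeholders (pipeline_root is then supplied as a runtime parameter in Kubeflow):

from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner

tfx_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',   # placeholder
    pipeline_root=None,            # avoid needing bucket access at build time
    components=[],                 # add your components here
)

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig()
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(tfx_pipeline)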
@charlesccychen Could you take a look at the above issue?
I'm hitting a Beam error that I have not seen in either TFX 0.28.0 or 0.29.0 (don't know about 0.30.0). This happens with Beam 2.29.0 and 2.30.0 with DirectRunner; I haven't tested any others.
I get:
E0611 13:13:32.522413537 1240 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Has anyone else seen this?
I'm getting the "ENAHCE_YOUR_CALM" when using DirectRunner on 2.29.0 with multi_threading. (The error doesn't appear in the in_memory
as afaik it doesn't use the GRPC communication ).
@vaskozl You're right! From version 0.29.0 and onwards this seems to be an issue with Direct runner and multiprocessing. in_memory works but is of course so much slower.
Did you find a fix/work around for this?
+Kyle Weaver
@ConverJens Sadly not, I've been running with FlinkRunner and DirectRunner in_memory exclusively.
@vaskozl That's a bummer! Thanks for letting me know.
Upgrading from TFX 0.27.0, I see that the resulting YAML files are much larger. There is a --tfx_ir element (which I think is the TFX Intermediate Representation) in the YAML which specifies the entire pipeline for each component, i.e. repeating the specification many times.
For comparison, here is the YAML for the TFX 0.27.0 pipeline, which totals 1.7K lines. The same pipeline YAML in TFX 1.0.0-rc1 is a whopping 63K lines, which eventually makes the YAML file too big, as shown by the following error message:
Is there a way to disable the repeated entry of the tfx_ir?
@vaskozl @ConverJens I've seen this issue in my own pipeline and noticed it was because of a dependency requiring TensorFlow 2.5.0. In my case, it was an unpinned tensorflow_text dependency. Ensuring the TensorFlow version wasn't being updated from TensorFlow 2.4.* (i.e. pinning tensorflow_text to 2.4.3) fixed the issue for me. This also only seemed to happen in my Evaluator and BulkInferrer components, though.
@axeltidemann This has been an issue since TFX switched to IR representation for me: https://github.com/tensorflow/tfx/issues/3459.
@1025KB I think this is quite a bad issue since it prevents the TFX taxi showcase pipeline from being run in Kubeflow, i.e. really the boilerplate components.
@RossKohler Are you referring to the "Enhance your calm" issue? The latest TFX depends on TensorFlow 2.5... I will try to pin TF Text and see if that helps.
@ConverJens yes. TensorFlow 2.5 was the issue for me.
@RossKohler Just tried with TF 2.4.1 and TF Text 2.4.3 but the issue remains for me.
@ConverJens the only other suggestion I have would be to look out for any dependency issues while your image is installing requirements. Other than that, I'm out of ideas:(
@RossKohler I don't think dependency issues are the cause of this but thanks for the tip!
@ibzib Do you have any ideas about the cause of this?
Hi, I have tried out the TFX CLI for Vertex AI like below 👇
!tfx pipeline create \
--pipeline-path=kubeflow_v2_runner.py \
--engine=vertex \
--build-image
then it spit out an error, requests.exceptions.HTTPError: 404 Client Error. The full error message is below, and vp11 is the name of the pipeline.
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.40/distribution/gcr.io/silken-psyche-312112/vp11/json
It looks like TFX somehow cannot push the image to GCR. What do you think?
I just found a bug when copying the taxi template.
The included Dockerfile's base image is FROM tensorflow/tfx:1.0.0-rc1, but there is no such tag. Instead, if I replace it with FROM tensorflow/tfx:1.0.0rc1, it works.
I'd like to suggest a fix in a PR; what should I do?
@deep-diver Thank you for the report and suggestion!!
It seems like an old issue that existed for all rc releases. The tag name is generated in https://github.com/tensorflow/tfx/blob/70d1bb9ff09189316f010e165da742f84e8c73c3/tfx/tools/cli/container_builder/labels.py#L20 and the problem is that the version name pattern is not consistent for RCs. We might be able to add some special handling for RC version around https://github.com/tensorflow/tfx/blob/70d1bb9ff09189316f010e165da742f84e8c73c3/tfx/tools/cli/container_builder/dockerfile.py#L59-L62 .
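For reference, a rough sketch of the kind of special handling that could work; normalize_rc_tag is a hypothetical helper for illustration, not an actual TFX function:

import re

def normalize_rc_tag(version: str) -> str:
    # Map a hyphenated RC version like '1.0.0-rc1' to the tag actually
    # published on Docker Hub ('1.0.0rc1'); leave other versions untouched.
    return re.sub(r'-rc(\d+)$', r'rc\1', version)

assert normalize_rc_tag('1.0.0-rc1') == '1.0.0rc1'
assert normalize_rc_tag('0.30.0') == '0.30.0'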
@jiyongjung0 thanks for the quick response! So this only affects the RCs, and I don't need to open a PR for it?
Also, I have an additional question related to Vertex AI. In the legacy AI Platform, I was able to see the visual results after the StatisticsGen, ExampleValidator and Evaluator components finished, but that doesn't seem to be the case on the Vertex AI platform.
Right. So it would be nice to fix, but not super critical. :)
For the visualization, I think that you can inspect output artifacts in the notebooks or colabs. We are exploring several approaches for smoother integration, but there is nothing to share at this moment yet.
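For example, a minimal sketch of inspecting a StatisticsGen output in a notebook; the artifact path is a placeholder you would look up in ML Metadata or the Vertex AI console:

import tensorflow_data_validation as tfdv

# Placeholder path to a StatisticsGen output artifact.
stats_path = 'gs://my-bucket/pipeline-root/StatisticsGen/statistics/1/Split-train/FeatureStats.pb'

stats = tfdv.load_stats_binary(stats_path)
tfdv.visualize_statistics(stats)  # renders the statistics view in a notebook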
@ConverJens regarding the ENHANCE_YOUR_CALM error (btw that is a funny error message): how do you have --direct_num_workers set? Have you tried adjusting it?
Regarding "in_memory works but is of course so much slower": as soon as speed becomes an issue, I would recommend using Flink or Spark instead, since the direct runner isn't optimized for performance.
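For context, a minimal sketch of how these DirectRunner options are usually passed via a pipeline's beam_pipeline_args (pipeline name and root are placeholders):

from tfx.orchestration import pipeline

my_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',          # placeholder
    pipeline_root='/tmp/pipeline_root',   # placeholder
    components=[],                        # add your components here
    beam_pipeline_args=[
        '--direct_running_mode=multi_processing',  # or in_memory / multi_threading
        '--direct_num_workers=4',
    ],
)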
@ibzib It is a funny error message, the only issue is that I'm bad at enhancing my calm ;)
This issue did not happen with TFX 0.28.0 and beam 2.28 or 2.29. When bumping to TFX 1.0.0rc1 this started to occur for both beam 2.29 and 2.30.
Some of the processes crash, and the pipeline fails to recover and hangs until I terminate it.
I've had it set to 16. I just started a test run with num workers set to 4 and it failed in the same way.
I've switched to the Flink runner now and although it is much faster at computations, it comes with the additional operational cost of maintaining Flink, as well as random hangs and errors on the Flink side. So for quick iteration during the development phase, it would be very beneficial to have the direct runner working.
This is the full error message:
E0617 08:33:25.201831822 398 chttp2_transport.cc:1117] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Exception in thread beam_control_read:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 117, in _read
for data in self._input:
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 391, in __next__
return self._next()
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 383, in _next
request = self._look_for_request()
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 367, in _look_for_request
_raise_rpc_error(self._state)
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 113, in _raise_rpc_error
raise rpc_error
grpc.RpcError
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: ERROR timestamp { seconds: 1623918805 nanos: 203286170 } message: "Python sdk harness failed: \nTraceback (most recent call last):\n File \"/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 166, in main\n sdk_harness.run()\n File \"/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 259, in run\n for work_request in self._control_stub.Control(get_responses()):\n File \"/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py\", line 426, in __next__\n return self._next()\n File \"/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py\", line 826, in _next\n raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"Too many pings\"\n\tdebug_error_string = \"{\"created\":\"@1623918805.202476831\",\"description\":\"Error received from peer ipv4:127.0.0.1:38705\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1067,\"grpc_message\":\"Too many pings\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent call last):\n File \"/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 166, in main\n sdk_harness.run()\n File \"/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 259, in run\n for work_request in self._control_stub.Control(get_responses()):\n File \"/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py\", line 426, in __next__\n return self._next()\n File \"/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py\", line 826, in _next\n raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"Too many pings\"\n\tdebug_error_string = \"{\"created\":\"@1623918805.202476831\",\"description\":\"Error received from peer ipv4:127.0.0.1:38705\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1067,\"grpc_message\":\"Too many pings\",\"grpc_status\":14}\"\n>\n" log_location: "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:169" thread: "MainThread"
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 267, in <module>
main(sys.argv)
File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 166, in main
sdk_harness.run()
File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 259, in run
for work_request in self._control_stub.Control(get_responses()):
File "/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/root/pyenv/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Too many pings"
debug_error_string = "{"created":"@1623918805.202476831","description":"Error received from peer ipv4:127.0.0.1:38705","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Too many pings","grpc_status":14}"
>
ERROR:apache_beam.runners.worker.data_plane:Failed to read inputs in the data plane.
Traceback (most recent call last):
File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 391, in __next__
return self._next()
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 383, in _next
request = self._look_for_request()
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 367, in _look_for_request
_raise_rpc_error(self._state)
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 113, in _raise_rpc_error
raise rpc_error
grpc.RpcError
E0617 08:33:26.585522269 352 chttp2_transport.cc:1117] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 598, in <lambda>
target=lambda: self._read_inputs(elements_iterator),
File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 391, in __next__
return self._next()
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 383, in _next
request = self._look_for_request()
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 367, in _look_for_request
_raise_rpc_error(self._state)
File "/root/pyenv/lib/python3.7/site-packages/grpc/_server.py", line 113, in _raise_rpc_error
raise rpc_error
grpc.RpcError
Exception in thread run_worker:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/root/pyenv/lib/python3.7/site-packages/apache_beam/runners/portability/local_job_service.py", line 216, in run
'Worker subprocess exited with return code %s' % p.returncode)
RuntimeError: Worker subprocess exited with return code 1
How can I read the post_transform_schema from the Transform component? TFDV has a load_stats_binary method, but unfortunately no load_schema_binary. For context, the SchemaGen component still outputs a schema.pbtxt file, but the Transform component outputs a Schema.pb binary file.
And on this note, why does SchemaGen produce pbtxt files while Transform uses pb files?
Also, in TFX io_utils there is a SchemaReader class which only reads pbtxt files but could trivially be extended to handle pb files as well.
@axeltidemann Regarding reading the new schema, it's quite easy with some code of your own:
from tensorflow_metadata.proto.v0 import schema_pb2
from tfx.utils import io_utils
from tfx.types import artifact_utils

uri = io_utils.get_only_uri_in_dir(artifact_utils.get_single_uri([schema_artifact]))
if uri.endswith('.pbtxt'):
    # SchemaGen writes a text proto, which SchemaReader handles.
    schema = io_utils.SchemaReader().read(uri)
else:
    # Transform writes a binary proto; get_file_system_or_s3_descriptor is our own
    # file-opening helper for local/S3 paths.
    schema = schema_pb2.Schema()
    with get_file_system_or_s3_descriptor(uri, 'rb') as f:
        schema.ParseFromString(f.read())
# Taken from our own component code, hence the self.* helpers.
schemas[self.get_producer(schema_artifact) + ' ' + self.get_name_property(schema_artifact)] = schema
Thank you for sharing, @ConverJens. Ideally, the SchemaReader should be able to read both file types, no?
In my opinion, absolutely! Using similar logic to what I posted, i.e. checking the file ending, this would be a trivial change hidden inside the SchemaReader class.
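A minimal sketch of what that extension could look like; read_schema is an illustrative helper, not the actual TFX SchemaReader API:

from tensorflow_metadata.proto.v0 import schema_pb2
from tfx.utils import io_utils

def read_schema(uri: str) -> schema_pb2.Schema:
    # Read a Schema proto from either a text (.pbtxt) or binary (.pb) file.
    if uri.endswith('.pbtxt'):
        return io_utils.SchemaReader().read(uri)
    schema = schema_pb2.Schema()
    with open(uri, 'rb') as f:  # swap in tf.io.gfile.GFile for GCS/S3 paths
        schema.ParseFromString(f.read())
    return schema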
@axeltidemann @ConverJens Thank you so much for reporting the inconsistency. We are trying to converge the output format to schema.pbtxt in https://github.com/tensorflow/tfx/pull/3940 and it will be included in the final 1.0.0 release.
@ibzib @vaskozl I have now started getting the "ENHANCE_YOUR_CALM" error in the beam worker while using the Flink runner as well:
2021-06-23 14:47:07.824331: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:07.824383: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.169492: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.169536: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.379076: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.379143: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-06-23 14:47:08.508430: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 14:47:08.508483: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
E0623 15:05:13.989472303 1726 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:10:06.575559109 1754 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:11:49.658671058 1697 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0623 15:24:10.444800716 1781 chttp2_transport.cc:1081] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
I have scaled up my Flink deployment and I'm now using 8 task managers with 8 slots each and sdk_worker_parallelism of 4. This exact setting worked fine on a subset of my data but when trying to ingest all of it, this message reappears.
Since this is now starting to happen in all non-GCP runners, this is becoming quite a blocker for moving to newer TFX versions.
@vaskozl @ibzib I opened a separate issue for this: https://github.com/tensorflow/tfx/issues/3961
@ConverJens I updated from rc0 to rc1 and started receiving the same error in my BulkInferrer component. Downgrading back to rc0 resolved the issue. Obviously not a fix, but just a bit of additional information for this thread.
@RossKohler Still, very interesting! I will try and see if it works for me. Thanks for your input!
@ibzib @vaskozl The "ENHANCE_YOUR_CALM" error has disappeared for me as of Beam 2.31.0, both for DirectRunner and FlinkRunner.
No active issues found on TFX 1.0.0-rc1; will reopen if any issues are found in the future.
Please comment if you find any issues with TFX 1.0.0-rc1.
Thanks