tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.org/tfx
Apache License 2.0

TFX components in GCP do not display component logs in GCP Vertex AI #6539

Open crbl1122 opened 9 months ago

crbl1122 commented 9 months ago

Describe the current behavior

I am running Kubeflow pipelines with TFX components in GCP Vertex AI. The problem is that no component logs are displayed in the Vertex interface (neither for the main job nor for the pipeline job), while the Logs Explorer shows only framework messages. This happens regardless of the component type (ExampleGen, Trainer, Transform, etc.) and forces me to debug TFX components blind. I submit the pipelines using a service account that has Logs Writer/Reader privileges.

Describe the expected behavior

Be able to view the component logs for code debugging.

singhniraj08 commented 9 months ago

@lego0901,

This issue was raised about viewing component-level logs in Logs Explorer while running TFX pipelines in Vertex AI. I was unable to find any setting that enables the container logs. Please let me know if I am missing anything. Thank you!

ImmanuelXIV commented 9 months ago

+1. Logs are also not displayed when using PyTorch + Kubeflow pipelines. Please fix this; it seems to be a general issue. Not only does it make debugging tricky, but I also can't tell whether the specified GPU and memory are being utilized during training.

singhniraj08 commented 8 months ago

@ImmanuelXIV, this repo is for issues you face while implementing TFX pipelines. I would request that you open an issue with the cloud support team. You can follow Get Support to raise an issue. Thank you!

crbl1122 commented 8 months ago

> +1. Logs are also not displayed when using PyTorch + Kubeflow pipelines. Please fix this; it seems to be a general issue. Not only does it make debugging tricky, but I also can't tell whether the specified GPU and memory are being utilized during training.

Strange that this is a general issue.

adriangay commented 8 months ago

We are experiencing the same issue in VAI trying to migrate our training pipelines to 1.14. I have raised a Google Support Case. Has anyone else experiencing this issue raised a case? Would be good to compare notes.

lego0901 commented 8 months ago

Hello, we also ran several VAI pipelines manually, and we were able to see the component logs regardless of whether a component run failed. This is very strange, so I want to check whether the logs of all components are missing, regardless of whether the components failed, @crbl1122.

But, I can give you a general way to debug.

  1. We usually can't see the component logs if the orchestrator fails to launch a component.

  2. If that's the case, we have to see the orchestrator's log and you can find this in Error Reporting. So please visit there and see if there is a relevant error.

  3. Otherwise, you can follow Get Support to raise an issue.

crbl1122 commented 8 months ago

@lego0901 I confirm that no logs or errors are shown, either for components that run successfully or for those that crash during execution.

lego0901 commented 8 months ago

I would like to express my gratitude for your confirmation.

May I request further information from you so that we can conduct a more thorough investigation into this matter? Since we are unable to reproduce the issue on our end (despite the fact that numerous users are encountering the same problem), we require additional input regarding your specific situation.

Could you kindly provide responses to the following questions:

  1. Did this phenomenon occur prior to TFX version 1.14.0? If not, we can confirm that this is an issue with the TFX codebase, which will allow us to narrow down our investigation.

  2. Could you please provide more detailed information about your running environment? I would like to have the output of the pip freeze command in its entirety so that I can attempt to reproduce the issue in my own environment.

  3. In the scenario you described, would it be possible for you to provide me with a simple example code that reproduces the error? Even a very brief pipeline with a single component would be sufficient.

Thank you very much for your assistance.
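
A reproduction along these lines could start from a plain Python function that writes one recognizable marker line through stdout, stderr, and the `logging` module, so that each channel can be searched for separately in Logs Explorer. This is only a sketch: `emit_marker_logs` and the `log_probe` logger name are hypothetical, and wrapping the function in an actual TFX component is left out.

```python
import logging
import sys
import uuid


def emit_marker_logs(marker=None, stream=None):
    """Write one recognizable line via print, stderr, and logging.

    Returns the marker so it can be searched for in the log viewer.
    `marker` and `stream` are illustrative parameters, not part of any TFX API.
    """
    if marker is None:
        marker = f"log-probe-{uuid.uuid4().hex[:8]}"
    if stream is None:
        stream = sys.stdout

    # Plain stdout and stderr writes, flushed so nothing sits in a buffer.
    print(f"[stdout] {marker}", file=stream, flush=True)
    print(f"[stderr] {marker}", file=sys.stderr, flush=True)

    # A dedicated logger with its own handler, so the result does not
    # depend on how the framework configured the root logger.
    logger = logging.getLogger("log_probe")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        logger.info("[logging] %s", marker)
    finally:
        logger.removeHandler(handler)
    return marker
```

If none of the three lines shows up for a component that ran successfully, the problem is more likely in log routing than in the component code.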

adriangay commented 8 months ago

We also do not see anything in Error Reporting.

We did not see this before TFX 1.14.

We manage dependencies using Poetry. GitHub does not support uploading lock files, but here is the output of pip freeze:

absl-py==1.4.0
anyio==4.2.0
apache-beam==2.50.0
appnope==0.1.3
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
astunparse==1.6.3
attrs==21.4.0
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.1.0
cachetools==5.3.2
certifi==2023.11.17
cffi==1.16.0
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
comm==0.2.1
crcmod==1.7
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.1.1
distlib==0.3.8
dnspython==2.4.2
docker==4.2.2
docopt==0.6.2
docstring-parser==0.15
dynaconf==3.2.4
entrypoints==0.4
exceptiongroup==1.2.0
fastavro==1.9.3
fasteners==0.19
fastjsonschema==2.19.1
filelock==3.13.1
fire==0.5.0
flake8==3.9.2
flatbuffers==23.5.26
fqdn==1.5.1
gast==0.4.0
gensim==4.3.2
google-api-core==2.15.0
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==2.26.1
google-auth-httplib2==0.1.1
google-auth-oauthlib==1.0.0
google-cloud-aiplatform==1.39.0
google-cloud-bigquery==2.34.4
google-cloud-bigquery-storage==2.24.0
google-cloud-bigtable==2.22.0
google-cloud-core==2.4.1
google-cloud-datastore==2.19.0
google-cloud-dlp==3.14.0
google-cloud-language==2.12.0
google-cloud-pubsub==2.19.0
google-cloud-pubsublite==1.9.0
google-cloud-recommendations-ai==0.10.6
google-cloud-resource-manager==1.11.0
google-cloud-spanner==3.40.1
google-cloud-storage==2.14.0
google-cloud-videointelligence==2.12.0
google-cloud-vision==3.5.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
grpc-google-iam-v1==0.13.0
grpcio==1.60.0
grpcio-status==1.48.2
h5py==3.10.0
hdfs==2.7.3
httplib2==0.22.0
identify==2.5.33
idna==3.6
iniconfig==2.0.0
ipykernel==6.28.0
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==7.8.1
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.2
joblib==1.3.2
jsonpointer==2.4
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter_client==7.4.9
jupyter_core==5.7.1
jupyter_server==2.10.0
jupyter_server_terminals==0.5.1
jupyterlab-widgets==1.1.7
jupyterlab_pygments==0.3.0
keras==2.13.1
keras-tuner==1.4.6
kfp==1.8.22
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kt-legacy==1.0.5
kubernetes==12.0.1
libclang==16.0.6
llvmlite==0.41.1
Markdown==3.5.1
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mccabe==0.6.1
mistune==3.0.2
ml-metadata==1.14.0
ml-pipelines-sdk==1.14.0
mock==4.0.3
nbclassic==1.0.0
nbclient==0.9.0
nbconvert==7.14.0
nbformat==5.9.2
nest-asyncio==1.5.8
nodeenv==1.8.0
notebook==6.5.6
notebook_shim==0.2.3
nptyping==2.5.0
numba==0.58.1
numba-progress==1.1.0
numpy==1.24.3
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.6.1
opt-einsum==3.3.0
orjson==3.9.10
overrides==7.4.0
packaging==20.9
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
pecanpy==2.0.8
pexpect==4.9.0
pickleshare==0.7.5
pillow==10.2.0
platformdirs==4.1.0
pluggy==1.3.0
portpicker==1.6.0
pre-commit==2.13.0
prometheus-client==0.19.0
prompt-toolkit==3.0.43
proto-plus==1.23.0
protobuf==3.20.3
psutil==5.9.7
ptyprocess==0.7.0
pyarrow==10.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycodestyle==2.7.0
pycparser==2.21
pydantic==1.10.13
pydot==1.4.2
pyfarmhash==0.3.2
pyflakes==2.3.1
Pygments==2.17.2
pymongo==4.6.1
pyparsing==3.1.1
pyrsistent==0.20.0
pytest==7.4.0
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==24.0.1
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
scikit-learn==1.3.2
scipy==1.11.4
Send2Trash==1.8.2
Shapely==1.8.5.post1
six==1.16.0
smart-open==6.4.0
sniffio==1.3.0
soupsieve==2.5
sqlparse==0.4.4
strip-hints==0.1.10
tabulate==0.9.0
tensorboard==2.13.0
tensorboard-data-server==0.7.2
tensorflow==2.13.1
tensorflow-addons==0.23.0
tensorflow-data-validation==1.14.0
tensorflow-estimator==2.13.0
tensorflow-hub==0.13.0
tensorflow-io-gcs-filesystem==0.35.0
tensorflow-metadata==1.14.0
tensorflow-model-analysis==0.45.0
tensorflow-serving-api==2.13.1
tensorflow-transform==1.14.0
termcolor==2.4.0
terminado==0.18.0
tfx==1.14.0
tfx-bsl==1.14.0
threadpoolctl==3.2.0
tinycss2==1.2.1
toml==0.10.2
tomli==2.0.1
tornado==6.4
tqdm==4.66.1
traitlets==5.14.1
typeguard==2.13.3
typer==0.9.0
types-python-dateutil==2.8.19.20240106
typing_extensions==4.5.0
uri-template==1.3.0
uritemplate==3.0.1
urllib3==1.26.18
virtualenv==20.25.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
Werkzeug==3.0.1
widgetsnbextension==3.6.6
wrapt==1.16.0
zstandard==0.22.0

crbl1122 commented 8 months ago

> I would like to express my gratitude for your confirmation.
>
> May I request further information from you so that we can conduct a more thorough investigation into this matter? Since we are unable to reproduce the issue on our end (despite the fact that numerous users are encountering the same problem), we require additional input regarding your specific situation.
>
> Could you kindly provide responses to the following questions:
>
> 1. Did this phenomenon occur prior to TFX version 1.14.0?
>    If not, we can confirm that this is an issue with the TFX codebase, which will allow us to narrow down our investigation.
>
> 2. Could you please provide more detailed information about your running environment?
>    I would like to have the output of the `pip freeze` command in its entirety so that I can attempt to reproduce the issue in my own environment.
>
> 3. In the scenario you described, would it be possible for you to provide me with a simple example code that reproduces the error?
>    Even a very brief pipeline with a single component would be sufficient.
>
> Thank you very much for your assistance.

Hi,

TFX==1.12.0. The problem is for any standard TFX component.

absl-py==1.4.0
aiohttp-cors==0.7.0
aiorwlock==1.3.0
ansiwrap==0.8.4
apache-beam==2.45.0
astunparse==1.6.3
asynctest==0.13.0
attrs==20.3.0
Babel==2.12.1
backoff==2.2.1
blessed==1.20.0
cachetools==4.2.4
certifi==2023.7.22
click==8.1.7
cloud-tpu-client==0.10
cloud-tpu-profiler==2.4.0
cloudpickle==2.2.1
colorama==0.4.6
colorful==0.5.5
comm==0.1.4
conda==22.9.0
crcmod==1.7
cycler==0.11.0
Cython==3.0.2
dacite==1.8.1
db-dtypes==1.1.1
Deprecated==1.2.14
dill==0.3.1.1
distlib==0.3.7
dm-tree==0.1.8
docker==4.4.4
docopt==0.6.2
docstring-parser==0.15
etils==0.9.0
explainable-ai-sdk==1.3.3
Farama-Notifications==0.0.4
fastapi==0.103.1
fastavro==1.8.0
fasteners==0.19
filelock==3.12.2
flatbuffers==2.0.7
fonttools==4.38.0
fsspec==2023.1.0
future==0.18.3
gast==0.3.3
gcsfs==2023.1.0
gitdb==4.0.10
GitPython==3.1.37
google-api-core==1.34.0
google-api-python-client==1.8.0
google-apitools==0.5.31
google-auth-httplib2==0.1.1
google-auth-oauthlib==0.4.6
google-cloud-aiplatform==1.17.1
google-cloud-artifact-registry==1.8.3
google-cloud-bigquery==2.34.4
google-cloud-bigquery-storage==2.16.2
google-cloud-bigtable==1.7.3
google-cloud-dlp==3.9.2
google-cloud-language==1.3.2
google-cloud-monitoring==2.15.1
google-cloud-pubsub==2.13.11
google-cloud-pubsublite==1.6.0
google-cloud-recommendations-ai==0.7.1
google-cloud-resource-manager==1.6.3
google-cloud-spanner==3.26.0
google-cloud-storage==2.11.0
google-cloud-videointelligence==1.16.3
google-cloud-vision==3.1.4
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.6.0
gpustat==1.0.0
greenlet==2.0.2
grpc-google-iam-v1==0.12.6
grpcio==1.58.0
gviz-api==1.10.0
gymnasium==0.28.1
h11==0.14.0
h5py==2.10.0
hdfs==2.7.2
htmlmin==0.1.12
httplib2==0.20.4
ImageHash==4.3.1
imageio==2.31.2
importlib-resources==5.12.0
ipython-genutils==0.2.0
ipython-sql==0.5.0
ipywidgets==7.8.1
jaraco.classes==3.2.3
jax-jumpy==1.0.0
jeepney==0.8.0
Jinja2==2.11.3
joblib==1.3.2
json5==0.9.14
jupyter-http-over-ws==0.0.8
jupyter-server-mathjax==0.2.6
jupyter-server-proxy==3.2.2
jupyterlab==3.4.8
jupyterlab-widgets==1.1.7
jupyterlab_git==0.43.0
jupyterlab_server==2.24.0
jupytext==1.15.2
keras==2.11.0
keras-core==0.0.0
Keras-Preprocessing==1.1.2
keras-tuner==1.4.1
keyring==24.1.1
keyrings.google-artifactregistry-auth==1.1.2
kfp==2.6.0
kfp-pipeline-spec==0.3.0
kfp-server-api==2.0.5
kiwisolver==1.4.5
kt-legacy==1.0.5
kubernetes==11.0.0
libclang==16.0.6
llvmlite==0.39.1
lz4==4.3.2
Markdown==3.4.4
markdown-it-py==2.2.0
MarkupSafe==2.0.1
matplotlib==3.5.3
mdit-py-plugins==0.3.5
mdurl==0.1.2
mistune==0.8.4
ml-metadata==1.12.0
ml-pipelines-sdk==1.12.0
more-itertools==9.1.0
msgpack==1.0.5
multimethod==1.9.1
nbclient==0.5.13
nbconvert==6.4.5
nbdime==3.2.0
networkx==2.6.3
numba==0.56.4
numpy==1.21.6
nvidia-ml-py==11.495.46
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.6.1
opencensus==0.11.3
opencensus-context==0.1.3
opentelemetry-api==1.20.0
opentelemetry-exporter-otlp==1.20.0
opentelemetry-exporter-otlp-proto-common==1.20.0
opentelemetry-exporter-otlp-proto-grpc==1.20.0
opentelemetry-exporter-otlp-proto-http==1.20.0
opentelemetry-proto==1.20.0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
opt-einsum==3.3.0
orjson==3.9.7
overrides==6.5.0
packaging==20.9
pandas==1.3.5
pandas-profiling==3.6.6
papermill==2.4.0
patsy==0.5.3
phik==0.12.3
Pillow==9.5.0
platformdirs==3.10.0
plotly==5.17.0
pluggy==1.2.0
portpicker==1.6.0
prettytable==3.7.0
promise==2.3
proto-plus==1.22.3
protobuf==3.20.1
py-spy==0.3.14
pyarrow==6.0.1
pydantic==1.10.12
pydot==1.4.2
pyfarmhash==0.3.2
PyJWT==2.8.0
pymongo==3.13.0
pyparsing==3.1.1
pytz==2023.3.post1
PyWavelets==1.3.0
PyYAML==5.4.1
ray==2.7.0
ray-cpp==2.7.0
regex==2023.8.8
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
retrying==1.3.3
rich==13.5.3
scikit-image==0.19.3
scikit-learn==1.0.2
scipy==1.7.3
seaborn==0.12.2
SecretStorage==3.3.3
simpervisor==0.4
smart-open==6.4.0
smmap==5.0.1
SQLAlchemy==2.0.21
sqlparse==0.4.4
starlette==0.27.0
statsmodels==0.13.5
tabulate==0.9.0
tangled-up-in-unicode==0.2.0
tenacity==8.2.3
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-profile==2.13.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6
tensorflow==2.11.0
tensorflow-cloud==0.1.16
tensorflow-data-validation==1.12.0
tensorflow-datasets==4.8.2
tensorflow-estimator==2.11.0
tensorflow-hub==0.9.0
tensorflow-io==0.29.0
tensorflow-io-gcs-filesystem==0.29.0
tensorflow-metadata==1.12.0
tensorflow-model-analysis==0.43.0
tensorflow-probability==0.19.0
tensorflow-serving-api==2.11.0
tensorflow-transform==1.12.0
termcolor==2.3.0
testpath==0.6.0
textwrap3==0.9.2
tfx==1.12.0
tfx-bsl==1.12.0
threadpoolctl==3.1.0
tifffile==2021.11.2
toml==0.10.2
tomli==2.0.1
tqdm==4.66.1
typeguard==2.13.3
typer==0.9.0
uritemplate==3.0.1
uvicorn==0.22.0
virtualenv==20.21.0
visions==0.7.5
watchfiles==0.20.0
Werkzeug==2.1.2
widgetsnbextension==3.6.6
witwidget==1.8.1
wordcloud==1.9.2
wrapt==1.15.0
ydata-profiling==4.5.1
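
The two `pip freeze` listings in this thread can be compared mechanically rather than by eye. The following is a minimal sketch using only the standard library; `parse_freeze` and `diff_freezes` are hypothetical helper names, and the sample strings contain just a few pins copied from the two listings above, not the full environments.

```python
def parse_freeze(text):
    """Map lower-cased package name -> pinned version from `name==version` tokens."""
    pins = {}
    # str.split() handles one-pin-per-line and space-separated listings alike.
    for token in text.split():
        name, sep, version = token.partition("==")
        if sep:  # skip anything that is not a `name==version` pin
            pins[name.lower()] = version
    return pins


def diff_freezes(left, right):
    """Return (only_left, only_right, changed) for two parsed freeze mappings."""
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    changed = {
        name: (left[name], right[name])
        for name in sorted(set(left) & set(right))
        if left[name] != right[name]
    }
    return only_left, only_right, changed


# A few pins copied from the two listings in this thread.
env_a = parse_freeze("tfx==1.14.0\nkfp==1.8.22\ntensorflow==2.13.1\nanyio==4.2.0")
env_b = parse_freeze("tfx==1.12.0 kfp==2.6.0 tensorflow==2.11.0 ray==2.7.0")

only_a, only_b, changed = diff_freezes(env_a, env_b)
print("only in A:", only_a)
print("only in B:", only_b)
for name, (a, b) in changed.items():
    print(f"{name}: {a} -> {b}")
```

Pasting the complete listings into the two `parse_freeze` calls gives the full set of differences between the environments.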

lego0901 commented 7 months ago

Thanks for providing your environments!

However, I was not able to reproduce the phenomenon with either configuration, using the VAI example running locally.

I think some configuration outside TFX is outdated, which is why the logs are not displayed. Let me contact a Vertex AI team engineer internally to figure out the problem. Thank you.

ERROR: Ignored the following yanked versions: 3.0.6, 3.5.0, 3.7.0, 3.17.0, 4.0.0, 4.0.1, 4.0.2, 4.0.3, 4.0.4, 4.0.5, 4.0.7, 4.0.8, 4.0.9, 4.1.2, 4.1.6, 4.2.6, 4.2.7, 4.3.13, 4.3.16
ERROR: Ignored the following versions that require a different python version: 2.10.0 Requires-Python >=2.7,<3.0; 2.3.0 Requires-Python >=2.7,<3.0; 2.4.0 Requires-Python >=2.7,<3.0; 2.5.0 Requires-Python >=2.7,<3.0; 2.6.0 Requires-Python >=2.7,<3.0; 2.7.0 Requires-Python >=2.7,<3.0; 2.8.0 Requires-Python >=2.7,<3.0; 2.9.0 Requires-Python >=2.7,<3.0
ERROR: Could not find a version that satisfies the requirement conda==22.9.0 (from versions: none)
ERROR: No matching distribution found for conda==22.9.0

adriangay commented 7 months ago

@lego0901 Hi, thank you for pursuing this. I see you cannot reproduce with the Penguin Example. I will try to reproduce with a simple pipeline to further aid problem determination...

crbl1122 commented 7 months ago

@lego0901 Hi, I want to add that the same problem occurs for Apache Beam jobs in Dataflow. No logs are displayed. So far, except for Kubeflow, none of the other pipeline types I tested (TFX, Dataflow/Beam) produce any logs.

IzakMaraisTAL commented 2 months ago

We had the same problem when we initially migrated to Vertex from Kubeflow, but only in our production GCP project. Logs worked fine in the testing project.

It took a long investigation to discover that the cause was that the default logging bucket for the production GCP project was disabled. Enabling the logging bucket fixed the problem.

crbl1122 commented 2 months ago

> We had the same problem when we initially migrated to Vertex from Kubeflow, but only in our production GCP project. Logs worked fine in the testing project.
>
> It took a long investigation to discover that the cause was that the default logging bucket for the production GCP project was disabled. Enabling the logging bucket fixed the problem.

Hi @IzakMaraisTAL, where is the default logging bucket defined, and how did you enable it? In a Vertex AI pipeline that does not use TFX, no special setting is needed to view the logs.

IzakMaraisTAL commented 2 months ago

@crbl1122, it might be that our issue is different from what you are seeing. We have not tested any non-TFX Vertex AI pipelines.

The following instructions are LLM-generated, but I double-checked them in our testing project and they seem correct:

To re-enable the default logging buckets for a Google Cloud Platform (GCP) project, you can follow these steps:

  1. Open the Google Cloud Console: Go to the Google Cloud Console and select the project for which you want to enable logging.
  2. Navigate to Logging: In the navigation menu, go to Logging under the Operations section.
  3. Logs Storage: Click on Logs Storage in the left-hand menu. This will show you the list of logging buckets available in your project.
  4. Check the Status: Ensure that the logging bucket you are interested in is not disabled. If it is disabled, you will need to enable it.
  5. Modify Bucket Settings:
    • Click on the logging bucket that you want to modify.
    • In the bucket details page, look for an option to Edit or Enable the bucket.
    • Make sure that the bucket is set to receive logs. You might need to adjust the settings to ensure it's capturing the appropriate log types and from the correct sources.
  6. Verify Configuration: After enabling the bucket, it's a good idea to verify that logs are being stored correctly. You can do this by checking for recent logs in the Logs Explorer.
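
The status check in step 4 can also be scripted. The sketch below rests on an assumption that goes beyond what the thread states: that the switch in question is the `disabled` field of the `_Default` log sink, which routes logs into the `_Default` bucket. It expects the sink description as a JSON string, for example the output of `gcloud logging sinks describe _Default --format=json`; `sink_status` is a hypothetical helper and the sample data is hand-written with a placeholder project ID.

```python
import json


def sink_status(sink_json):
    """Summarize a log sink description given as a JSON string."""
    sink = json.loads(sink_json)
    return {
        "name": sink.get("name"),
        "destination": sink.get("destination"),
        # An absent `disabled` field is treated as "enabled".
        "disabled": bool(sink.get("disabled", False)),
    }


# Hand-written sample; `my-project` is a placeholder, not a real project.
sample = json.dumps({
    "name": "_Default",
    "destination": (
        "logging.googleapis.com/projects/my-project"
        "/locations/global/buckets/_Default"
    ),
    "disabled": True,
})

status = sink_status(sample)
if status["disabled"]:
    print(f"Sink {status['name']} is disabled; "
          f"nothing is routed to {status['destination']}")
```

Running the same check against both the testing and the production project would show whether the two differ in this setting.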

crbl1122 commented 2 months ago

@IzakMaraisTAL Many thanks for the info.