Open crbl1122 opened 9 months ago
@lego0901,
This issue has been raised for viewing component level logs in Logs explorer while running TFX pipelines in Vertex AI. I was unable to find any settings which can enable in the container logs. Please let me know if I am missing anything. Thank you!
+1. Logs are also not displayed when using PyTorch + Kubeflow pipelines. Please fix it, this seems to be a general issue. It not only makes debugging tricky, but I also can't tell whether the specified GPU and memory are utilized during training.
@ImmanuelXIV, this repo is for issues you face while implementing TFX pipelines. Please open an issue with the Cloud support team; you can follow Get Support to raise it. Thank you!
Strange that this is a general issue.
We are experiencing the same issue in VAI trying to migrate our training pipelines to 1.14. I have raised a Google Support Case. Has anyone else experiencing this issue raised a case? Would be good to compare notes.
Hello, we also ran several VAI pipelines manually, and we were able to see the component logs regardless of whether a component run failed. This is very strange, so I want to check whether the logs are missing for “all” components, whether or not they failed, @crbl1122.
But, I can give you a general way to debug.
We usually can't see the component logs if the orchestrator fails to launch a component.
If that's the case, we have to see the orchestrator's log and you can find this in Error Reporting. So please visit there and see if there is a relevant error.
Otherwise, you can follow Get Support to raise an issue.
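To narrow down where the logs go missing, it can help to query the Logs Explorer with an explicit filter instead of relying on the links in the Vertex UI. The following is a minimal sketch that only builds the query strings. The resource types (`aiplatform.googleapis.com/PipelineJob` for the pipeline run and `ml_job` for the custom job backing a component) and their label names are assumptions based on how Vertex AI usually labels its log entries, and the function names are hypothetical, so please verify them against an entry you can actually see in your project.

```python
def pipeline_log_filter(pipeline_job_id: str, min_severity: str = "") -> str:
    """Build a Logs Explorer query for one Vertex AI pipeline run.

    The resource type and label name are assumptions, not confirmed
    in this thread.
    """
    clauses = [
        'resource.type="aiplatform.googleapis.com/PipelineJob"',
        f'resource.labels.pipeline_job_id="{pipeline_job_id}"',
    ]
    if min_severity:
        # Optional lower bound such as "ERROR" or "WARNING".
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)


def component_log_filter(custom_job_id: str) -> str:
    """Build a query for the custom job that runs a single component.

    Assumes component containers log under resource type "ml_job".
    """
    clauses = [
        'resource.type="ml_job"',
        f'resource.labels.job_id="{custom_job_id}"',
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # "1234567890" is a placeholder ID, not taken from a real run.
    print(pipeline_log_filter("1234567890", min_severity="ERROR"))
    print(component_log_filter("1234567890"))
```

If the pipeline-level query returns entries but the component-level one returns nothing, that suggests the component logs are being dropped or routed elsewhere rather than never being written.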
@lego0901 I confirm that no logs or errors are seen, either for components that run successfully or for ones that crash during execution.
I would like to express my gratitude for your confirmation.
May I request further information from you so that we can conduct a more thorough investigation into this matter? Since we are unable to reproduce the issue on our end (despite the fact that numerous users are encountering the same problem), we require additional input regarding your specific situation.
Could you kindly provide responses to the following questions:
1. Did this phenomenon occur prior to TFX version 1.14.0? If not, we can confirm that this is an issue with the TFX codebase, which will allow us to narrow down our investigation.
2. Could you please provide more detailed information about your running environment? I would like to have the output of the `pip freeze` command in its entirety so that I can attempt to reproduce the issue in my own environment.
3. In the scenario you described, would it be possible for you to provide me with a simple example code that reproduces the error? Even a very brief pipeline with a single component would be sufficient.
Thank you very much for your assistance.
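For a small reproduction, one option is to have the component code emit a unique marker through every output channel and then search the Logs Explorer for that marker. This is only a sketch using the Python standard library: `emit_log_probe` is a hypothetical helper, not part of TFX, and it assumes the component's entry point is Python code you can edit.

```python
import logging
import sys
import uuid


def emit_log_probe(marker=None):
    """Write one marker line to stdout, stderr, and the root logger.

    Returns the marker so it can be searched for afterwards.
    """
    marker = marker or f"log-probe-{uuid.uuid4().hex[:8]}"
    # Plain stdout, flushed so it is not lost if the container exits early.
    print(f"{marker} stdout", flush=True)
    # Plain stderr.
    print(f"{marker} stderr", file=sys.stderr, flush=True)
    # Python logging at WARNING, which is emitted even without any
    # logging configuration.
    logging.getLogger().warning("%s logging", marker)
    return marker


if __name__ == "__main__":
    emit_log_probe()
```

Calling this at the start of a single custom component and searching for the marker text would show which of the three channels, if any, reaches Cloud Logging.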
We also do not see anything in Error Reporting.
We did not see this before TFX 1.14.
We manage dependencies using Poetry. GitHub does not support upload of lock files, but here is the output of pip freeze:
absl-py==1.4.0
anyio==4.2.0
apache-beam==2.50.0
appnope==0.1.3
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
astunparse==1.6.3
attrs==21.4.0
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.1.0
cachetools==5.3.2
certifi==2023.11.17
cffi==1.16.0
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
comm==0.2.1
crcmod==1.7
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.1.1
distlib==0.3.8
dnspython==2.4.2
docker==4.2.2
docopt==0.6.2
docstring-parser==0.15
dynaconf==3.2.4
entrypoints==0.4
exceptiongroup==1.2.0
fastavro==1.9.3
fasteners==0.19
fastjsonschema==2.19.1
filelock==3.13.1
fire==0.5.0
flake8==3.9.2
flatbuffers==23.5.26
fqdn==1.5.1
gast==0.4.0
gensim==4.3.2
google-api-core==2.15.0
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==2.26.1
google-auth-httplib2==0.1.1
google-auth-oauthlib==1.0.0
google-cloud-aiplatform==1.39.0
google-cloud-bigquery==2.34.4
google-cloud-bigquery-storage==2.24.0
google-cloud-bigtable==2.22.0
google-cloud-core==2.4.1
google-cloud-datastore==2.19.0
google-cloud-dlp==3.14.0
google-cloud-language==2.12.0
google-cloud-pubsub==2.19.0
google-cloud-pubsublite==1.9.0
google-cloud-recommendations-ai==0.10.6
google-cloud-resource-manager==1.11.0
google-cloud-spanner==3.40.1
google-cloud-storage==2.14.0
google-cloud-videointelligence==2.12.0
google-cloud-vision==3.5.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
grpc-google-iam-v1==0.13.0
grpcio==1.60.0
grpcio-status==1.48.2
h5py==3.10.0
hdfs==2.7.3
httplib2==0.22.0
identify==2.5.33
idna==3.6
iniconfig==2.0.0
ipykernel==6.28.0
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==7.8.1
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.2
joblib==1.3.2
jsonpointer==2.4
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter_client==7.4.9
jupyter_core==5.7.1
jupyter_server==2.10.0
jupyter_server_terminals==0.5.1
jupyterlab-widgets==1.1.7
jupyterlab_pygments==0.3.0
keras==2.13.1
keras-tuner==1.4.6
kfp==1.8.22
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kt-legacy==1.0.5
kubernetes==12.0.1
libclang==16.0.6
llvmlite==0.41.1
Markdown==3.5.1
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mccabe==0.6.1
mistune==3.0.2
ml-metadata==1.14.0
ml-pipelines-sdk==1.14.0
mock==4.0.3
nbclassic==1.0.0
nbclient==0.9.0
nbconvert==7.14.0
nbformat==5.9.2
nest-asyncio==1.5.8
nodeenv==1.8.0
notebook==6.5.6
notebook_shim==0.2.3
nptyping==2.5.0
numba==0.58.1
numba-progress==1.1.0
numpy==1.24.3
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.6.1
opt-einsum==3.3.0
orjson==3.9.10
overrides==7.4.0
packaging==20.9
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
pecanpy==2.0.8
pexpect==4.9.0
pickleshare==0.7.5
pillow==10.2.0
platformdirs==4.1.0
pluggy==1.3.0
portpicker==1.6.0
pre-commit==2.13.0
prometheus-client==0.19.0
prompt-toolkit==3.0.43
proto-plus==1.23.0
protobuf==3.20.3
psutil==5.9.7
ptyprocess==0.7.0
pyarrow==10.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycodestyle==2.7.0
pycparser==2.21
pydantic==1.10.13
pydot==1.4.2
pyfarmhash==0.3.2
pyflakes==2.3.1
Pygments==2.17.2
pymongo==4.6.1
pyparsing==3.1.1
pyrsistent==0.20.0
pytest==7.4.0
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==24.0.1
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
scikit-learn==1.3.2
scipy==1.11.4
Send2Trash==1.8.2
Shapely==1.8.5.post1
six==1.16.0
smart-open==6.4.0
sniffio==1.3.0
soupsieve==2.5
sqlparse==0.4.4
strip-hints==0.1.10
tabulate==0.9.0
tensorboard==2.13.0
tensorboard-data-server==0.7.2
tensorflow==2.13.1
tensorflow-addons==0.23.0
tensorflow-data-validation==1.14.0
tensorflow-estimator==2.13.0
tensorflow-hub==0.13.0
tensorflow-io-gcs-filesystem==0.35.0
tensorflow-metadata==1.14.0
tensorflow-model-analysis==0.45.0
tensorflow-serving-api==2.13.1
tensorflow-transform==1.14.0
termcolor==2.4.0
terminado==0.18.0
tfx==1.14.0
tfx-bsl==1.14.0
threadpoolctl==3.2.0
tinycss2==1.2.1
toml==0.10.2
tomli==2.0.1
tornado==6.4
tqdm==4.66.1
traitlets==5.14.1
typeguard==2.13.3
typer==0.9.0
types-python-dateutil==2.8.19.20240106
typing_extensions==4.5.0
uri-template==1.3.0
uritemplate==3.0.1
urllib3==1.26.18
virtualenv==20.25.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
Werkzeug==3.0.1
widgetsnbextension==3.6.6
wrapt==1.16.0
zstandard==0.22.0
Hi,
TFX==1.12.0. The problem is for any standard TFX component. absl-py==1.4.0 aiohttp-cors==0.7.0 aiorwlock==1.3.0 ansiwrap==0.8.4 apache-beam==2.45.0 astunparse==1.6.3 asynctest==0.13.0 attrs==20.3.0 Babel==2.12.1 backoff==2.2.1 blessed==1.20.0 cachetools==4.2.4 certifi==2023.7.22 click==8.1.7 cloud-tpu-client==0.10 cloud-tpu-profiler==2.4.0 cloudpickle==2.2.1 colorama==0.4.6 colorful==0.5.5 comm==0.1.4 conda==22.9.0 crcmod==1.7 cycler==0.11.0 Cython==3.0.2 dacite==1.8.1 db-dtypes==1.1.1 Deprecated==1.2.14 dill==0.3.1.1 distlib==0.3.7 dm-tree==0.1.8 docker==4.4.4 docopt==0.6.2 docstring-parser==0.15 etils==0.9.0 explainable-ai-sdk==1.3.3 Farama-Notifications==0.0.4 fastapi==0.103.1 fastavro==1.8.0 fasteners==0.19 filelock==3.12.2 flatbuffers==2.0.7 fonttools==4.38.0 fsspec==2023.1.0 future==0.18.3 gast==0.3.3 gcsfs==2023.1.0 gitdb==4.0.10 GitPython==3.1.37 google-api-core==1.34.0 google-api-python-client==1.8.0 google-apitools==0.5.31 google-auth-httplib2==0.1.1 google-auth-oauthlib==0.4.6 google-cloud-aiplatform==1.17.1 google-cloud-artifact-registry==1.8.3 google-cloud-bigquery==2.34.4 google-cloud-bigquery-storage==2.16.2 google-cloud-bigtable==1.7.3 google-cloud-dlp==3.9.2 google-cloud-language==1.3.2 google-cloud-monitoring==2.15.1 google-cloud-pubsub==2.13.11 google-cloud-pubsublite==1.6.0 google-cloud-recommendations-ai==0.7.1 google-cloud-resource-manager==1.6.3 google-cloud-spanner==3.26.0 google-cloud-storage==2.11.0 google-cloud-videointelligence==1.16.3 google-cloud-vision==3.1.4 google-crc32c==1.5.0 google-pasta==0.2.0 google-resumable-media==2.6.0 gpustat==1.0.0 greenlet==2.0.2 grpc-google-iam-v1==0.12.6 grpcio==1.58.0 gviz-api==1.10.0 gymnasium==0.28.1 h11==0.14.0 h5py==2.10.0 hdfs==2.7.2 htmlmin==0.1.12 httplib2==0.20.4 ImageHash==4.3.1 imageio==2.31.2 importlib-resources==5.12.0 ipython-genutils==0.2.0 ipython-sql==0.5.0 ipywidgets==7.8.1 jaraco.classes==3.2.3 jax-jumpy==1.0.0 jeepney==0.8.0 Jinja2==2.11.3 joblib==1.3.2 json5==0.9.14 
jupyter-http-over-ws==0.0.8 jupyter-server-mathjax==0.2.6 jupyter-server-proxy==3.2.2 jupyterlab==3.4.8 jupyterlab-widgets==1.1.7 jupyterlab_git==0.43.0 jupyterlab_server==2.24.0 jupytext==1.15.2 keras==2.11.0 keras-core==0.0.0 Keras-Preprocessing==1.1.2 keras-tuner==1.4.1 keyring==24.1.1 keyrings.google-artifactregistry-auth==1.1.2 kfp==2.6.0 kfp-pipeline-spec==0.3.0 kfp-server-api==2.0.5 kiwisolver==1.4.5 kt-legacy==1.0.5 kubernetes==11.0.0 libclang==16.0.6 llvmlite==0.39.1 lz4==4.3.2 Markdown==3.4.4 markdown-it-py==2.2.0 MarkupSafe==2.0.1 matplotlib==3.5.3 mdit-py-plugins==0.3.5 mdurl==0.1.2 mistune==0.8.4 ml-metadata==1.12.0 ml-pipelines-sdk==1.12.0 more-itertools==9.1.0 msgpack==1.0.5 multimethod==1.9.1 nbclient==0.5.13 nbconvert==6.4.5 nbdime==3.2.0 networkx==2.6.3 numba==0.56.4 numpy==1.21.6 nvidia-ml-py==11.495.46 oauth2client==4.1.3 oauthlib==3.2.2 objsize==0.6.1 opencensus==0.11.3 opencensus-context==0.1.3 opentelemetry-api==1.20.0 opentelemetry-exporter-otlp==1.20.0 opentelemetry-exporter-otlp-proto-common==1.20.0 opentelemetry-exporter-otlp-proto-grpc==1.20.0 opentelemetry-exporter-otlp-proto-http==1.20.0 opentelemetry-proto==1.20.0 opentelemetry-sdk==1.20.0 opentelemetry-semantic-conventions==0.41b0 opt-einsum==3.3.0 orjson==3.9.7 overrides==6.5.0 packaging==20.9 pandas==1.3.5 pandas-profiling==3.6.6 papermill==2.4.0 patsy==0.5.3 phik==0.12.3 Pillow==9.5.0 platformdirs==3.10.0 plotly==5.17.0 pluggy==1.2.0 portpicker==1.6.0 prettytable==3.7.0 promise==2.3 proto-plus==1.22.3 protobuf==3.20.1 py-spy==0.3.14 pyarrow==6.0.1 pydantic==1.10.12 pydot==1.4.2 pyfarmhash==0.3.2 PyJWT==2.8.0 pymongo==3.13.0 pyparsing==3.1.1 pytz==2023.3.post1 PyWavelets==1.3.0 PyYAML==5.4.1 ray==2.7.0 ray-cpp==2.7.0 regex==2023.8.8 requests-oauthlib==1.3.1 requests-toolbelt==0.10.1 retrying==1.3.3 rich==13.5.3 scikit-image==0.19.3 scikit-learn==1.0.2 scipy==1.7.3 seaborn==0.12.2 SecretStorage==3.3.3 simpervisor==0.4 smart-open==6.4.0 smmap==5.0.1 SQLAlchemy==2.0.21 sqlparse==0.4.4 
starlette==0.27.0 statsmodels==0.13.5 tabulate==0.9.0 tangled-up-in-unicode==0.2.0 tenacity==8.2.3 tensorboard==2.11.2 tensorboard-data-server==0.6.1 tensorboard-plugin-profile==2.13.1 tensorboard-plugin-wit==1.8.1 tensorboardX==2.6 tensorflow==2.11.0 tensorflow-cloud==0.1.16 tensorflow-data-validation==1.12.0 tensorflow-datasets==4.8.2 tensorflow-estimator==2.11.0 tensorflow-hub==0.9.0 tensorflow-io==0.29.0 tensorflow-io-gcs-filesystem==0.29.0 tensorflow-metadata==1.12.0 tensorflow-model-analysis==0.43.0 tensorflow-probability==0.19.0 tensorflow-serving-api==2.11.0 tensorflow-transform==1.12.0 termcolor==2.3.0 testpath==0.6.0 textwrap3==0.9.2 tfx==1.12.0 tfx-bsl==1.12.0 threadpoolctl==3.1.0 tifffile==2021.11.2 toml==0.10.2 tomli==2.0.1 tqdm==4.66.1 typeguard==2.13.3 typer==0.9.0 uritemplate==3.0.1 uvicorn==0.22.0 virtualenv==20.21.0 visions==0.7.5 watchfiles==0.20.0 Werkzeug==2.1.2 widgetsnbextension==3.6.6 witwidget==1.8.1 wordcloud==1.9.2 wrapt==1.15.0 ydata-profiling==4.5.1
Thanks for providing your environments!
However, I was not able to reproduce the phenomenon with either configuration, using the VAI example running locally:
I think some configuration outside TFX is outdated, which is why the logs are not displayed. Let me contact a Vertex AI team engineer internally to figure out the problem. Thank you.
ERROR: Ignored the following yanked versions: 3.0.6, 3.5.0, 3.7.0, 3.17.0, 4.0.0, 4.0.1, 4.0.2, 4.0.3, 4.0.4, 4.0.5, 4.0.7, 4.0.8, 4.0.9, 4.1.2, 4.1.6, 4.2.6, 4.2.7, 4.3.13, 4.3.16
ERROR: Ignored the following versions that require a different python version: 2.10.0 Requires-Python >=2.7,<3.0; 2.3.0 Requires-Python >=2.7,<3.0; 2.4.0 Requires-Python >=2.7,<3.0; 2.5.0 Requires-Python >=2.7,<3.0; 2.6.0 Requires-Python >=2.7,<3.0; 2.7.0 Requires-Python >=2.7,<3.0; 2.8.0 Requires-Python >=2.7,<3.0; 2.9.0 Requires-Python >=2.7,<3.0
ERROR: Could not find a version that satisfies the requirement conda==22.9.0 (from versions: none)
ERROR: No matching distribution found for conda==22.9.0
@lego0901 Hi, thank you for pursuing this. I see you cannot reproduce with the Penguin Example. I will try to reproduce with a simple pipeline to further aid problem determination...
@lego0901 Hi, I want to add that the same problem occurs for Apache Beam jobs in Dataflow. No logs are displayed. So far, except for Kubeflow, none of the pipeline types I tested (TFX, Dataflow/Beam) produce any logs.
We had the same problem when we initially migrated to Vertex from Kubeflow, but only in our production GCP project. Logs worked fine in the testing project.
It took a long investigation to discover that the cause was that the default logging bucket for the production GCP project was disabled. Enabling the logging bucket fixed the problem.
Hi @IzakMaraisTAL, where is the default logging bucket defined, and how did you enable it? In a Vertex AI pipeline, if I do not use TFX, no special setting has to be made in order to view the logs.
@crbl1122 , it might be that our issue is different from what you are seeing. We have not tested any non-TFX Vertex AI pipelines.
The following instructions are LLM generated, but I double checked them in our testing project and they seem correct:
To re-enable the default logging buckets for a Google Cloud Platform (GCP) project, you can follow these steps:
- Open the Google Cloud Console: Go to the Google Cloud Console and select the project for which you want to enable logging.
- Navigate to Logging: In the navigation menu, go to Logging under the Operations section.
- Logs Storage: Click on Logs Storage in the left-hand menu. This will show you the list of logging buckets available in your project.
- Check the Status: Ensure that the logging bucket you are interested in is not disabled. If it is disabled, you will need to enable it.
- Modify Bucket Settings:
  - Click on the logging bucket that you want to modify.
  - In the bucket details page, look for an option to Edit or Enable the bucket.
  - Make sure that the bucket is set to receive logs. You might need to adjust the settings to ensure it's capturing the appropriate log types and from the correct sources.
- Verify Configuration: After enabling the bucket, it's a good idea to verify that logs are being stored correctly. You can do this by checking for recent logs in the Logs Explorer.
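The same check can also be scripted rather than done through the console. The sketch below is not from the original instructions: it assumes you have saved the output of `gcloud logging sinks list --format=json` for the project, and that each sink entry carries a `name` and, when it is switched off, a `disabled` field set to true. The helper name and the sample data are made up for illustration.

```python
import json
from typing import List


def find_disabled_sinks(sinks_json: str) -> List[str]:
    """Return the names of log sinks marked as disabled.

    `sinks_json` is assumed to be the JSON array printed by
    `gcloud logging sinks list --format=json`.
    """
    sinks = json.loads(sinks_json)
    # The "disabled" key is assumed to be absent when a sink is active.
    return [sink["name"] for sink in sinks if sink.get("disabled", False)]


if __name__ == "__main__":
    # Made-up sample data; "my-project" is a placeholder project ID.
    sample = json.dumps([
        {
            "name": "_Required",
            "destination": "logging.googleapis.com/projects/my-project/locations/global/buckets/_Required",
        },
        {
            "name": "_Default",
            "destination": "logging.googleapis.com/projects/my-project/locations/global/buckets/_Default",
            "disabled": True,
        },
    ])
    print(find_disabled_sinks(sample))
```

If `_Default` shows up in the result for the failing project but not for a working one, that would match the cause described above.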
@IzakMaraisTAL Many thanks for the info.
Describe the current behavior
I am running TFX components in Kubeflow pipelines on GCP Vertex AI. No component logs are displayed in the Vertex interface (neither main job nor pipeline job), while only framework messages are displayed in the Logs Explorer. This happens irrespective of the component type (ExampleGen, Trainer, Transform, etc.) and forces us to debug TFX components blindly, which is very difficult. I submit the pipelines using a service account which has Logs Writer/Reader privileges.
Describe the expected behavior
Be able to view the component logs for code debugging.