tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.github.io/tfx/
Apache License 2.0

Running the classify_local.sh crashes the tensorflow serving container #5518

Closed: ziadloo closed this issue 1 year ago

ziadloo commented 2 years ago

System information

absl-py==1.3.0
anyio==3.6.2
apache-beam==2.43.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
astunparse==1.6.3
attrs==21.4.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cachetools==4.2.4
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
click==7.1.2
cloudpickle==2.2.0
crcmod==1.7
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.1.1
docker==4.4.4
docopt==0.6.2
entrypoints==0.4
fastavro==1.7.0
fasteners==0.18
fastjsonschema==2.16.2
flatbuffers==1.12
gast==0.4.0
google-api-core==1.32.0
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==1.35.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-cloud-aiplatform==1.19.0
google-cloud-bigquery==2.34.4
google-cloud-bigquery-storage==2.13.2
google-cloud-bigtable==1.7.3
google-cloud-core==2.3.2
google-cloud-datastore==1.15.5
google-cloud-dlp==3.9.2
google-cloud-language==1.3.2
google-cloud-pubsub==2.13.11
google-cloud-pubsublite==1.6.0
google-cloud-recommendations-ai==0.7.1
google-cloud-resource-manager==1.6.3
google-cloud-spanner==3.23.0
google-cloud-storage==2.6.0
google-cloud-videointelligence==1.16.3
google-cloud-vision==1.0.2
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.4.0
googleapis-common-protos==1.57.0
grpc-google-iam-v1==0.12.4
grpcio==1.51.0
grpcio-status==1.48.2
h5py==3.7.0
hdfs==2.7.0
httplib2==0.20.4
idna==3.4
importlib-metadata==5.0.0
importlib-resources==5.10.0
ipykernel==6.17.1
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==7.7.2
jedi==0.18.2
Jinja2==3.1.2
joblib==0.14.1
jsonschema==4.17.0
jupyter-server==1.23.3
jupyter_client==7.4.7
jupyter_core==5.0.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==1.1.1
keras==2.9.0
Keras-Preprocessing==1.1.2
keras-tuner==1.1.3
kt-legacy==1.0.4
kubernetes==12.0.1
libclang==14.0.6
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib-inline==0.1.6
mistune==2.0.4
ml-metadata==1.10.0
ml-pipelines-sdk==1.10.0
nbclassic==0.4.8
nbclient==0.7.0
nbconvert==7.2.5
nbformat==5.7.0
nest-asyncio==1.5.6
notebook==6.5.2
notebook_shim==0.2.2
numpy==1.22.4
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.5.2
opt-einsum==3.3.0
orjson==3.8.2
overrides==6.5.0
packaging==20.9
pandas==1.5.1
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
pkgutil_resolve_name==1.3.10
platformdirs==2.5.4
portpicker==1.5.2
prometheus-client==0.15.0
prompt-toolkit==3.0.33
proto-plus==1.22.1
protobuf==3.19.6
psutil==5.9.4
ptyprocess==0.7.0
pyarrow==6.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydot==1.4.2
pyfarmhash==0.3.2
Pygments==2.13.0
pymongo==3.13.0
pyparsing==3.0.9
pyrsistent==0.19.2
python-dateutil==2.8.2
pytz==2022.6
PyYAML==5.4.1
pyzmq==24.0.1
regex==2022.10.31
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.9.3
Send2Trash==1.8.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.3.2.post1
sqlparse==0.4.3
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.3
tensorflow-data-validation==1.10.0
tensorflow-estimator==2.9.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.27.0
tensorflow-metadata==1.10.0
tensorflow-model-analysis==0.41.1
tensorflow-serving-api==2.9.2
tensorflow-transform==1.10.1
termcolor==2.1.1
terminado==0.17.0
tfx==1.10.0
tfx-bsl==1.10.1
tinycss2==1.2.1
tornado==6.2
traitlets==5.5.0
typing_extensions==4.4.0
uritemplate==3.0.1
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.2
Werkzeug==2.2.2
widgetsnbextension==3.6.1
wrapt==1.14.1
zipp==3.10.0
zstandard==0.19.0

Describe the current behavior

After training a model with the command python tfx/examples/chicago_taxi_pipeline/taxi_pipeline_local.py, I can see that the saved_model.pb file is created. I then try to test the model by sending an inference request. The serving container starts successfully, but as soon as I request an inference, it crashes with the following error message:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/usr/bin/tf_serving_entrypoint.sh: line 3:     7 Aborted                 (core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"
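
For context, classify_local.sh invokes chicago_taxi_client.py, which sends a gRPC ClassificationRequest to the server. A minimal sketch of such a request, assuming the model is served under the name chicago_taxi and using a made-up feature name, looks like this:

import grpc
from tensorflow_serving.apis import classification_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Connect to the gRPC port exposed by the serving container.
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = classification_pb2.ClassificationRequest()
request.model_spec.name = 'chicago_taxi'  # assumed model name
request.model_spec.signature_name = 'serving_default'

# The payload is a list of tf.train.Example protos.
example = request.input.example_list.examples.add()
example.features.feature['trip_miles'].float_list.value.append(1.2)  # hypothetical feature

response = stub.Classify(request, 15.0)  # 15-second timeout
print(response)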

Describe the expected behavior

I don't know whether the bug is in the training pipeline, the serving container, or the client code sending the request, but the container should return a response instead of crashing.

Standalone code to reproduce the issue

1. Set up an environment with Python 3.8.15 and the packages listed above.

2. Create the folders:

mkdir -p ~/taxi/data/simple
mkdir -p ~/tfx

3. Download the dataset:

wget -O ~/taxi/data/simple/data.csv \
https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv?raw=true

4. Clone the repo:

git clone https://github.com/tensorflow/tfx.git repo

5. Run the training script:

python ~/repo/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_local.py

6. Make the serving script executable:

chmod +x ~/repo/tfx/examples/chicago_taxi_pipeline/serving/start_model_server_local.sh

7. Run the serving container:

~/repo/tfx/examples/chicago_taxi_pipeline/serving/start_model_server_local.sh \
~/taxi/serving_model/chicago_taxi_beam 
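
Once the container is up, a quick way to confirm the model loaded is to query TensorFlow Serving's REST status endpoint. The model name chicago_taxi below is an assumption; use whatever name start_model_server_local.sh passes to the container:

import requests

# TensorFlow Serving reports model status at /v1/models/<model_name>.
status = requests.get('http://localhost:8501/v1/models/chicago_taxi')
print(status.json())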

8. As mentioned in issue #3563, the file tfx/examples/chicago_taxi_pipeline/serving/chicago_taxi_client.py needs to be edited before we can proceed. Currently, line 185 reads:

  return parser.parse_args(argv)

while it should be:

  return parser.parse_args(argv[1:])
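
The reason is that sys.argv[0] is the program name, which argparse does not expect in the list it parses. A tiny illustration (the --num_examples flag is hypothetical):

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--num_examples', type=int, default=3)

# sys.argv == ['chicago_taxi_client.py', '--num_examples', '3'], so the
# program name must be stripped before parsing:
args = parser.parse_args(sys.argv[1:])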

9. Make the bash script file executable:

chmod +x ~/repo/tfx/examples/chicago_taxi_pipeline/serving/classify_local.sh

10. Run the inference bash script:

~/repo/tfx/examples/chicago_taxi_pipeline/serving/classify_local.sh \
~/taxi/data/simple/data.csv \
~/tfx/pipelines/chicago_taxi_beam/SchemaGen/schema/5/schema.pbtxt

Note that the location of the schema.pbtxt file on your machine may differ: 5 is an auto-generated number, and each run of the training pipeline creates a new folder.
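
A small helper to locate the newest schema directory (illustrative only; the pipeline root below is an assumption based on the paths used in this report):

import glob
import os

# SchemaGen writes each pipeline run to a new auto-numbered directory.
schema_dirs = glob.glob(os.path.expanduser(
    '~/tfx/pipelines/chicago_taxi_beam/SchemaGen/schema/*'))
latest = max(schema_dirs, key=lambda d: int(os.path.basename(d)))
print(os.path.join(latest, 'schema.pbtxt'))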

Once you send the request, the server crashes, and the container exits.
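
When debugging a crash like this, it can also help to inspect what the exported SavedModel actually expects as input. A minimal sketch, assuming the export lives in an auto-numbered directory under the serving model path:

import tensorflow as tf

# Load the exported model and print its serving signature.
model = tf.saved_model.load(
    '/path/to/taxi/serving_model/chicago_taxi_beam/1669000000')  # hypothetical path
sig = model.signatures['serving_default']
print(sig.structured_input_signature)
print(sig.structured_outputs)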

ziadloo commented 2 years ago

I can also add that training the model with the Keras script results in the same issue. All you need to do is train the model with:

python ~/repo/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_native_keras.py

But be aware that I had to change a line of code in that script to make it work. Line 48 originally reads:

_module_file = os.path.join(_taxi_root, 'taxi_utils_native_keras.py')

And I had to change it to:

_module_file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'taxi_utils_native_keras.py')

Perhaps I can turn this into a PR. If someone could help me understand why these are failing, I could fix them all in a single PR. That would be great.

ziadloo commented 2 years ago

I managed to get a prediction response successfully by manually editing the CSV file. The CSV file is used to compose the request payload (the data that is fed into the inference engine). The tfx/examples/chicago_taxi_pipeline/serving/chicago_taxi_client.py script, which constructs the request, takes the first 3 records of the CSV file and encodes them into a protocol buffer.

So far, so good. The problem is that the first 3 records have some fields missing, which produces a protocol-buffer payload with missing features. Apparently, TensorFlow Serving does not tolerate missing features and crashes. In my opinion this needs to be fixed on the TensorFlow Serving side, but that does not mean this project is off the hook.
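
To make the failure mode concrete, here is a simplified sketch of how a CSV row with empty cells can turn into a tf.train.Example with empty features. The real client uses the schema to pick int/float/bytes types; this sketch treats every value as bytes purely for illustration:

import csv
import tensorflow as tf

def row_to_example(row):
    # Encode one CSV row as a tf.train.Example. An empty cell still
    # yields a feature entry, just one that carries no values.
    feature = {}
    for name, value in row.items():
        if value == '':
            feature[name] = tf.train.Feature()  # empty feature
        else:
            feature[name] = tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[value.encode('utf-8')]))
    return tf.train.Example(features=tf.train.Features(feature=feature))

with open('data.csv') as f:
    reader = csv.DictReader(f)
    examples = [row_to_example(row) for _, row in zip(range(3), reader)]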

The choice of this particular dataset for training and testing the example is something that needs to be revisited here; a working example is a must, don't you think?

Bottom line:

The inference request crashes the server because the payload contains missing features.

gaikwadrahul8 commented 2 years ago

Hi, @ziadloo

Apologies for the delay, and good to hear that you were able to solve your issue with TensorFlow Serving. We will certainly look into it and make sure our examples are up to date and working. Thank you for your valuable suggestions; we really appreciate your time and effort.

If you're looking to explore TFX end-to-end pipelines further, there is a good example here; for TensorFlow Serving, see here, and you can refer to the official documentation here for more serving options.

If your issue has been resolved, could you please close it? If you need any further assistance, please let us know.

Thank You!

gaikwadrahul8 commented 1 year ago

Hi, @ziadloo

Closing this issue due to a lack of activity for a couple of weeks. Please feel free to reopen it if you need any further assistance or updates.

Thank you!

google-ml-butler[bot] commented 1 year ago

Are you satisfied with the resolution of your issue?