pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Model missing error - KServe - PyTorch #3215

Open Csehpi opened 3 weeks ago

Csehpi commented 3 weeks ago

🐛 Describe the bug

Hello,

I would like to ask for your help. I am using KServe and would like to deploy a PyTorch model with it.

My problem is that I am getting model missing errors even though the model store is defined as a command-line argument:

args:
  - torchserve
  - '--start'
  - '--model-store=/mnt/models/pytorch/model-store'
  - '--ts-config=/mnt/models/pytorch/config/config.properties'

The log says:

Model Store: /mnt/models/pytorch/model-store

But later it tries to use a different path (the pytorch segment is missing from the path; the default one is used instead):

INFO:root:Copying contents of /mnt/models/model-store to local

The docs say: --model-store overrides the model_store property in the config.properties file.
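
From the log it looks like the KServe wrapper resolves the model store on its own, by parsing config.properties and falling back to a built-in default, so the --model-store flag passed to torchserve never reaches it. Below is a minimal sketch of what that resolution might look like; this is not the actual kserve_wrapper code, resolve_model_store is a hypothetical name, and the default path is only inferred from the log output:

# Sketch only: hypothetical reconstruction of the wrapper's model-store
# resolution, inferred from the log lines above (not actual kserve code).
DEFAULT_MODEL_STORE = "/mnt/models/model-store"  # default path seen in the log

def resolve_model_store(config_path):
    """Read model_store from a config.properties file, else fall back to the default."""
    properties = {}
    with open(config_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                properties[key.strip()] = value.strip()
    return properties.get("model_store", DEFAULT_MODEL_STORE)

# Prints /mnt/models/model-store whenever model_store is absent from the
# mounted config, regardless of what --model-store was set to on the CLI.
print(resolve_model_store("/mnt/models/pytorch/config/config.properties"))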

Thanks, Peter

Error logs

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-07-01T07:56:21,363 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-07-01T07:56:21,365 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-07-01T07:56:21,404 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2024-07-01T07:56:21,456 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.11.0
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 3
Max heap size: 1536 M
Python executable: /home/venv/bin/python
Config file: /mnt/models/pytorch/config/config.properties
Inference address: http://0.0.0.0:8085
Management address: http://0.0.0.0:8085
Metrics address: http://0.0.0.0:8082
Model Store: /mnt/models/pytorch/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 4
Netty client threads: 0
Default workers per model: 3
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /mnt/models/pytorch/model-store
CPP log config: N/A
Model config: N/A
System metrics command: default
2024-07-01T07:56:21,462 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-07-01T07:56:21,475 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {"name":"startup.cfg","modelCount":1,"models":{"fashionmnist":{"1.0":{"defaultVersion":true,"marName":"fashionmnist.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"responseTimeout":120}}}}
2024-07-01T07:56:21,481 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot startup.cfg
2024-07-01T07:56:21,482 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot startup.cfg validated successfully
INFO:root:Wrapper: loading configuration from /mnt/models/pytorch/config/config.properties
INFO:root:Wrapper : Model names dict_keys(['fashionmnist']), inference address http://0.0.0.0:8085, management address http://0.0.0.0:8085, grpc_inference_address, 0.0.0.0:7070, model store /mnt/models/model-store
INFO:root:Predict URL set to 0.0.0.0:8085
INFO:root:Explain URL set to 0.0.0.0:8085
INFO:root:Protocol version is v1
INFO:root:Copying contents of /mnt/models/model-store to local
Traceback (most recent call last):
  File "/home/model-server/kserve_wrapper/__main__.py", line 117, in <module>
    model.load()
  File "/home/model-server/kserve_wrapper/TorchserveModel.py", line 159, in load
    raise ModelMissingError(model_path)
kserve.errors.ModelMissingError: <exception str() failed>
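
For context, the traceback suggests that load() raises ModelMissingError when no model archive is found at the resolved path. A rough sketch of that kind of check (hypothetical, not the actual TorchserveModel.py code; the exception class here is a stand-in for kserve.errors.ModelMissingError):

from pathlib import Path

class ModelMissingError(Exception):
    """Stand-in for kserve.errors.ModelMissingError (sketch only)."""
    def __init__(self, path):
        self.path = path

def load(model_store):
    # If the resolved model store contains no .mar archives, fail early.
    mar_files = list(Path(model_store).glob("*.mar"))
    if not mar_files:
        raise ModelMissingError(Path(model_store))

# With the wrapper's default path this raises, because the archives
# actually live under /mnt/models/pytorch/model-store.
load("/mnt/models/model-store")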

Installation instructions

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  ...
spec:
  predictor:
    containers:
      - args:
          - torchserve
          - '--start'
          - '--model-store=/mnt/models/pytorch/model-store'
          - '--ts-config=/mnt/models/pytorch/config/config.properties'
        env:
          - name: CONFIG_PATH
            value: /mnt/models/pytorch/config/config.properties
        image: pytorch/torchserve-kfs:0.11.0
        imagePullPolicy: Always
        name: kserve-container
        resources:
          limits:
            cpu: '3'
            memory: '6442450944'
          requests:
            cpu: '3'
            memory: '6442450944'
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
          - mountPath: /mnt/models
            name: kserve-provision-location
            readOnly: true
    imagePullSecrets:
      - name: ...
    initContainers:
      ...
    volumes:
      - emptyDir: {}
        name: kserve-provision-location

Model Packaging

pytorch (main folder)
|--config
|----config.properties
|--model-store
|----xyz.mar

config.properties

inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
grpc_inference_port=7070
grpc_management_port=7071
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
job_queue_size=10
enable_envvars_config=true
install_py_dep_per_model=true
model_store=...
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"fashionmnist":{"1.0":{"defaultVersion":true,"marName":"fashionmnist.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"responseTimeout":120}}}}
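
To double-check which path actually contains the archive inside the container, a quick diagnostic like the following can be run (the two candidate paths are taken from the report above; this is just a sketch, not part of any deployment):

from pathlib import Path

# Candidate model-store paths from the report: the one passed on the
# CLI and the one the wrapper actually used.
candidates = [
    "/mnt/models/pytorch/model-store",  # passed via --model-store
    "/mnt/models/model-store",          # used by the wrapper (default)
]

for path in candidates:
    p = Path(path)
    mars = sorted(m.name for m in p.glob("*.mar")) if p.is_dir() else []
    print(f"{path}: exists={p.is_dir()}, mar files={mars}")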

Versions

PyTorch image: pytorch/torchserve-kfs:0.11.0
KServe: 0.12.1

Repro instructions

Deployed as a KServe InferenceService

Possible Solution

No response

glovass commented 2 weeks ago

We are facing the same issue. Have you found a solution for it?

Csehpi commented 2 weeks ago

Unfortunately no, I am still waiting for some help or guidance here.