opendatahub-io / caikit-tgis-serving


Caikit/TGIS swallows out-of-memory errors during model loading #92

Closed kpouget closed 4 months ago

kpouget commented 9 months ago

When trying to load a model in a Pod whose memory limit is too low, the out-of-memory error message is swallowed by TGIS and hard to troubleshoot (and Caikit in turn swallows the TGIS error):

2023-09-26T09:40:45.259993Z  INFO text_generation_launcher: Starting shard 0
Shard 0: supports_causal_lm = False, supports_seq2seq_lm = True
2023-09-26T09:40:55.279072Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-09-26T09:40:57.571196Z ERROR text_generation_launcher: Shard 0 failed to start:

2023-09-26T09:40:57.571219Z  INFO text_generation_launcher: Shutting down shards
{"channel": "TGISPROC", "exception": null, "level": "error", "log_code": "<MTS11752287E>", "message": "exception raised: RuntimeError('TGIS failed to boot up with the model. See logs for details')", "num_indent": 0, "thread_id": 140590947739392, "timestamp": "2023-09-26T09:40:59.288074"}

While troubleshooting it, I observed that even the TGIS return code does not reflect the OOM error, although my attempts confirmed that not giving enough memory was the cause of the load failure:

sh-4.4$ text-generation-launcher --num-shard 1 --model-name /mnt/models/flan-t5-large/artifacts/ --port 3000;
2023-09-26T11:42:33.150862Z  INFO text_generation_launcher: Launcher args: Args { model_name: "/mnt/models/flan-t5-large/artifacts/", revision: None, deployment_framework: "hf_transformers", dtype: None, dtype_str: Some("float16"), num_shard: Some(1), max_concurrent_requests: 150, max_sequence_length: 4096, max_new_tokens: 1024, max_batch_size: 256, max_batch_weight: Some(47458400), max_prefill_weight: None, max_waiting_tokens: 24, port: 3000, grpc_port: 8033, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, json_output: false, tls_cert_path: None, tls_key_path: None, tls_client_ca_cert_path: None, output_special_tokens: false, cuda_process_memory_fraction: 1.0 }
2023-09-26T11:42:33.151097Z  INFO text_generation_launcher: Starting shard 0
Shard 0: supports_causal_lm = False, supports_seq2seq_lm = True
2023-09-26T11:42:43.180572Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-09-26T11:42:50.384697Z ERROR text_generation_launcher: Shard 0 failed to start:

2023-09-26T11:42:50.384723Z  INFO text_generation_launcher: Shutting down shards
sh-4.4$ echo $?
1
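
One way to confirm the memory-limit hypothesis from inside the running container is to compare the cgroup limit with the OOM-kill counter. A minimal sketch, assuming cgroup v2 (the v1 counterparts live under /sys/fs/cgroup/memory/):

sh-4.4$ cat /sys/fs/cgroup/memory.max      # effective memory limit for the container, in bytes
sh-4.4$ cat /sys/fs/cgroup/memory.events   # the oom_kill counter increments each time a process in the cgroup is killed
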
Xaenalt commented 9 months ago

This likely should go against caikit/caikit-nlp

Xaenalt commented 9 months ago

Also, we'll get separate logs once the container split happens (this sprint).

danielezonca commented 9 months ago

This is the ticket for reference :)

heyselbi commented 7 months ago

@kpouget could you share an update on this once you try it with the new ServingRuntime (SR) that uses split Caikit and TGIS images?

kpouget commented 7 months ago

@heyselbi, it didn't change AFAICT:

NAME                                                       READY   STATUS    RESTARTS   AGE
gpt-neox-20b-predictor-00001-deployment-79c9c4d7b8-tzc6s   4/4     Running   0          51m
  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Loaded
    transitionStatus: UpToDate
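
The modelStatus above can be read straight from the InferenceService status, e.g. (assuming the resource is named gpt-neox-20b):

kubectl get isvc gpt-neox-20b -o jsonpath='{.status.modelStatus}'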

but

$ grpcurl    -insecure    -d "$GRPCURL_DATA"    -H "mm-model-id: gpt-neox-20b"    gpt-neox-20b-predictor-memory.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443    caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
ERROR:
  Code: Unknown
  Message: Request failed during generation: Unexpected <class 'torch.cuda.OutOfMemoryError'>: CUDA out of memory. 
Tried to allocate 14.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 5.94 MiB is free. Process 3883975 has 39.38 GiB 
memory in use. Of the allocated memory 38.78 GiB is allocated by PyTorch, and 100.96 MiB is reserved by PyTorch but 
unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See 
documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
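
This second failure is GPU memory rather than the pod's memory limit, so the container keeps running. One way to confirm the GPU is indeed exhausted is to run nvidia-smi inside the serving pod; the container name below is an assumption, check the pod spec for the actual one:

kubectl exec gpt-neox-20b-predictor-00001-deployment-79c9c4d7b8-tzc6s -c kserve-container -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv
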
Xaenalt commented 7 months ago

Container didn't crash when it got an OOM error?

Xaenalt commented 7 months ago

Oh, if it's just GPU memory, it probably won't, but it probably should... Hmmm... I'd say this is probably covered at least on startup by the upcoming readiness probe

Xaenalt commented 7 months ago

Will be resolved by #156

dtrifiro commented 4 months ago

TGIS now lives in a separate container, and following its logs should show the OOM errors.
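
A sketch of how to find and follow those logs; <predictor-pod> is a placeholder and the TGIS container name depends on the ServingRuntime definition:

# List the containers in the predictor pod
kubectl get pod <predictor-pod> -o jsonpath='{.spec.containers[*].name}'

# Follow the TGIS container's logs, substituting the name found above
kubectl logs -f <predictor-pod> -c <tgis-container>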

For proper liveness/readiness probes for the TGIS container in the caikit+tgis setup, we'll have to wait for https://github.com/knative/serving/pull/14853.