Closed kpouget closed 4 months ago
This likely should go against caikit/caikit-nlp
Also, we'll get separate logs once the container split happens (this sprint)
This is the ticket for reference :)
@kpouget could you share an update on this once you try it with the new SR with split images of Caikit and TGIS?
@heyselbi, it didn't change AFAICT:
NAME                                                       READY   STATUS    RESTARTS   AGE
gpt-neox-20b-predictor-00001-deployment-79c9c4d7b8-tzc6s   4/4     Running   0          51m
modelStatus:
  copies:
    failedCopies: 0
    totalCopies: 1
  states:
    activeModelState: Loaded
    targetModelState: Loaded
  transitionStatus: UpToDate
but
$ grpcurl -insecure -d "$GRPCURL_DATA" -H "mm-model-id: gpt-neox-20b" gpt-neox-20b-predictor-memory.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443 caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
ERROR:
Code: Unknown
Message: Request failed during generation: Unexpected <class 'torch.cuda.OutOfMemoryError'>: CUDA out of memory.
Tried to allocate 14.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 5.94 MiB is free. Process 3883975 has 39.38 GiB
memory in use. Of the allocated memory 38.78 GiB is allocated by PyTorch, and 100.96 MiB is reserved by PyTorch but
unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See
documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
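For triage, the figures in a message like the one above can be pulled out programmatically. This is only a minimal sketch, assuming the message keeps the wording shown in this thread; `parse_cuda_oom` and its field names are hypothetical helpers, not part of Caikit, TGIS, or PyTorch.

```python
import re
from typing import Dict, Optional

# Matches sizes such as "14.00 MiB" or "39.39 GiB".
_SIZE = r"(\d+(?:\.\d+)?)\s*(MiB|GiB)"
_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Hypothetical field names; the patterns follow the wording of the message
# quoted in this thread ("capac\w*" tolerates the "capacty" spelling).
_FIELDS = {
    "requested_mib": r"Tried to allocate " + _SIZE,
    "total_mib": r"total capac\w* of " + _SIZE,
    "free_mib": r"of which " + _SIZE + r" is free",
    "allocated_mib": _SIZE + r" is allocated by PyTorch",
    "reserved_mib": _SIZE + r" is reserved by PyTorch",
}


def parse_cuda_oom(message: str) -> Optional[Dict[str, float]]:
    """Return the sizes (in MiB) found in a CUDA OOM message, or None."""
    # Log output is often wrapped, so collapse all whitespace first.
    text = " ".join(message.split())
    if "CUDA out of memory" not in text:
        return None
    result: Dict[str, float] = {}
    for name, pattern in _FIELDS.items():
        match = re.search(pattern, text)
        if match:
            result[name] = float(match.group(1)) * _TO_MIB[match.group(2)]
    return result
```

Comparing `requested_mib` with `free_mib` makes the failure explicit: in the message above, the 14.00 MiB request exceeds the 5.94 MiB still free on the GPU.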
Container didn't crash when it got an OOM error?
Oh, if it's just GPU memory, it probably won't, but it probably should... Hmmm... I'd say this is probably covered at least on startup by the upcoming readiness probe
Will be resolved by #156
TGIS now lives in a separate container, and following its logs should show the OOM errors.
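A rough way to follow those logs for OOM errors is to filter them for the two markers that appear in the error quoted in this thread. The sketch below assumes the log lines are already available as an iterable of strings (for example, piped in from `kubectl logs` for the TGIS container); `find_oom_lines` and `OOM_MARKERS` are hypothetical names, and other OOM wordings would not be caught.

```python
from typing import Iterable, Iterator, Tuple

# Markers taken from the error message quoted in this thread.
OOM_MARKERS = ("torch.cuda.OutOfMemoryError", "CUDA out of memory")


def find_oom_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every log line containing an OOM marker."""
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in OOM_MARKERS):
            yield number, line.rstrip("\n")
```

Passing `sys.stdin` as `lines` lets the same function work on a live log stream.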
For proper liveness/readiness probes for the TGIS container in the Caikit+TGIS setup, we'll have to wait for https://github.com/knative/serving/pull/14853.
When trying to load a model in a Pod running with a memory limit too low, the out-of-memory error message is swallowed by TGIS and hard to troubleshoot (in addition to Caikit swallowing the TGIS error):
While troubleshooting it, I observed that even the TGIS return code does not reflect the OOM error, although my attempts confirmed that not giving the Pod enough memory was the cause of the load failure: