rentruewang / koila

Prevent PyTorch's `CUDA error: out of memory` in just 1 line of code.
https://rentruewang.com/koila/
Apache License 2.0
1.82k stars, 62 forks

Issues with "No custom methods found. Evaluating eagerly." #9

Closed bjascob closed 2 years ago

bjascob commented 2 years ago

I tried this with a HuggingFace transformers model and set my batch size artificially large. Initially I saw the following just before the OOM error.

DEBUG    __getattr__ called for pin_memory. Automatically resolving function. 
DEBUG    No custom methods found. Evaluating eagerly.  

I set the option `dataloader_pin_memory = False` and got a little farther.

DEBUG    __getattr__ called for to. Automatically resolving function.
DEBUG    No custom methods found. Evaluating eagerly.

This was resolved by moving the data to the GPU (calling `.to('cuda:0')`) in the collator (this is done in the model). The next error was:

DEBUG    __getattr__ called for float. Automatically resolving function
DEBUG    No custom methods found. Evaluating eagerly.

This one I'm not sure how to resolve, and I'm not certain that "Evaluating eagerly" is even the issue. However, I see the OOM error right after the first of those debug statements. Any advice?

rentruewang commented 2 years ago

Hi, thanks for the bug report!

TL;DR: `to` and `pin_memory` are not implemented yet.

By default, when a method is not defined on the `LazyTensor`, koila looks it up in a dictionary of pre-defined shape functions, which tell it the memory usage of that particular operation. If the operation is not found, the method/function is evaluated immediately just to be safe. If that happens before the final `loss.backward`, then unfortunately koila cannot prevent the OOM from happening. And because koila doesn't yet handle moving a tensor across devices after it's created, the `to` method is not defined.
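To make the fallback concrete, here is a minimal sketch of the lookup-then-evaluate-eagerly pattern described above. It is not koila's actual code: `LazySketch`, `EagerValue`, and `SHAPE_FUNCTIONS` are hypothetical names invented for illustration, and the "tensor" is a plain Python stand-in so the example runs without PyTorch.

```python
import logging
from math import prod
from typing import Callable, Dict, Optional, Tuple

logger = logging.getLogger("lazy_sketch")
Shape = Tuple[int, ...]

# Hypothetical registry: method name -> function mapping input shape to output shape.
SHAPE_FUNCTIONS: Dict[str, Callable[[Shape], Shape]] = {
    "relu": lambda shape: shape,
    "flatten": lambda shape: (prod(shape),),
}


class EagerValue:
    """Stand-in for a real tensor; a real one would allocate memory here."""

    def __init__(self, shape: Shape) -> None:
        self.shape = shape

    def relu(self) -> "EagerValue":
        return EagerValue(self.shape)

    def flatten(self) -> "EagerValue":
        return EagerValue((prod(self.shape),))

    def float(self) -> "EagerValue":
        return EagerValue(self.shape)


class LazySketch:
    """Records a shape and a deferred computation instead of real data."""

    def __init__(self, shape: Shape, thunk: Optional[Callable[[], EagerValue]] = None) -> None:
        self.shape = shape
        self._thunk = thunk or (lambda: EagerValue(shape))
        self.evaluated = False

    def run(self) -> EagerValue:
        """Materialize the value; this is the point where memory would be used."""
        self.evaluated = True
        return self._thunk()

    def __getattr__(self, name: str):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        logger.debug("__getattr__ called for %s. Automatically resolving function.", name)
        shape_fn = SHAPE_FUNCTIONS.get(name)
        if shape_fn is not None:
            # Known operation: compute only the output shape and stay lazy.
            def deferred() -> "LazySketch":
                return LazySketch(shape_fn(self.shape), lambda: getattr(self.run(), name)())
            return deferred
        # Unknown operation: materialize now and hand back the real method.
        logger.debug("No custom methods found. Evaluating eagerly.")
        return getattr(self.run(), name)


x = LazySketch((4, 8))
y = x.relu()   # registered: stays lazy, only the shape is tracked
z = x.float()  # not registered: x is materialized immediately
```

In this sketch, `float` plays the role of the unregistered method from the logs above: the wrapper has no shape function for it, so the value is materialized before any memory planning could take place.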

I'm working on making the API compatible with PyTorch as soon as I can, but PyTorch supports a ton of operations, and sadly there is only so much time. In the meantime, I would advise against using it in a production environment without thorough testing.

bjascob commented 2 years ago

Just thought I'd give it a try. Sounds like a nice library once more functions get implemented.