Closed bjascob closed 2 years ago
Hi, thanks for the bug report!
TL;DR: `to` and `pin_memory` are not implemented yet.
By default, when a method is not defined on the `LazyTensor`, it looks up a dictionary to see if there's a pre-defined shape function, so that it knows the memory usage of that particular operation. If the operation is not found, the method/function is evaluated immediately, just to be safe. If that happens before the final `loss.backward()`, then unfortunately koila doesn't prevent the OOM from happening. And because it doesn't yet handle moving across devices after a tensor is created, the `to` method is not defined.
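To illustrate the dispatch strategy described above, here is a minimal, framework-free sketch (all names are made up for illustration; this is not koila's actual code): a lazy wrapper keeps a table of known shape functions, ops with a predictable output shape stay lazy, and anything else falls back to eager evaluation.

```python
# Hypothetical sketch of a shape-function lookup with an eager fallback.
# Ops in the table stay lazy because their memory usage can be predicted;
# unknown ops are evaluated immediately "just to be safe", which is why
# an OOM can still happen before loss.backward().

SHAPE_FUNCTIONS = {
    "relu": lambda shape: shape,                    # elementwise: shape unchanged
    "sum": lambda shape: (),                        # full reduction: scalar
    "transpose": lambda shape: tuple(reversed(shape)),
}

class LazyResult:
    def __init__(self, name, shape, eager):
        self.name = name
        self.shape = shape   # known ahead of time only if a shape fn exists
        self.eager = eager   # True means we fell back to eager evaluation

def apply_op(name, input_shape):
    shape_fn = SHAPE_FUNCTIONS.get(name)
    if shape_fn is not None:
        # Known op: record it lazily; output memory usage is predictable.
        return LazyResult(name, shape_fn(input_shape), eager=False)
    # Unknown op: no shape function, so evaluate eagerly on the spot.
    return LazyResult(name, None, eager=True)
```

The real library would dispatch on actual PyTorch operations, but the lookup-then-fallback shape is the same.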
I'm working on making the API compatible with PyTorch as soon as I can, but PyTorch supports a ton of operations, and sadly there is only so much time. In the meantime, I would advise against using it in a production environment without thorough testing.
Just thought I'd give it a try. Sounds like a nice library once more functions get implemented.
I tried this with a HuggingFace transformers model and set my batch size artificially large. Initially I saw the following before the OOM error.
I changed the option `dataloader_pin_memory = False` and got a little farther. This was resolved by moving the data to the GPU (calling `.to('cuda:0')`) in the collator (this is done in the model). The next error was..
This one I'm not sure how to resolve, and I'm not certain that "Evaluating eagerly" is even the issue. However, right after the first of those debug statements I see the OOM error. Any advice?
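For reference, the collator workaround mentioned above (disabling `dataloader_pin_memory` in the HuggingFace `TrainingArguments` and moving batches to the GPU inside the collator) can be sketched with a small helper like this. It's a generic recursive mover, not code from either library; it relies only on tensors exposing a `.to(device)` method, as `torch.Tensor` does:

```python
def move_to_device(batch, device):
    """Recursively move any .to()-capable objects in a (possibly nested)
    batch to `device`; leave plain values (ints, strings, ...) alone."""
    if hasattr(batch, "to"):          # torch.Tensor exposes .to(device)
        return batch.to(device)
    if isinstance(batch, dict):
        return {k: move_to_device(v, device) for k, v in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_to_device(v, device) for v in batch)
    return batch
```

A collator would then return `move_to_device(collated, 'cuda:0')` instead of the CPU batch, so pinned memory is never involved.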