Open shub-kris opened 7 months ago
seems like it crashed in https://github.com/pytorch/xla/blob/cb4983e93d70319db56440872567e2dc98d0ce1f/torch_xla/csrc/tensor_methods.cpp#L354-L370 ...
@will-cromar can you take a look?
@alanwaketan can you please also have a look here?
@alanwaketan do you normally use the HuggingFace `Trainer`? I remember people have had issues using it with XLA before. I ran through two of the example tutorials last week while working on #6584, and the `Trainer`-based examples had issues on TPU, but the `accelerate`-based examples did work fine.

I tried to reproduce your crash on v4-8 with `torch` and `torch_xla` built from head and got a different crash: `RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 32.45M of 16.00M vmem. Exceeded vmem capacity by 16.45M.`
I do believe the normal `torch.save` should be compatible with FSDP. cc @jonb377, who is our ckpt expert.
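For reference, a minimal sketch of what FSDP-compatible saving typically looks like with torch_xla's wrapper, following the pattern from torch_xla's FSDP README. The checkpoint path, prefix, and suffix below are illustrative, not taken from this thread:

```python
# Sketch: per-rank sharded save via xm.save, then consolidation into one file.
# Assumes `model` is wrapped in torch_xla's XlaFullyShardedDataParallel.
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import consolidate_sharded_model_checkpoints

def save_fsdp_checkpoint(model, optimizer, prefix="/tmp/ckpt"):
    rank, world = xm.get_ordinal(), xm.xrt_world_size()
    ckpt = {
        "model": model.state_dict(),
        "shard_metadata": model.get_shard_metadata(),  # needed to re-merge shards
        "optimizer": optimizer.state_dict(),
    }
    # master_only=False so that every rank writes its own shard to disk.
    xm.save(ckpt, f"{prefix}_rank-{rank}-of-{world}.pth", master_only=False)
    xm.rendezvous("ckpt_saved")  # wait until all shards are written
    if rank == 0:
        # Merge the per-rank shards back into a single full state dict.
        consolidate_sharded_model_checkpoints(
            ckpt_prefix=prefix, ckpt_suffix="_rank-*-of-*.pth")
```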
> @alanwaketan do you normally use the HuggingFace `Trainer`? I remember people have had issues using it with XLA before. I ran through two of the example tutorials last week, and the `Trainer`-based ones had issues on TPU, but the `accelerate`-based examples and scripts did work fine.
>
> I tried to reproduce your crash on v4-8 with `torch` and `torch_xla` built from head and got a different crash: `RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 32.45M of 16.00M vmem. Exceeded vmem capacity by 16.45M.`
Yeah, I do. All the Llama and Gemma work was done with the HF Trainer, but I don't recall hitting this issue before.
Okay, I just scanned through the script, and it looks like it has nothing to do with SPMD, @jonb377. It's probably just simple DP… I have no idea why this crashes, but we probably won't be able to spend too much time debugging it given that mp is about to be deprecated.
It also crashes the same way with Phi-2, and even SD, tested on the TPU v4-8 :(
Do you use DP or FSDP?
Hi @alanwaketan, I think it is highly related to the HF `accelerate` lib; I will continue verifying.
> @alanwaketan do you normally use the HuggingFace `Trainer`? I remember people have had issues using it with XLA before. I ran through two of the example tutorials last week while working on #6584, and the `Trainer`-based examples had issues on TPU, but the `accelerate`-based examples did work fine.
>
> I tried to reproduce your crash on v4-8 with `torch` and `torch_xla` built from head and got a different crash: `RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 32.45M of 16.00M vmem. Exceeded vmem capacity by 16.45M.`
Hi. I encountered the exact same issue as you did; even the vmem numbers are exactly the same, and I tested different LLMs with `generate()`, all hitting the same issue. Have you found a way to solve it?
Hello @shub-kris, I encountered a similar issue and have fixed it in https://github.com/huggingface/transformers/pull/31264. Could you check if your issue has been resolved?
🐛 Bug
```
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
root@t1v-n-108b165f-w-0:/workspace# /usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
To Reproduce
Create and SSH into a Google Cloud TPU VM:
Install the packages
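The exact package versions from the report aren't shown here; an illustrative TPU install (versions are assumptions) would be something like:

```bash
# Illustrative only: versions are assumptions, not the reporter's exact pins.
pip install torch~=2.2.0 'torch_xla[tpu]~=2.2.0' \
    -f https://storage.googleapis.com/libtpu-releases/index.html
pip install transformers datasets accelerate
```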
Run the `test-transformers-trainer.py` script with:
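The exact command is elided above; a typical invocation on a TPU VM (the environment variable is an assumption) would be along these lines:

```bash
# PJRT_DEVICE=TPU selects the TPU backend in recent torch_xla releases.
PJRT_DEVICE=TPU python test-transformers-trainer.py
```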
Entire Stack Trace
Expected behavior
The code should save the checkpoints successfully.
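Concretely, a run configured like the following sketch should produce `checkpoint-N` directories without crashing; all argument values here are illustrative, not from the report:

```python
# Sketch of the expected flow; model/dataset setup omitted, values illustrative.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/ckpts",
    save_strategy="steps",
    save_steps=50,  # a checkpoint directory should appear every 50 steps
    per_device_train_batch_size=8,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds)
# trainer.train()  # expected: /tmp/ckpts/checkpoint-50, -100, ... written cleanly
```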
Environment