monk1337 opened 5 months ago
This is something we're working on closely with the EleutherAI team and hope to provide soon. For now, if you have enough RAM (and patience) you can try running on CPU, though this will likely take a looooong time. You can also try using the accelerate library by following the instructions here: https://github.com/EleutherAI/lm-evaluation-harness#multi-gpu-evaluation-with-hugging-face-accelerate.
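Following that README, the accelerate path looks roughly like the command below. The model path and task name are placeholders, and the exact flags may differ across harness versions, so check the linked instructions for your installed release:

```shell
# Launch lm-evaluation-harness across all visible GPUs via accelerate.
# The pretrained= path and --tasks value are placeholders, not real paths.
accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=/path/to/hf-converted-model \
    --tasks hellaswag \
    --batch_size 8
```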
Stay tuned for a torchtune native multi-GPU evaluation feature soon!
Awesome, and thank you for the reply! I am excited about the new feature, but in the meantime, I want to try the native LM harness. However, to do that, I need to convert TorchTune weights into HF weights. I am having issues with the conversion for the 70B model, so I have opened another issue for that. Please take a look when you have a chance. https://github.com/pytorch/torchtune/issues/922
I am a heavy user of Axolotl and TRL but am now switching to TorchTune. I anticipate encountering some bugs during this transition, so I will be opening issues as I come across them. :) Additionally, I would be happy to contribute in any way that I can.
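On the conversion point: at a high level, converting a torchtune checkpoint to HF format is mostly a state-dict key renaming (plus some tensor reshaping for attention weights). The snippet below is a hypothetical, heavily trimmed sketch of that idea; the key names are illustrative only, and the authoritative mapping lives in torchtune's own weight-conversion utilities:

```python
# Hypothetical, trimmed illustration of torchtune -> HF key renaming.
# These example keys are assumptions for illustration; consult torchtune's
# conversion utilities for the real, complete mapping.
TUNE_TO_HF = {
    "tok_embeddings.weight": "model.embed_tokens.weight",
    "norm.scale": "model.norm.weight",
    "output.weight": "lm_head.weight",
}

def remap_keys(state_dict: dict) -> dict:
    """Rename checkpoint keys from torchtune naming to HF naming."""
    return {TUNE_TO_HF.get(key, key): value for key, value in state_dict.items()}
```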
An approach to multi-GPU eval via EleutherAI was proposed in #951.
I am trying to evaluate the finetuned 70B model with torchrun and am getting an error.

Here is my config file.

When running with this command:

tune run eleuther_eval --config evalconfig.yml

I get this error:
""" File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/parameter.py", line 59, in deepcopy result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 78, in __torch_function__ return func(*args, **kwargs) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU"""
When I instead try

tune run --nproc_per_node 8 eleuther_eval --config evalconfig.yml

it gives a different error:

tune run: error: Recipe eleuther_eval does not support distributed training. Please run without torchrun commands.
How to evaluate large models with torchtune?