Closed: BedirT closed this issue 4 months ago.
@BedirT thanks so much for sharing this! Glad to hear the perf matches up with some of the other libraries you use.
What is the context length for the VRAM numbers reported in the README?
The README table reports numbers from the default configs, which currently train on the Alpaca dataset. Those sequences will be substantially shorter than the context lengths you're training on.
I set batch_size=1 and ran lora_finetune_single_device with dummy tensors of shape [1, 2048] and [1, 4096]. Both of these seem to run under 24GB peak memory, with 4096 right at ~24GB (caveat: I'm simulating on an A100; I can test on a 4090 in a bit). You're absolutely right that this will OOM on seq len 8192. I think we need to do some work on supporting large sequence lengths. Let me take a look at this and get back to you.
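If it helps anyone reproduce the rough check, here's a minimal sketch of that kind of peak-memory measurement. The function name and toy model below are hypothetical placeholders, not the actual recipe; the real numbers came from running lora_finetune_single_device with batch_size=1.

```python
import torch

# Sketch of a peak-memory check with dummy [1, seq_len] token batches.
# The toy model is a stand-in for the real Llama 3 8B + LoRA setup.
def peak_memory_gib(model, seq_len, vocab_size=128_256, device="cuda"):
    torch.cuda.reset_peak_memory_stats(device)
    tokens = torch.randint(0, vocab_size, (1, seq_len), device=device)  # batch_size=1
    loss = model(tokens).float().mean()   # dummy loss just to trigger a backward pass
    loss.backward()
    return torch.cuda.max_memory_allocated(device) / 1024**3

if torch.cuda.is_available():
    toy = torch.nn.Sequential(
        torch.nn.Embedding(128_256, 1024),  # placeholder for the real transformer
        torch.nn.Linear(1024, 128_256),
    ).cuda()
    for seq_len in (2048, 4096):
        print(seq_len, f"{peak_memory_gib(toy, seq_len):.2f} GiB peak")
```

Swapping in the real model and recipe changes the absolute numbers, of course; this just shows the measurement pattern.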
It would be great to include the context length alongside the VRAM numbers in the README for reference.
@HaisongDing @BedirT, I see that we updated the README to include the context length. If you still have questions, please feel free to reopen this issue. Thanks!! :)
I am testing TorchTune with some settings I've previously used to train my models. My go-to single-device library was unsloth, as it provides great memory and time savings.
Based on my Llama 3 8B comparisons, the fine-tuning speed looks very comparable. However, unlike unsloth, I am getting an OOM when trying larger context sizes. I am using the default LoRA recipe on a single device in a Docker container.
Do you think there could be something I did wrong, or is this expected? What is the context length for the VRAM numbers reported in the README?
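For reference, here's my rough back-of-envelope of how activation/logit memory could grow with context length for an 8B model. The constants are the public Llama 3 8B shapes, but the act_factor is an illustrative assumption, not torchtune internals or a measured value:

```python
# Very rough sketch, not torchtune code: estimate how activation/logit memory
# might scale with sequence length for a Llama-3-8B-shaped model.
VOCAB_SIZE = 128_256   # Llama 3 vocab size
HIDDEN_SIZE = 4_096    # Llama 3 8B hidden dim
NUM_LAYERS = 32
BYTES_BF16 = 2

def rough_activation_gib(seq_len, batch_size=1, act_factor=8):
    # Logits kept around for the loss: [batch, seq_len, vocab].
    logits = batch_size * seq_len * VOCAB_SIZE * BYTES_BF16
    # Crude per-layer activation estimate: act_factor copies of the hidden states.
    acts = batch_size * seq_len * HIDDEN_SIZE * NUM_LAYERS * act_factor * BYTES_BF16
    return (logits + acts) / 1024**3

for seq_len in (2048, 4096, 8192):
    print(f"{seq_len}: ~{rough_activation_gib(seq_len):.1f} GiB beyond the frozen weights")
```

Even if my constants are off, that footprint roughly doubles from 4096 to 8192 on top of the ~16 GB of bf16 base weights, which is why I'm asking whether the README numbers assume a much shorter context.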