mariokostelac closed this issue 2 months ago
OK, I've found that the same script in (what is supposed to be) the same environment (the GPUs are the same) doesn't fail on SageMaker training (batch interface), but fails on SageMaker Studio (interactive interface).
I'll start by printing the output of
`python -m "torch.utils.collect_env"`
`nvidia-smi`
to see whether there are any notable differences.
What would be the best way to find the usual culprits in environment differences?
To identify potential differences in the environment between SageMaker training and SageMaker Studio, you can print the output of the following commands:
```bash
python -m "torch.utils.collect_env"
nvidia-smi
```

Comparing the output of these commands in both environments may reveal variations that could be causing the script to fail in SageMaker Studio. Look for differences in Python packages, CUDA versions, or GPU information. This approach helps pinpoint the environmental factors contributing to the issue.
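If it helps, a small script can make the comparison mechanical. This is a minimal sketch (the script name and output path are arbitrary, not part of the repo) that captures both outputs into a single file per environment so the two reports can be diffed:

```python
# capture_env.py -- hypothetical helper: dump environment details to a file
# so the SageMaker training job and Studio environments can be compared.
import subprocess
import sys


def run(cmd):
    """Run a command and return its combined stdout/stderr as text."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr


with open("env_report.txt", "w") as f:
    f.write("=== torch.utils.collect_env ===\n")
    f.write(run([sys.executable, "-m", "torch.utils.collect_env"]))
    f.write("\n=== nvidia-smi ===\n")
    f.write(run(["nvidia-smi"]))
```

Running it once in a training job and once in Studio, then diffing the two files, makes driver, CUDA, and package version differences easy to spot.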
I've found that SageMaker Studio runs older drivers and CUDA 11.8. It also ends up with less memory available (~200MB less) for the same GPU. AWS responded that this is caused by some internal complexity (they haven't disclosed what exactly).
SageMaker training jobs get newer drivers, CUDA 12, and the extra ~200MB of GPU memory, so the run succeeds. I think the parameters I chose were right on the edge of the available GPU RAM, and losing 200MB tipped it over.
If it's expected that the second epoch needs a bit more GPU RAM, I think we can close the issue. I've spent a bit of time looking into it and found that PEFT cloning consumes a bit of RAM. It's unclear why it's not returned to the pool before the second epoch starts, but that might be expected.
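For anyone debugging the same thing, here is a minimal sketch (the epoch loop, `train_one_epoch`, and the other names are placeholders for the actual loop in finetuning.py) of logging the allocator state at epoch boundaries, to see whether the extra memory is held by live tensors or merely cached by PyTorch:

```python
import torch


def log_gpu_memory(tag):
    """Print live vs. cached GPU memory so epoch-to-epoch growth is visible."""
    allocated = torch.cuda.memory_allocated() / 2**20  # held by live tensors (MiB)
    reserved = torch.cuda.memory_reserved() / 2**20    # held by the caching allocator (MiB)
    print(f"[{tag}] allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")


# Inside the training loop (placeholder names):
# for epoch in range(num_epochs):
#     train_one_epoch(model, optimizer, train_dataloader)
#     log_gpu_memory(f"end of epoch {epoch}")
#     torch.cuda.empty_cache()  # return unused cached blocks to the driver
#     log_gpu_memory(f"after empty_cache, epoch {epoch}")
```

If `reserved` stays high while `allocated` drops, the memory is sitting in the caching allocator (possibly fragmented) rather than being held by the cloned PEFT weights.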
@mariokostelac it is not expected that the second epoch consumes more memory. However, PyTorch's memory allocation may affect this by fragmenting memory; I wonder if using this flag would help you further.
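The flag isn't named in the thread, so as an assumption: if it refers to PyTorch's caching-allocator configuration, it can be set through the `PYTORCH_CUDA_ALLOC_CONF` environment variable before the first CUDA allocation, for example:

```python
import os

# Assumption: "this flag" refers to PyTorch's caching-allocator settings.
# Must be set before the first CUDA allocation (safest: before importing torch,
# or exported in the job/notebook environment). The 128 MiB value is only an
# example; it limits block splitting to reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # the allocator reads the setting when CUDA is first used
```

Newer PyTorch versions also accept `expandable_segments:True` in the same variable, which targets fragmentation more directly.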
Please feel free to re-open if you still have an issue!
System Info
Information
🐛 Describe the bug
I'm running PEFT finetuning on a 13b model (all settings visible in the logs) and it's OOM-ing on the first backward pass of the second epoch.
What confuses me most is that the whole first epoch, including its backward passes, completes without any issue.
The script I use is a modified finetuning.py; the only difference is that it loads the config from YAML (similar to axolotl). The final config dataclasses are printed in the stdout logs (attached below).
Stdout logs (all settings visible there)
Full stderr output
Error logs
Expected behavior
Peak memory usage stays the same across epochs and training finishes successfully.
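A minimal sketch (the training call and `num_epochs` are placeholders) of how this expectation could be checked by recording peak GPU memory per epoch:

```python
import torch

num_epochs = 3  # placeholder
for epoch in range(num_epochs):
    torch.cuda.reset_peak_memory_stats()
    # train_one_epoch(model, optimizer, train_dataloader)  # placeholder
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    print(f"epoch {epoch}: peak allocated {peak_mib:.0f} MiB")
```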