neuralmagic / AutoFP8


Memory requirements for long sequences #19

Closed DreamGenX closed 3 months ago

DreamGenX commented 3 months ago

Hey there, when I ran AutoFP8 on a dataset with an 8192-token context window on Llama 3 70B, 4x80G was not enough. When I then ran it on 8x80G it worked, but memory was actually underutilized when you integrate over time -- each card peaked at maybe 70%, but never all the cards at the same time. I wonder if there's a way to further improve things.

For reference, when using ammo [1], 4x80G is enough for 70B models. Is the algorithm different from the one implemented by ammo?

[1] https://pypi.org/project/nvidia-ammo/
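
For context, here's roughly the calibration setup I was running (a paraphrased sketch from memory; treat the exact AutoFP8 API names and arguments as approximate):

```python
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder for my long-context calibration dataset.
calibration_texts = ["...long document 1...", "...long document 2..."]

# Calibration samples tokenized at the full 8192-token context window.
examples = tokenizer(
    calibration_texts,
    max_length=8192,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# Static activation scales require forward passes over the calibration set,
# which is where the multi-GPU memory pressure shows up.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(model_id, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized("Meta-Llama-3-70B-Instruct-FP8")
```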

mgoin commented 3 months ago

@DreamGenX This has been an issue I've known about for a while, but I was unsure of the reason. After a bit of investigation, I luckily found a simple mistake: PyTorch gradient computation was not being properly disabled during calibration. This PR should solve the issue: https://github.com/neuralmagic/AutoFP8/pull/20 - I tested that I was able to perform quantization with 8192 sequence length on Llama 3 70B with just 2x80GB! Thank you for the issue.
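
For anyone hitting this on an older commit, the gist of the fix is just making sure the calibration forward passes run with autograd disabled. A minimal sketch of the idea (not the literal PR diff; `run_calibration` and `calibration_batches` are illustrative names):

```python
import torch

@torch.no_grad()  # prevent autograd from retaining intermediate activations
def run_calibration(model, calibration_batches):
    # The forward passes only collect activation statistics for the FP8 scales;
    # without no_grad(), PyTorch keeps the activation graph around for backprop,
    # which at 8192-token sequences on a 70B model blows up GPU memory.
    for batch in calibration_batches:
        model(**batch)
```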

DreamGenX commented 3 months ago

Awesome, thank you! @mgoin

Can you also comment on the algorithmic differences compared to AMMO's FP8?