Closed WuNein closed 1 year ago
Just a random guess as I haven't tried yet, but is ECC enabled on your card? Try disabling it with `nvidia-smi -e 0`; re-enable with `nvidia-smi -e 1`.
The 4090 does not have ECC... The first checkpoint alone takes 13 GB, so the remaining memory is not enough.
Changing max_seq_len to 256 makes it possible to run 13B on a 4090.
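A rough back-of-the-envelope estimate shows why lowering max_seq_len helps: the reference LLaMA implementation pre-allocates the KV cache for max_batch_size sequences of max_seq_len tokens, so halving max_seq_len halves that allocation. This sketch assumes the published 13B config (40 layers, model dim 5120) and an fp16 cache; the helper name `kv_cache_bytes` is just for illustration.

```python
# Estimate the pre-allocated KV-cache size for LLaMA-13B.
# Assumes n_layers=40, dim=5120 (published 13B config) and fp16 (2 bytes).

def kv_cache_bytes(max_seq_len, max_batch_size, n_layers=40, dim=5120, dtype_bytes=2):
    # Factor of 2: separate key and value tensors are cached per layer.
    return 2 * n_layers * dim * dtype_bytes * max_seq_len * max_batch_size

GIB = 1024 ** 3
for seq_len in (512, 256):
    size = kv_cache_bytes(seq_len, max_batch_size=32)
    print(f"max_seq_len={seq_len}: {size / GIB:.2f} GiB")
```

With the defaults (max_batch_size=32) the cache drops from roughly 12.5 GiB at max_seq_len=512 to about 6.25 GiB at 256, which matches the "about 1-2 GB short" OOM described above.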
I am using the latest NVIDIA PyTorch Docker image, which supports CUDA 12. I compiled the CUDA 11.8 build of the bitsandbytes library, since the code requires bitxxx_cuda118.so. The 7B model runs fine; 13B hits CUDA out of memory, falling short by about 1-2 GB.
No system OOM error; 64 GB of RAM is installed.
I doubt whether an RTX 4090 can actually run the 13B model. Please share more detailed information about your device.