Ok, sorry, I was asking about training the model, not inference.
The 7B model was trained on 256 TPU v4 chips (roughly the same compute as 256 A100 GPUs) for 20 days. The 13B model requires double that compute, and the 3B model requires half of it.
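For a rough sense of scale, here's a back-of-envelope sketch of that compute budget. The per-chip peak throughput and utilization figures are assumptions for illustration, not numbers reported in this thread.

```python
# Back-of-envelope estimate of the pre-training compute described above.
# ASSUMPTIONS (not from this thread): ~275 TFLOP/s bf16 peak per TPU v4 chip
# and ~50% utilization are illustrative guesses.

PEAK_TFLOPS = 275          # assumed per-chip peak throughput (bf16)
UTILIZATION = 0.5          # assumed model FLOPs utilization
CHIPS = 256
DAYS = {"3b": 10, "7b": 20, "13b": 40}   # 7B = 20 days; 13B double, 3B half

for name, days in DAYS.items():
    chip_days = CHIPS * days
    total_flops = PEAK_TFLOPS * 1e12 * UTILIZATION * chip_days * 24 * 3600
    print(f"{name}: {chip_days} chip-days, ~{total_flops:.2e} FLOPs")
```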
Could you clarify whether the compute details given above are for v1 or v2 of the OpenLLaMA models? If they are for v2, please also provide the pre-training compute details for the v1 models.
The compute details are the same for the v1 and v2 models.
For inference, if your GPU supports INT8, the 7B parameter model will run in about 9 GB of VRAM, and the 13B model needs about twice that. With 4-bit quantization you can halve the VRAM requirements.
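As a rough sanity check on those numbers, here's a quick sketch of how weight memory scales with bits per parameter. The 20% overhead factor for activations and KV cache is an assumption for illustration, not a measured value.

```python
# Rough VRAM estimate: weights take (params * bits / 8) bytes, plus overhead.

def vram_gb(params_billion, bits_per_param, overhead=1.2):
    # overhead=1.2 is an assumed fudge factor for activations / KV cache
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

for params in (7, 13):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: ~{vram_gb(params, bits):.1f} GB")
```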
If you have less than 9 GB of VRAM, you can convert the model to llama.cpp's GGML format and run it in a hybrid CPU/GPU setup: fill your GPU VRAM with as many layers as will fit, at quantizations anywhere from 2-bit to 16-bit.
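If you want a ballpark for how many layers your card can hold, something like this hypothetical helper works. The layer counts and parameter figures below are rough LLaMA-style assumptions, not values from this thread.

```python
# Estimate how many transformer layers fit in a given VRAM budget at a
# given quantization. Model figures are ASSUMED approximations.

MODELS = {
    "7b":  {"layers": 32, "params_billion": 6.7},
    "13b": {"layers": 40, "params_billion": 13.0},
}

def layers_that_fit(model, vram_gb, bits_per_param):
    cfg = MODELS[model]
    # average bytes per layer, treating all parameters as evenly split
    bytes_per_layer = cfg["params_billion"] * 1e9 * bits_per_param / 8 / cfg["layers"]
    return min(cfg["layers"], int(vram_gb * 1e9 // bytes_per_layer))

# e.g. a 6 GB card running the 7B model at 8-bit:
print(layers_that_fit("7b", 6, 8))   # roughly how many layers to offload to the GPU
```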