[Closed] Moximixi closed this issue 11 months ago
Hi @Moximixi
I used LLaMA-1 (decapoda-research-llama-30b-hf) -- I believe it's 30B.
If you run it at 16-bit precision you will need around 60GB of memory, or around 120GB at full (32-bit) precision. Running inference just to get the probabilities (instead of doing decoding inference) can be fast, and I actually ran this experiment on CPU (it should finish in 10-30 minutes depending on your machine).
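For reference, here is a minimal sketch of that kind of single-forward-pass log-probability scoring with Hugging Face `transformers`, assuming the standard LLaMA classes; the actual `llama_logrob_inference.py` in this repo may differ in details, and the model name, example text, and dtype choices below are just illustrative.

```python
# Sketch: score token log-probabilities with one forward pass (no decoding).
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "decapoda-research/llama-30b-hf"  # ~30B params: ~60GB in fp16, ~120GB in fp32

tokenizer = LlamaTokenizer.from_pretrained(model_name)
# torch_dtype=torch.float16 halves memory vs. fp32; omit it to run in full precision on CPU.
model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

text = "The capital of France is Paris."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Log-prob of each token given its prefix: shift logits by one position.
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
target_ids = inputs["input_ids"][:, 1:]
token_log_probs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

print("sum log-prob:", token_log_probs.sum().item())
```

Since this only needs one forward pass per example (no autoregressive generation loop), it is feasible on CPU, just slow.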
As far as I know, the LLaMA model comes in four versions: 7B, 13B, 33B, and 65B. Which version does the figure refer to? Another question: what type of GPU is needed to run llama_logrob_inference.py?