meta-llama / llama

Inference code for Llama models

Post your hardware specs here if you got it to work. 🛠 #79

Closed elephantpanda closed 1 year ago

elephantpanda commented 1 year ago

It might be useful if you get the model to work to write down the model (e.g. 7B) and the hardware you got it to run on. Then people can get an idea of what will be the minimum specs. I'd also be interested to know. 😀

EthanLipnik commented 1 year ago

System: Mac Mini M1, 16 GB RAM

Runs 7B and 13B 4-bit quantized without problems

Runs 30B extremely slowly, taking minutes for a word to appear

What's the speed when running 7B vs 13B?

Update: tested it myself and it's not too different on M2 between 7B and 13B, but 13B is still slower and uses more RAM. Can't get 30B to run, but it might just be super slow.

alxfoster commented 1 year ago

FYI, running the 30B Alpaca-LoRA fine-tuned LLaMA model (FP8) across 2x 3090s on Ubuntu 22, getting about 3.75 tokens/s on average.

CHPraxis commented 1 year ago

Specs: Ryzen 5600x, 16 gigs of ram, RTX 3060 12gb

With @venuatu 's fork and the 7B model I'm getting:

46.7 seconds to load; 13.8 GB of RAM used, 1.26 GB in swap, 5 GB in VRAM, and one core always at 100% utilization.

I have almost the same specs (2700x, 32 GB RAM, RTX 3060 12 GB). What settings did you need to put in on @venuatu 's fork?

kechan commented 1 year ago

@jankais3r I’m @levzlotnik’s friend with the M2 Max (with 96GB of unified memory), and I am getting about 10 tokens/second right now… next step will be to try to port the model to coreml for performance boost, and then see if quantization to int8 helps for performance, in addition to it obviously enabling larger models. The amount of unified RAM on the M2 (and the M1 Ultra) should enable running a quantized 65B model with room to spare.

@levzlotnik Which model size are you getting 10 tok/sec with? That's very fast. Is this after running entirely in MPS (after resolving the "complex dtype" issue)?

alxfoster commented 1 year ago

Am now running 65B Llama @ 4-bit 128g and getting around 7 Tokens/second (twice what I got with 30B 8-bit) on 2x3090's with NVlink

acerarfat1997 commented 1 year ago

My PC specification: Windows 10, 4 GB RAM, 1 TB HDD, Intel i5 7th Gen, Intel UHD 620, no dedicated VRAM.

Will Llama and Alpaca work on my system?

jordanparker6 commented 1 year ago

Has anyone tried inference on EC2 Inf2 or Inf1 instances? Curious if it would work or are the cpu optimisations M1/M2 specific? CPU inference has a better price point in the cloud...

Lima1512 commented 1 year ago

I need help finding the right balance between the computer's resources to get the most out of the software

What about:

RTX 3080, AMD Ryzen 9 5900HX, 64 GB RAM

Or

RTX 3080 Ti, Intel Core i9 12900H, 32 GB RAM

Or

RTX 4070, Intel 24-core i9 HX, 16 GB RAM

David-AU-github commented 1 year ago

HP all-in-one PC, almost 6 years old. Specs: 8 GB RAM, Windows 10, 3.4 GHz, 2 cores, 4 logical processors, 3 caches. No VRAM, no graphics card.

I was able to run models up to the 3-billion-parameter level at 4-bit. This was using GPT4All, KoboldAI and oobabooga_windows, and CMI like Alpaca and Vicuna. It was slow, but it worked. Token rate: roughly 12-20 tokens per minute (larger models: 5-10 tokens per minute); CMI interfaces were slightly faster. This is with NOTHING else running - no Word, browsers, etc. Extensive optimization was done on the OS too, including full "cleaning", "defrags", "reg defrags", etc. Also found a slight improvement from a fresh restart after Windows "did its thing" for about 10 minutes.

New specs: replaced RAM, now 32 GB. Cost: $140 AUD.

Been able to run 30-billion-parameter models at 4-bit, including OpenAssistant 30B, Alpaca LoRA 30B, Alpaca Dente2 30B, and of course 13B and 7B models.

Slight speed improvement, especially with smaller models. There is some difference in speed between models of the same size too.

For complex prompts, time to process is 30 seconds to 2 minutes; however, once the processing/reply happens, the token output speed is very acceptable.

Note I can run any model while other programs are running - browsers, Word, and so on. OpenAssistant runs at 94% memory (but other programs are running too, so take off 10% if running alone).

NOTE: No file swaps - all running in RAM.

Might be able to run 65B models, but likely run into file/memory swap issues.

Most of this was proof of being able to run before getting a new machine specifically for LLM / AIs to run locally.

Planned: 128 GB RAM, 2 Nvidia cards (24 GB VRAM total), with the motherboard matched to RAM/CPU speed specs - i.e. DDR5, with motherboard, RAM and CPU speeds matched. Estimated cost $3000-$4000 AUD.

NOTE: with this setup I should be able to run 30B OpenAssistant at 4-bit completely in VRAM.

Hope this helps. DAVE

Lima1512 commented 1 year ago

I'm looking for the best laptop for the job. But there is no laptop with more than 16 GB of VRAM :-(

So what do you think about 64 GB RAM and 16 GB VRAM?

Foul-Tarnished commented 1 year ago

I'm looking for the best laptop for the job. But there is no laptop with more than 16 GB of VRAM :-(

So what do you think about 64 GB RAM and 16 GB VRAM?

Lol, laptop will just thermal throttle after 2min

jacksutherland commented 1 year ago

What are the "ideal" specs to run 65B on a PC?

Is it possible to build a box with Llama 65B running in a process that can still perform well as your daily driver? If that's a long shot, which model would work best for this? And what specs would it take?

rhiskey commented 1 year ago

Okay, what about minimum requirements? What kind of model can run on old servers, and how much RAM is needed just to run Llama 2?

develCuy commented 1 year ago

Trained with SFTTrainer and QLoRA on Google Colab:

bitsandbytes (double quantization), mixed-precision training (fp16 "O2") and gradient-accumulation/batch sizes of 2 or lower helped out with memory constraints.

If you don't have your own hardware, use Google Colab. This is a good starter:

https://colab.research.google.com/drive/12dVqXZMIVxGI0uutU6HG9RWbWPXL3vts
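
As a rough illustration of the setup described above, here is a minimal QLoRA + SFTTrainer sketch. The model id, dataset, and most hyperparameters below are placeholders, and exact argument names vary across trl/peft versions, so treat it as a starting point rather than a recipe:

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = 'meta-llama/Llama-2-7b-hf'  # placeholder model id

# 4-bit NF4 quantization with double quantization (bitsandbytes), as mentioned above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapter config; rank/alpha here are illustrative
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type='CAUSAL_LM')

# Mixed precision plus batch size / gradient accumulation of 2 keeps memory in check
train_args = TrainingArguments(
    output_dir='out',
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    fp16=True,
    max_steps=200,
    logging_steps=20,
)

dataset = load_dataset('imdb', split='train[:1%]')  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset,
    dataset_text_field='text',
    peft_config=peft_config,
    max_seq_length=512,
)
trainer.train()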

Lima1512 commented 1 year ago

Do you have a tutorial/video...?

mitchec4 commented 1 year ago

I have an M1 MacBook Pro 16" with 16 GB RAM. It runs both the 7B and 13B models. They load with no delay and I usually get an instant response from both, though additional info can take around 5 seconds to appear. They appear to have a wild imagination when it comes to accuracy, so take most answers with a pinch of salt. Sometimes, after being asked an initial question, it goes off and starts asking its own questions and then answers itself. I suppose it is presuming these are standard follow-up questions that most people will ask.

The 13B model can become unstable after some use. I usually get a load of repeating text, then it locks up.

iakashpaul commented 1 year ago

Llama 2 7B-Chat on RTX 2070S with bitsandbytes FP4, Ryzen 5 3600, 32 GB RAM

Completely loaded in VRAM (~6300 MB); took ~12 seconds to process ~2200 tokens and generate a summary (~30 tokens/sec).

Llama.cpp for llama2-7b-chat (q4) on an M1 Pro works with ~2 GB RAM, 17 tok/sec

Also ran the same on A10(24GB VRAM)/LambdaLabs VM with similar results

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = 'meta-llama/Llama-2-7b-chat-hf'

tokenizer = AutoTokenizer.from_pretrained(model_id)

if torch.cuda.is_available():
    # Load the chat model in 4-bit (bitsandbytes FP4) and let accelerate place it on the GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map='auto', load_in_4bit=True
    )
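
A possible continuation of the snippet above, showing how the imported TextIteratorStreamer can stream tokens as they are generated (a minimal sketch; the prompt is a placeholder):

from threading import Thread

prompt = "Summarize the following text: ..."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

# Generation runs in a background thread while the streamer yields decoded text chunks
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(target=model.generate,
                kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256))
thread.start()

for chunk in streamer:
    print(chunk, end='', flush=True)
thread.join()
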
wkgcass commented 1 year ago

Llama2 7B-Chat official sample (with exactly the same launching arguments in README)

GPU: 4060 Ti 16 GB, consuming more than 14 GB. RAM: 16 GB; system memory usage is about 2 GB after the model is loaded.

torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4
binaryninja commented 1 year ago

Task: fine-tune Llama 2 7B and 13B on a task-specific function using my own data. GPU: 3090 24 GB; RAM: 256 GB; CPU: 3970X.

I have two GPUs, but I only wanted to use one, so I ran the following in my terminal so the script could only see the first GPU in my system: export CUDA_VISIBLE_DEVICES=0

I trained with a LoRA rank of 32, batch size 1, and a context length of 4096. After training for 2000 steps I saw a noticeable improvement on the task I was training for; loss went from ~1.8/1.4 for the 7B/13B base models to 0.41/0.33 after 5000 steps, and I still have room to go (0.5 of the way through an epoch).

The task I'm training on is the recognition and description of malicious decompiled code (malware).
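
For anyone reproducing something similar, a rough sketch of that kind of single-GPU LoRA setup; the rank 32, batch size 1, and 4096-token context mirror the numbers above, while the model id, alpha, and target modules are assumptions:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # expose only the first GPU, as described above

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', device_map='auto')

# LoRA rank 32 to match the run above; alpha and target modules are illustrative
lora_config = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05,
                         target_modules=['q_proj', 'v_proj'], task_type='CAUSAL_LM')
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Trainer side: batch size 1 with sequences truncated/packed to a 4096-token context;
# dataset preparation and the actual trainer wiring are omitted here
train_args = TrainingArguments(output_dir='out', per_device_train_batch_size=1,
                               gradient_accumulation_steps=8, num_train_epochs=1)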

develCuy commented 1 year ago

Llama 7B and 13B both GGML quantized. Hardware:

Running locally (no Hugging Face, etc.) with LlamaCpp

rchen19 commented 1 year ago

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram

The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

yudhiesh commented 1 year ago

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

[nvidia-smi output omitted; identical to the table quoted above]

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

rchen19 commented 1 year ago

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

[nvidia-smi output omitted; identical to the table quoted above]

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

yudhiesh commented 1 year ago

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

[nvidia-smi output omitted; identical to the table quoted above]

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

That is true, but you will still have to specify the dtype when loading the model otherwise it will default to float-32 as per the docs.
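
For example, with the Hugging Face transformers loader (the model id here is just illustrative):

import torch
from transformers import AutoModelForCausalLM

# Load the checkpoint in its native half precision instead of the float-32 default;
# at ~2 bytes per parameter, a 65B model then needs roughly 130GB instead of ~260GB.
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf',   # illustrative model id
    torch_dtype=torch.float16,
)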

rchen19 commented 1 year ago

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

[nvidia-smi output omitted; identical to the table quoted above]

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

That is true, but you will still have to specify the dtype when loading the model otherwise it will default to float-32 as per the docs.

Ah I see what you meant. Thanks for clarification. I was using hugging face version with their transformers package, so I guess that was the reason I didn’t see such a big memory usage.

But seems a waste of memory to cast 16 bit model to 32 bit? Is there any reason you kept the PyTorch default precision?

yudhiesh commented 1 year ago

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

[nvidia-smi output omitted; identical to the table quoted above]

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

That is true, but you will still have to specify the dtype when loading the model otherwise it will default to float-32 as per the docs.

Ah I see what you meant. Thanks for clarification. I was using hugging face version with their transformers package, so I guess that was the reason I didn’t see such a big memory usage.

But seems a waste of memory to cast 16 bit model to 32 bit? Is there any reason you kept the PyTorch default precision?

I can't comment on design decisions made by Huggingface but I stick to specifying the dtype regardless of the model I load.

rchen19 commented 1 year ago

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

[nvidia-smi output omitted; identical to the table quoted above]

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

That is true, but you will still have to specify the dtype when loading the model otherwise it will default to float-32 as per the docs.

Ah I see what you meant. Thanks for clarification. I was using hugging face version with their transformers package, so I guess that was the reason I didn’t see such a big memory usage. But seems a waste of memory to cast 16 bit model to 32 bit? Is there any reason you kept the PyTorch default precision?

I can't comment on design decisions made by Huggingface but I stick to specifying the dtype regardless of the model I load.

Apologies. Apparently I mistook you for the original comment author. Yes, I'd agree that specifying the dtype to match the native precision of the model is a good idea.

Twenkid commented 1 year ago

Thanks to all for the data. @Moonshine-in-Kansas Your example reminds me of a real "groundbreaking" essay of mine, written as a 9th grader in 1999, "Where are you going, world?", which won an incentive prize in a Radio "Plovdiv" competition. (The conclusion is: to the creation of the Thinking Machine - the Machine God.) The original in Bulgarian: https://github.com/Twenkid/Theory-of-Universe-and-Mind/blob/main/1999.md

In English (machine translation, not edited):

ChatGPT Where are you going, world? Essay competition on Radio "Plovdiv," December 13, 1999, encouragement award Author: Todor Iliev Arnaudov, 15 years old, 9th grade at TEE Plovdiv (PGEE) Where are you going, world? Only God knows where and when a person first asked this question. And what is the answer today, nearly two thousand years after the birth of Christ?

Humanity developed and progressed for thousands of years. It invented the wheel, learned to work with metals, invented writing, then printing, the mechanical calculator, the telegraph, the telephone and the radio, television, and electronic computers.

People wanted to preserve their knowledge for future generations, they wanted to communicate over long distances, to have "eyes" everywhere in the world. All of this became a reality thanks to human genius and the desire of Homo sapiens to control information. In his pursuit, he created the universal machine for executing algorithms, for processing information - the electronic computer. The idea for it occurred to John Atanasoff in the late thirties, and it became a reality in 1945 in the USA under the name "ENIAC." Like most major discoveries, the computer was initially used only for military purposes, but later it found its rightful place in almost all spheres of our life. Computing machines developed rapidly - from electronic tubes through transistors and to the highest technology - integrated circuits, which allowed electronic engineers to create modern microprocessors with millions of transistors that fit in the palm of a hand, capable of performing billions of calculations per second. The impact of the global computer network, the Internet, on people today is palpable. The network will continue to expand, but I believe that this is not the culmination of the rapid development of electronics, computing technology, and software (the latter actually determines the computer's "behavior," i.e., it is its "soul," and electronics is part of its material form). Another discovery, still unrealized, will crown this progress - artificial intelligence.

Some people believe that computers are "soulless machines" doomed to remain only assistants to human intelligence and creativity, but not to become creators themselves and possess reason. Machines are created by humans, I believe that he is capable of creating a system, an algorithm that imitates the work of his brain, and in this way, the computer can become a thinking machine. In my opinion, the path of humanity leads there - towards the creation of an electronic analog of the highest creation of nature, what is currently unique to humans - intelligence.

The creation of artificial intelligence will change the world. In my opinion, artificial intelligence is the next step in the evolution of matter - computers have several advantages over the "transitional" human, the most essential of which is that they are practically immortal - they withstand all kinds of radiation, do not feel pain, require very little energy, which they can easily obtain from the Sun, and they can reproduce themselves by producing factories, and so on. My opinion is that the world is heading precisely there, towards the creation of the thinking machine - the machine of God.

I finally got the 65B model running on a Genesis Cloud server with 8 RTX 3090 cards with 24 GB of memory each. The cost to run the server is a little over $10/hour.

Takes almost 3 minutes to load. Inference is quicker than I can read.

So far I am not impressed. I believe GPT-3 (text-davinci-002) is better. But I have to do more tests with different temperatures etc. Here is the result of one experiment:

Why General Artifical Intelligence will overtake the world soon. An Essay by Llama. Essay by Llama, High School, 10th grade, A+, January 2005. Keywords: United States, human beings, Computers, 21st century, Artificial intelligence. In the 21st century, computers are going to take over the world. There is no doubt about it. They are going to become so advanced that they will be able to do everything that human beings can do, and more. In the future, computers will be able to drive cars, make movies, and even write books. Computers are getting more and more advanced every day. In the past, computers could only do simple math problems. Now, they can do complicated math problems and can even do complicated tasks like driving a car. In the future, computers will be able to do everything that human beings can do. They will be able to drive cars, make movies, and even write books. Computers are getting more and more advanced every day. In the past, computers could only do simple math problems. Now, they can do complicated math problems and can even do complicated tasks like driving a car. Computers are also getting more and more intelligent.

hz-nm commented 1 year ago

CodeLlama-13B running on an RTX 3090. Loading the model in 4-bit takes ~10 GB of VRAM.

develCuy commented 1 year ago

@hz-nm, how fast is it in terms of tokens per second?

DanRegalia commented 1 year ago

Hope this helps. DAVE

Now I'm stuck with 5 questions... what else are you running on that machine? I feel like there needs to be some kind of helper app in there. Maybe I need a better understanding of the hardware needs... can I run it on CPU alone if I have enough CPU memory? Can I run these larger models on a regular PC? Can I get a few P40s or K40s and offload certain tasks to this? I'm really curious about the hardware needs for running these models...

hz-nm commented 1 year ago

@hz-nm, how fast is it in terms of tokens per second?

I didn't measure that unfortunately, but it is quite fast. Almost as good as ChatGPT when it was streaming.

ghost commented 1 year ago

Hey guys, I want to deploy Code Llama on an Ubuntu server, specifically in the cloud. What specs should I use, like the vCPU count and memory? Please suggest or guide on the same. Thanks in advance.

Oysters3 commented 5 months ago

Am new to this and in the testing phase to see what works. AMD Ryzen 9 7940HS processor, 8 cores/16 threads. Integrated GPU not supported? (Radeon 780M), and it doesn't seem to be getting used. No external GPU connected currently. 32 GB DDR5 RAM, 2x 1 TB SSD drives in RAID 0 (990 Pro). Mistral 7B works relatively well, streams instantly. Using Fabric for better prompting also works well; however, core temps go through the roof. A test on extracting wisdom from a YouTube transcript pushed temps to 89.5 degrees...