meta-llama / llama

Inference code for Llama models

Post your hardware specs here if you got it to work. 🛠 #79

Closed elephantpanda closed 1 year ago

elephantpanda commented 1 year ago

It might be useful if you get the model to work to write down the model (e.g. 7B) and the hardware you got it to run on. Then people can get an idea of what will be the minimum specs. I'd also be interested to know. 😀

Urammar commented 1 year ago

7B takes about 14GB of VRAM for inference, and the 65B needs a cluster with a total of just shy of 250GB of VRAM.

The 7B model also takes about 14GB of system RAM, which seems to exceed the capacity of free Colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+
ouening commented 1 year ago

The 7B model ran successfully under the following environment:

Env: PyTorch 1.11.0, Python 3.8 (Ubuntu 20.04), CUDA 11.3
GPU: RTX A4000 (16GB) * 1
CPU: 12 vCPU Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
RAM: 32GB

With some modification:

model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=1, **params)  # reduce max_seq_len to 1024 and max_batch_size to 1

model = Transformer(model_args).cuda().half() # some people say it doesn't help

prompts = ["What is the most famous equation from this theory?"]


Logophoman commented 1 year ago

@Urammar could you also post how much VRAM the other 2 models need? I feel like this could help a lot of people to know what their machine can actually support. I only have a single A100 40GB and can therefore only run the 7B parameter model atm... 😅

ahoho commented 1 year ago

Not sure if this will be helpful, but I made a spreadsheet to calculate the memory requirements for each model size, following the FAQ and Paper. You can make a copy to adjust the batch size and sequence length

Will update as necessary
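
For a quick ballpark without opening the spreadsheet, the same arithmetic can be sketched in a few lines of Python: fp16 weights plus the KV cache, ignoring activation overhead. The per-model dimensions below are the published ones from the paper; treat the output as a rough estimate only.

# Rough fp16 inference-memory estimate: weights + KV cache (activations ignored).
MODELS = {            # name: (n_params, n_layers, d_model) from the LLaMA paper
    "7B":  (6.7e9,  32, 4096),
    "13B": (13.0e9, 40, 5120),
    "33B": (32.5e9, 60, 6656),
    "65B": (65.2e9, 80, 8192),
}

def estimate_gib(name, batch_size=1, seq_len=2048, bytes_per_elem=2):
    n_params, n_layers, d_model = MODELS[name]
    weights = n_params * bytes_per_elem
    kv_cache = 2 * n_layers * batch_size * seq_len * d_model * bytes_per_elem
    return (weights + kv_cache) / 2**30

for name in MODELS:
    print(f"{name}: ~{estimate_gib(name):.0f} GiB")   # 7B comes out around 14 GiB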

NightMachinery commented 1 year ago

How much VRAM does the 7B model need for finetuning? Are the released weights 32-bits?

gmorenz commented 1 year ago

I just made enough code changes to run the 7B model on the CPU. That involved:

  • Replacing torch.cuda.HalfTensor with torch.BFloat16Tensor
  • Deleting every line of code that mentioned cuda

I also set max_batch_size = 1, removed all but 1 prompt, and added 3 lines of profiling code.

Steady state memory usage is <14GB (but it did use something like 30 while loading the model). It took 7.75 seconds to load the model (some memory swapping occurred during this so it may not be representative), 183 seconds to generate the first token, and 23 seconds to generate each token thereafter. It's only using a single CPU core for some reason (that I haven't tracked down yet).

Hardware: Ryzen 5800x, 32 GB ram
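
For reference, a minimal paraphrase of the kind of change involved (the actual diff is in the cpu branch linked a few comments below; the commented-out Transformer lines are illustrative only):

import torch

# The stock example.py puts everything on the GPU in fp16 via the default tensor type.
# CPU variant: default to bfloat16 on the host and drop every .cuda() call.
torch.set_default_tensor_type(torch.BFloat16Tensor)

# model = Transformer(model_args)   # stays on the CPU
# tokens = torch.full((1, total_len), pad_id, dtype=torch.long)   # no .cuda()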

ergosumdre commented 1 year ago

I just made enough code changes to run the 7B model on the CPU. That involved

  • Replacing torch.cuda.HalfTensor with torch.BFloat16Tensor
  • Deleting every line of code that mentioned cuda

I also set max_batch_size = 1, removed all but 1 prompt, and added 3 lines of profiling code.

Steady state memory usage is <14GB (but it did use something like 30 while loading the model). It took 7.75 seconds to load the model (some memory swapping occurred during this so it may not be representative), 183 seconds to generate the first token, and 23 seconds to generate each token thereafter. It's only using a single CPU core for some reason (that I haven't tracked down yet).

Hardware: Ryzen 5800x, 32 GB ram

Can I ask you the biggest favor and provide your example.py file? :)

gmorenz commented 1 year ago

Can I ask you the biggest favor and provide your example.py file? :)

This is probably what you want (the changes aren't just in example.py): https://github.com/gmorenz/llama/tree/cpu

ergosumdre commented 1 year ago

Gotcha. So all we would run is

python3 llama/generation.py --max_gen_len 1 ?

gmorenz commented 1 year ago

python3 -m torch.distributed.run --nproc_per_node 1 example.py --ckpt_dir ~/LLaMA/7B/ --tokenizer_path ~/LLaMA/tokenizer.model --max_batch_size 1

Is more like it... also remove the extra prompts in the hardcoded prompts array. Also reduce max_gen_len if you want it to take less than 1.6 hours (but I just let that part run).

fabawi commented 1 year ago

I was able to run 7B on two 1080 Ti (only inference). Next, I'll try 13B and 33B. It still needs refining but it works! I forked LLaMA here:

https://github.com/modular-ml/wrapyfi-examples_llama

and have a readme with the instructions on how to do it:

LLaMA with Wrapyfi

Wrapyfi enables distributing LLaMA (inference only) on multiple GPUs/machines, each with less than 16GB VRAM

Currently it distributes across two cards only, using ZeroMQ. Flexible distribution will be supported soon!

This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon! UPDATE: Tested on two 3080 Tis as well!!!

How to?

  1. Replace all instances of and before running the scripts

  2. Download LLaMA weights using the official form below and install this wrapyfi-examples_llama inside conda or virtual env:

    git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
    cd wrapyfi-examples_llama
    pip install -r requirements.txt
    pip install -e .
  3. Install Wrapyfi with the same environment:

    git clone https://github.com/fabawi/wrapyfi.git
    cd wrapyfi
    pip install .[pyzmq]
  4. Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:

    cd wrapyfi/standalone 
    python zeromq_proxy_broker.py --comm_type pubsubpoll
  5. Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important; don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):

    CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
  6. Now start the second instance (within this repo and env):

    CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
  7. You will now see the output on both terminals

  8. EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker by setting the IP, and provide the env variables for steps 5 and 6, e.g.:

    ### (replace 10.0.0.101 with <YOUR_IP>) ###
    
    # step 4 modification 
    python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll
    
    # step 5 modification
    CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1
    
    # step 6 modification
    CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0
gmorenz commented 1 year ago

With this code I'm able to run the 7B model on

Ram: 32GB (14.4GB sustained use, more during startup)
CPU: Ryzen 5800x, exactly one core is used at 100%
Graphics: RTX 2070 Super, only 1962MiB vram used by pytorch

It generates tokens at roughly 4.5 seconds/token. I have reason to believe that I can get that down to 2.0 seconds/token with more careful memory management (I've done it, but it leaks memory on the CPU side, leading to an OOM). It (now) generates tokens at roughly 1 second/token.

All the code is doing is storing the weights on the CPU and moving them to the GPU just before they're used (and then back; ideally we'd just copy them to the GPU and never move them back, but I think that will take a more extensive change to the code).
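
For illustration, a minimal hypothetical sketch of that pattern using PyTorch forward hooks (not the actual code in the branch, just the idea):

import torch
import torch.nn as nn

def offload_layers(layers: nn.ModuleList, device: str = "cuda"):
    # Keep each layer's weights on the CPU and shuttle them to the GPU
    # only for the duration of that layer's forward pass.
    for layer in layers:
        def to_gpu(module, args):
            module.to(device)                                    # copy weights in
            return tuple(a.to(device) if torch.is_tensor(a) else a for a in args)

        def to_cpu(module, args, output):
            module.to("cpu")                                     # ...and straight back out
            return output

        layer.register_forward_pre_hook(to_gpu)
        layer.register_forward_hook(to_cpu)

Skipping the move back to the CPU (when the weights fit) is what a later comment in this thread describes as the bigger speedup.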

elephantpanda commented 1 year ago

My results are in just to prove it works with only 12GB system ram! #105

Model: 7B, System RAM: 12GB 😱, VRAM: 16GB (GPU: Quadro P5000), System: Shadow PC

Took about a minute to load the model; it was maxing out the RAM and chomping on the page file. 😉 Loaded model in 116.71 seconds. But then it was quite quick to generate the results.

Changes I made to example.py

# use the gloo backend instead of nccl (no NVIDIA NCCL needed)
torch.distributed.init_process_group("gloo")

# batch size of 1 to keep memory usage down
model_args: ModelArgs = ModelArgs(max_seq_len=max_seq_len, max_batch_size=1, **params)

# load the checkpoint onto the CPU
with torch.no_grad():
    checkpoint = torch.load(ckpt_path, map_location="cpu")

# pass max_batch_size=1 into load()
generator = load(ckpt_dir, tokenizer_path, local_rank, world_size, max_seq_len, 1)
neuhaus commented 1 year ago

Hardware:

Llama 13B on a single RTX 3090

In case you haven't seen it: There is a fork at https://github.com/tloen/llama-int8 by @tloen that uses INT8.

I managed to get Llama 13B to run with it on a single RTX 3090 with Linux! Make sure not to install bitsandbytes from pip; install it from GitHub!

With 32GB RAM and 32GB swap, quantizing took 1 minute and loading took 133 seconds. Peak GPU usage was 17269MiB.

Kudos @tloen! 🎉
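
As I understand it, the int8 fork works by swapping the model's linear layers for bitsandbytes 8-bit linears. A rough illustration of that building block (not the fork's actual code; the layer sizes are made up):

import torch
import bitsandbytes as bnb

# fp16 linear as it exists in the model
lin_fp16 = torch.nn.Linear(4096, 4096, bias=False).half()

# 8-bit replacement: weights are quantized to int8 when the module moves to the GPU,
# with an fp16 path kept for outlier features (the LLM.int8() scheme)
lin_int8 = bnb.nn.Linear8bitLt(4096, 4096, bias=False, has_fp16_weights=False, threshold=6.0)
lin_int8.load_state_dict(lin_fp16.state_dict())
lin_int8 = lin_int8.cuda()

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = lin_int8(x)   # roughly half the weight memory of the fp16 layer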

Llama 7B

Software:

What I had to do to get it (7B) to work on Windows:

Loading the model takes 5.1 seconds. nvidia-smi output at default max_batch_size 32:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.49       Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:07:00.0  On |                  N/A |
| 30%   55C    P2   307W / 350W |  22158MiB / 24576MiB |     76%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

On Ubuntu Linux 22.04.2 I was able to run the example with torchrun without any changes. Loading the model from an NTFS partition is a bit slower at 6.7 seconds, and memory usage was 22916MiB / 24576MiB. NVIDIA drivers 530.30.02, CUDA 12.1.

venuatu commented 1 year ago

I have a version working with a batch size of 16 on a 2080 (8GB) using the 7B model. It's available at https://github.com/venuatu/llama. My changes were:

And from that I get around half an hour for 16 outputs of 512 length. It seemed like the average was 3 seconds per forward pass at 16 batch size.

The most random output for me so far has been a bunch of floor related negative tweets, which came from the tweet sentiment analysis prompt

Tweet: "Roscoe just peed on the floor. I was not expecting this."
Sentiment: Negative
###
Tweet: "My cat just licked the floor. "
Sentiment: Negative
###
Tweet: "My dog just peed on the floor. I was not expecting this."
Sentiment: Negative
gmorenz commented 1 year ago

@venuatu - check out my code for how I avoided doing a .cpu() on the layer after being done with it - that gave me a 4x speedup over naively moving the layer back and forth between the gpu and cpu (when measured with a batch_size of 1).

I'm also curious why you're doing torch.cuda.empty_cache()? That seems like it's just going to force cuda to reallocate the buffers for the layer it just moved off of the gpu when it moves the next layer onto the gpu.

venuatu commented 1 year ago

Yep, that's a much better way to do it. It's now running in half the time (ty @gmorenz): 2080 (8GB), ~16 minutes for 512 tokens at 16 batch size.

The empty_cache may not have been necessary. With other models in the past I've had buffers get stuck on the GPU, but that is not happening here; maybe PyTorch has improved that upstream.

venuatu commented 1 year ago

I found some fixes for the very slow load times and it's now down to 2.5 seconds (with a hot file cache) from my previous 83 seconds.

reycn commented 1 year ago

Apple Silicon M1, CPU mode

MindSetFPS commented 1 year ago

Specs: Ryzen 5600x, 16 gigs of ram, RTX 3060 12gb

With @venuatu 's fork and the 7B model im getting:

46.7 seconds to load, 13.8GB of RAM used, 1.26GB in swap, 5GB in VRAM, and one core always at 100% utilization.

pavelzbornik commented 1 year ago

My specs: GTX 1630 4GB, i5-13400F, 128GB RAM, Win 11

Using 7B, model loading time 5.61 sec

Used @gmorenz's fork, which enabled my tiny GPU to run it :), and changed from nccl to gloo:

torch.distributed.init_process_group("gloo")

Moonshine-in-Kansas commented 1 year ago

I finally got the 65B model running on a genesiscloud server with 8 RTX 3090 cards (24GB memory each). The cost to run the server is a little over $10/hour.

Takes almost 3 minutes to load. Inference is quicker than I can read.

So far I am not impressed. I believe GPT-3 (text-davinci-002) is better. But I have to do more tests with different temperatures etc. Here is the result of one experiment:


Why General Artifical Intelligence will overtake the world soon. An Essay by Llama.
Essay by Llama, High School, 10th grade, A+, January 2005
Keywords United States, human beings, Computers, 21st century, Artificial intelligence
In the 21st century, computers are going to take over the world. There is no doubt about it. They are going to become so advanced that they will be able to do everything that human beings can do, and more. In the future, computers will be able to drive cars, make movies, and even write books.
Computers are getting more and more advanced every day. In the past, computers could only do simple math problems. Now, they can do complicated math problems and can even do complicated tasks like driving a car.
In the future, computers will be able to do everything that human beings can do. They will be able to drive cars, make movies, and even write books.
Computers are getting more and more advanced every day. In the past, computers could only do simple math problems. Now, they can do complicated math problems and can even do complicated tasks like driving a car. Computers are also getting more and more intelligent.

andrewssobral commented 1 year ago

Hello guys, I am also interested to see how to run LLaMA (e.g. 7B model) on Mac M1 or M2, any solution until now?

gmorenz commented 1 year ago

I have the 65B (120GB) model working at 60 seconds/token on:

GPU: Nvidia RTX 2070 super (8GB vram, 5946MB in use, only 18% utilization)
CPU: Ryzen 5800x, less than one core used
RAM: 32GB, only a few GB in continuous use, but pre-processing the weights with 16GB or less might be difficult
SSD: 122GB in continuous use with 2GB/s read. Pre-processing the weights was done in double that, but it could easily be modified to work in 138GB.

SSD read speed is (of course) the bottleneck - I'm just loading every layer from disk before using it and freeing all the memory (RAM and VRAM) afterwards. Will clean up the code and push it tomorrow.

Goes without saying that at 60 seconds/token the utility of this is... questionable.
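
A hypothetical sketch of the streaming idea (illustrative only: the per-layer weight files and the function are assumptions, not the code being cleaned up for release, and torch.func needs a recent PyTorch):

import torch
from torch.func import functional_call

def run_layer_from_disk(layer, weight_path, x, device="cuda"):
    # Read just this layer's weights from the SSD, run the layer, then drop them,
    # so neither RAM nor VRAM ever holds more than one layer's weights at a time.
    state = {k: v.to(device) for k, v in torch.load(weight_path, map_location="cpu").items()}
    out = functional_call(layer, state, (x.to(device),))
    del state
    torch.cuda.empty_cache()
    return out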

applefreak commented 1 year ago

Hello guys, I am also interested to see how to run LLaMA (e.g. 7B model) on Mac M1 or M2, any solution until now?

I tried 7B with the CPU version on an M2 Max with 64GB RAM; it's slow as heck but it works! Load time is around 84 secs and it takes about 4 mins to generate a response with max_gen_len=32

Input:

The Z80 is a processor that

Output:

The Z80 is a processor that 8-bit microcomputer manufacturers used from 1976 to 1992. The Z80 was developed by the

Edit: on a 2nd try, the model load time is reduced to 34secs, not sure what changed, but keep in mind I'm running this in a Docker container (using continuumio/miniconda3 image) with interactive shell. I allocated 8 CPUs and all 64GB ram for Docker in the Docker Desktop app.

YellowRoseCx commented 1 year ago

Anyone have info regarding use with AMD GPUs? The 7b LLaMa model loads and accepts up to 2048 context tokens on my RX 6800xt 16gb

I keep seeing people talking about VRAM requirements when running in 8 bit mode and no one's talking about normal 16 bit mode lol

terbo commented 1 year ago

Got 7B loaded on 2x 8GB 3060's, using Kobold United, the dev branch, getting about 3 tokens/second.

terbo: what is life? llamabot: I think life is just something that all living things have to make their way through

elephantpanda commented 1 year ago

Anyone have info regarding use with AMD GPUs? The 7b LLaMa model loads and accepts up to 2048 context tokens on my RX 6800xt 16gb

I keep seeing people talking about VRAM requirements when running in 8 bit mode and no one's talking about normal 16 bit mode lol

Does CUDA work on AMD? Someone tried to make a DirectML port: #117, which should work on AMD (on Windows), but it hasn't been tested so it might need some fixing.

randaller commented 1 year ago

Successfully running LLaMA 7B, 13B and 30B on a desktop CPU 12700k with 128 Gb of RAM; without videocard. https://github.com/randaller/llama-cpu

chris-aeviator commented 1 year ago

I have the 65B (120GB) model working at 60 seconds/token on:

GPU: Nvidia RTX 2070 super (8GB vram, 5946MB in use, only 18% utilization)
CPU: Ryzen 5800x, less than one core used
RAM: 32GB, Only a few GB in continuous use but pre-processing the weights with 16GB or less might be difficult
SSD: 122GB in continuous use with 2GB/s read. Pre-processing the weights done in double that, but could easily be modified to work in 138GB.

SSD read speed is (of course) the bottleneck - I'm just loading every layer from disk before using it and freeing all the memory (RAM and VRAM) afterwards. Will clean up the code and push it tomorrow.

Goes without saying that at 60 seconds/token the utility of this is... questionable.

for anybody wondering how exactly to do that, there's a (low-level) lib for that https://github.com/kir-gadjello/zipslicer

gmorenz commented 1 year ago

As promised, code for running while loading weight files from disk: https://github.com/gmorenz/llama/tree/ssd

Usage

python3 break_out_weights.py downloaded_weights/65B/*.pth new_weights/65B
cp downloaded_weights/65B/params.json new_weights/params.json
python3 example.py --ckpt_dir new_weights/65B/ --tokenizer_path downloaded_weights/tokenizer.model

for anybody wondering how exactly to do that, there's a (low-level) lib for that https://github.com/kir-gadjello/zipslicer

Interesting, I wasn't aware of this library and just did it all by hand. I suspect my way of doing it is slightly more efficient (creating flat files which perfectly fit the data), but this could probably be used to drastically reduce the memory usage of the weights preprocessing step if nothing else.
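
For the curious, the general shape of that preprocessing step is something like the following simplified sketch (one flat file per tensor; the real break_out_weights.py additionally merges the model-parallel shards, which is skipped here):

import os
import sys
import torch

ckpt_path, out_dir = sys.argv[1], sys.argv[2]
os.makedirs(out_dir, exist_ok=True)

state = torch.load(ckpt_path, map_location="cpu")
for name, tensor in state.items():
    # e.g. new_weights/65B/layers.0.attention.wq.weight.pth
    torch.save(tensor.clone(), os.path.join(out_dir, name + ".pth"))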

levzlotnik commented 1 year ago

Apple M2 GPU, 7B with FP16. Will edit comment soon with details.

EDIT:

We ran it on my friend's M2 and focused on running it end-to-end (E2E) on the MPS backend.

There were 2 issues we encountered:

  1. The MPS backend doesn't support complex64 arithmetic yet for the rotary embeddings.
  2. In M2 GPU floating-point arithmetic, x + (-inf) = nan.

So we made slight changes to the implementation to enable it to run fully on MPS.

Supporting Complex Arithmetic

Here we essentially implemented the same thing manually with just the float32 datatype, by creating a new dataclass.

from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple

import torch

@dataclass
class ComplexTensorPair:
    def to(self, *args, **kwargs):
        return ComplexTensorPair(
            self.real.to(*args, **kwargs),
            self.imag.to(*args, **kwargs)
        )
    def __getitem__(self, idx):
        return ComplexTensorPair(
            self.real[idx],
            self.imag[idx]
        )    
    @property
    def ndim(self):
        return self.real.ndim

    real: torch.Tensor
    imag: torch.Tensor

    @property
    def shape(self):
        return self.real.shape

    def view(self, *args, **kwargs):
        return ComplexTensorPair(
            self.real.view(*args, **kwargs),
            self.imag.view(*args, **kwargs)
        )

    def __mul__(self, other: ComplexTensorPair):
        real = self.real * other.real - self.imag * other.imag
        imag = self.real * other.imag + self.imag * other.real
        result = ComplexTensorPair(real, imag)
        return result

def _view_as_complex(tensor: torch.Tensor):
    assert tensor.shape[-1] == 2
    return ComplexTensorPair(
        real=tensor[..., 0],
        imag=tensor[..., 1]
    )

def _view_as_real(complex_tensor: ComplexTensorPair):
    return torch.stack(
        [complex_tensor.real, complex_tensor.imag],
        dim=-1
    )

And we replaced all the functions that use it with the following:

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)  # type: ignore
    freqs = torch.outer(t, freqs).float()  # type: ignore
    # freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    freqs_cos = torch.cos(freqs)
    freqs_sin = torch.sin(freqs)
    freqs_cis = ComplexTensorPair(freqs_cos, freqs_sin)
    return freqs_cis

def reshape_for_broadcast(freqs_cis: ComplexTensorPair, x: ComplexTensorPair):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)

def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: ComplexTensorPair
) -> Tuple[torch.Tensor, torch.Tensor]:
    xq_ = _view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = _view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = _view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = _view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
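
A quick CPU sanity check (not part of the original change) that the float-only pair multiply matches PyTorch's built-in complex multiply:

a = torch.randn(4, 8, 2)
b = torch.randn(4, 8, 2)
pair = _view_as_real(_view_as_complex(a) * _view_as_complex(b))
ref = torch.view_as_real(torch.view_as_complex(a) * torch.view_as_complex(b))
assert torch.allclose(pair, ref, atol=1e-6)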

Dealing with NaNs

The NaNs originate from the mask application pre-softmax in the attention layer, specifically in

https://github.com/facebookresearch/llama/blob/57b0eb62de0636e75af471e49e2f1862d908d9d8/llama/model.py#L142-L143

So instead, we created masks and applied them using torch.where:

def make_mask(inf_based_mask: torch.Tensor):
    return torch.logical_not(torch.isinf(inf_based_mask))

def apply_mask(x: torch.Tensor, mask: Optional[torch.Tensor]):
    return torch.where(mask, x, float("-inf"))

# In Transformer module:
class Transformer(nn.Module):
    ...
    def forward(...):
        ...
        if seqlen > 1:
            mask = torch.full((1, 1, seqlen, seqlen), float("-inf"), device=tokens.device)
            mask = torch.triu(mask, diagonal=start_pos + 1).type_as(h)
            mask = make_mask(mask)

# In Attention module:
class Attention(nn.Module):
    ...
    def forward(...):
        ...
        if mask is not None:
            # instead of:
            # scores = scores + mask
            # use
            scores = apply_mask(scores, mask)

And this enabled the model to run E2E on the MPS backend.
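
On the CPU the boolean-mask path is numerically equivalent to the additive -inf mask, which is easy to check (for illustration only; on MPS it is the additive path that produced the NaNs):

scores = torch.randn(1, 1, 4, 4)
inf_mask = torch.triu(torch.full((1, 1, 4, 4), float("-inf")), diagonal=1)
bool_mask = make_mask(inf_mask)
assert torch.equal(apply_mask(scores, bool_mask), scores + inf_mask)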

andrewssobral commented 1 year ago

@levzlotnik cool, how many changes you did in the source code?

jankais3r commented 1 year ago

I can run a 13B model accelerated on GPU via MPS backend on 64GB M1 Max MacBook Pro. Takes around 40s to load the model, and then generates around 2 words per second after that.

Repo: https://github.com/jankais3r/llama_mps

andrewssobral commented 1 year ago

Thanks for sharing @jankais3r

acatovic commented 1 year ago

I successfully loaded and queried the 7B model using @venuatu's repo, https://github.com/venuatu/llama. My specs are:

The full load (conversion to pyarrow, loading checkpoints, tokenizer and model) took about 4 minutes.

levzlotnik commented 1 year ago

@levzlotnik cool, how many changes you did in the source code?

Updated my comment with the details, sorry for the delay.

@jankais3r Thanks for posting your version on MPS! It seems like your version avoids the issues with the complex dtype by offloading to the CPU; you might be interested in how to move those parts back to MPS? I provide these details in my updated comment 😸

jankais3r commented 1 year ago

@levzlotnik cool, how many changes you did in the source code?

Updated my comment with the details, sorry for the delay.

@jankais3r Thanks for posting your version on MPS! It seems like your version avoids the issues with the complex dtype by offloading to the CPU; you might be interested in how to move those parts back to MPS? I provide these details in my updated comment 😸

Thanks for the details, very cool. Do you see any meaningful performance gains when running it fully on MPS versus the mixed approach?

levzlotnik commented 1 year ago

@levzlotnik cool, how many changes you did in the source code?

Updated my comment with the details, sorry for the delay. @jankais3r Thanks for posting your version on MPS! It seems like your version avoids the issues with the complex dtype by offloading to the CPU; you might be interested in how to move those parts back to MPS? I provide these details in my updated comment 😸

Thanks for the details, very cool. Do you see any meaningful performance gains when running it fully on MPS versus the mixed approach?

Honestly I didn't try it (it's my friend's machine after all...) but if you do please share!

KinanSy commented 1 year ago

Anyone tried to run it on an AMD GPU?

randaller commented 1 year ago

Finally I was able to run the 65B model on a 12700k / 128GB RAM / 3070 Ti in bfloat16.

Used @venuatu's repo, https://github.com/venuatu/llama, and a tool from my repo, https://github.com/randaller/llama-cpu/blob/main/merge-weights.py, to unshard the weights. I then put this single merged-weights file into the model folder instead of the bunch of .pth files, and venuatu's version was able to load the weights correctly. So only merged.pth and params.json (and the pyarrow subfolder too, for sure) should be in the model folder. A batch size of 4 is OK on the 3070 Ti for 30B and 65B; 13B runs well with batch_size=16. Max swap file usage was 172GB and I'm getting a single token in about 2 minutes. But I can see the generation process and can terminate it.

A little patch is required: in venuatu's example.py, replace the line

model.load_state_dict(checkpoint, strict=False)

to

# load the single merged checkpoint (the last .pth file in the model folder)
checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
model.load_state_dict(torch.load(checkpoints[-1]), strict=False)


trsohmers commented 1 year ago

@jankais3r I'm @levzlotnik's friend with the M2 Max (with 96GB of unified memory), and I am getting about 10 tokens/second right now… the next step will be to try to port the model to coreml for a performance boost, and then see if quantization to int8 helps performance, in addition to it obviously enabling larger models. The amount of unified RAM on the M2 (and the M1 Ultra) should enable running a quantized 65B model with room to spare.

jim-jinming commented 1 year ago

Successfully running LLaMA 7B, 13B and 30B on a desktop CPU 12700k with 128 Gb of RAM; without videocard. https://github.com/randaller/llama-cpu

Would you mind telling us how the speed is?

randaller commented 1 year ago

Successfully running LLaMA 7B, 13B and 30B on a desktop CPU 12700k with 128 Gb of RAM; without videocard. https://github.com/randaller/llama-cpu

Would you mind telling us how the speed is?

It is written on the repo's page.

Now I am using my new repo https://github.com/randaller/llama-chat and achieving a token in a few seconds on the 30B model and a token in 2 minutes for the 65B model.

alexcardo commented 1 year ago

Can anyone who was able to run this model provide a text example of how each variant (I mean 7B, 13B, 30B, 65B) copes with the same prompt? It would be perfect to see a 1000+ word example to compare.

As I only have a MacBook M1 8GB (on the way now) and am thereby unable to run it myself, I would appreciate it. Thank you in advance.

kcchu commented 1 year ago

LLaMA 13B works on a single RTX 4080 (16GB VRAM)

System:

LLaMA:

LLM.int8 fork: https://github.com/tloen/llama-int8 Batch size: 1

Running example.py in llama-int8

Results: (screenshot)
chrisbward commented 1 year ago

System:

LLaMA 13B working on single RTX 3090 Ti (24GB VRAM)

LLM.int8 fork: https://github.com/tloen/llama-int8 Batch size: 12

Did not have to compile bitsandbytes

Running this bash script;

#!/bin/bash
source ./.venv/bin/activate
TARGET_FOLDER="/media/user/home/llama_models/LLaMA"  
MODEL_SIZE="13B"

CUDA_VISIBLE_DEVICES="0" torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/$MODEL_SIZE --tokenizer_path $TARGET_FOLDER/tokenizer.model --max_batch_size=12

Results:

Loading in 30 seconds, inference takes around 30 seconds;

./run_13b.sh 
Allocating transformer on host
Loading checkpoint 0
Loading checkpoint 1
Loaded in 33.53 seconds with 17.49 GiB
Welcome.
The following conversation took place at Harvard University.
Former Treasurer Secretary Larry Summers invited Ray Dalio, the founder, chairman and
co-CIO of Bridgewater Associates, the world's largest hedge fund, to discuss Dalio's unique
views on economics.

Dalio: People are confused between what I'm saying and what I think about Trump or my view
about whether he should be president, so in that context let me say upfront how I think
that [Trump] should not be president because I don't think he has the right temperament,
I don't think he knows enough about policy issues and if something goes wrong there will be a
long series of things going wrong and there won't be anybody guiding it through those
difficult problems; it will make bad stuff happen. So we have to work with whoever gets elected.
But that doesn't mean you can't fight for your interests within whatever political system is put
in place. I'm very much against Bernie Sanders.
Summers: Yeah, well, I agree with that too. I think there are serious grounds for concern about
the policies that would emerge from a Sanders administration as opposed to a Clinton
administration or an Obama administration. Just one more thing about that sort of question,
which is also highly relevant to Europe--is that all kinds of interesting economic agendas come
out of movements that have strong antipathy toward globalization but that come out of them in
ways which, if they were implemented, would actually make them worse off than they currently
are rather than better off. And so understanding why people feel anti-globalization and anti-free
trade to begin with seems like it ought to be more valuable, better deliverable service to offer
to public opinion around the world than just being able to understand it as a matter of your
expertise and then help figure out better ways to think about trade.
Dalio: One example of this was Donald Trump's proposal for what you would do about China
cheating. It was emotionally appealing, it got him support among people who felt like cheated by
China's actions, but in fact, his proposed solution would be very harmful to American workers
and consumers as well as Chinese workers and consumers. And he said

==================================
alcidesmig commented 1 year ago

System

Results

BryceSchroeder commented 1 year ago

LLaMA 30B using vanilla-llama and full fp16 weights is running on my older GPU compute server: 384 GB RAM, 6 × AMD Instinct MI25 (16 GB each, so 96 GB VRAM total), 2x Intel Xeon E5-2699 v3 @ 2.30GHz, 1 TB NVMe

Not sure how many seconds per token, but it will process a prompt consisting of a dozen or so lines of a chat log and spit out a single line in about 20 seconds. Takes about 60 secs to load 30B into VRAM from SSD.

I have not experimented with int8 65B yet.

CarstenMaul commented 1 year ago

System: Mac Mini M1, 16GB RAM

Runs 7B and 13B 4-bit quantized without problems

Runs 30B extremely slowly, taking minutes for a word to appear