jankais3r / LLaMA_MPS

Run LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs.
GNU General Public License v3.0

Trial runs at Llama 7B (success) and 65B (fail) #10

kechan opened this issue 1 year ago

kechan commented 1 year ago

I noticed you are using venv & pip. I assume from your powermetrics output that your torch build is able to take full advantage of the GPU? Apple Silicon is new to me, and I thought you had to use conda-forge packages for that. I just received an M2 Max with 96 GB, so I will try this out and see how much of an improvement it is over the M1.
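For reference, a minimal sanity check (my own snippet, not part of this repo) that a plain pip-installed torch wheel can see the MPS backend; if this prints an `mps` device, conda-forge is not needed:

```python
import torch

# Sanity check that a pip-installed torch wheel can use the
# Apple Silicon GPU via the MPS backend (no conda-forge required).
if torch.backends.mps.is_available():
    x = torch.ones(4, 4, device="mps")
    print("MPS available, tensor on:", x.device)  # -> mps:0
else:
    # Distinguishes "wheel built without MPS" from "no MPS device found".
    print("MPS built into this wheel:", torch.backends.mps.is_built())
```

You can also watch live GPU utilization during inference with `sudo powermetrics --samplers gpu_power`, which appears to be what the powermetrics output mentioned above shows.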

kechan commented 1 year ago

Llama 7B

I am able to run the 7B model using the exact same prompt:

```
Checkpoint converted - feel free to delete the original '.pth' file (while keeping the 'arrow' folder)
Seed: 44332
Loading checkpoint
Loaded in 4.84 seconds
Running the raw 'llama' model in an auto-complete mode.
Enter your LLaMA prompt: Facebook is bad, because
Thinking...
it’s owned by Mark Zuckerberg. He has a history of making anti-American comments. Sorry to break it to you, but the United States isn’t in charge of Facebook anymore. The social network was sold to an investment firm called DST Global back in 2012, and since then, Facebook has been spun off into its own public company. Facebook, now controlled by shareholders who are mostly Americans, will be subject to U.S. laws and regulations from now on. And so will its new WhatsApp subsidiary, which Facebook bought earlier this year. Zuckerberg might have some valid concerns about how Facebook handles private information. But he should at least take a look at his own privacy policy before calling for more restrictions on others.
```

Inferred in 30.87 seconds

It is quite fast. (Although the completion was pretty negative and completely false; maybe that is a limitation of the 7B model?)

kechan commented 1 year ago

Llama 65B

Resharding failed: the process was killed, likely due to OOM, after peaking at over 180 GB while processing layer 38.

Is there a way to convert this without running out of memory?
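I don't know this repo's reshard internals, but here is a generic memory-friendlier sketch (my own, under stated assumptions, not the repo's script): with torch >= 2.1 you can memory-map each of the 65B checkpoint's 8 shards so weights page in lazily, and merge one tensor at a time instead of holding all shards resident:

```python
import gc
import torch

# Hypothetical memory-friendly reshard sketch (not this repo's script).
# Assumes torch >= 2.1 so torch.load(..., mmap=True) memory-maps each
# shard and pages weights in lazily, instead of materializing all 8 shards.
SHARD_PATHS = [f"65B/consolidated.{i:02d}.pth" for i in range(8)]  # assumed layout

shards = [
    torch.load(p, map_location="cpu", mmap=True, weights_only=True)
    for p in SHARD_PATHS
]

merged = {}
for key in list(shards[0].keys()):
    parts = [s.pop(key) for s in shards]  # drop shard references as we go
    if parts[0].dim() == 1:
        # 1-D params (e.g. norms) are replicated across shards; keep one.
        merged[key] = parts[0].clone()
    else:
        # Placeholder: a real reshard must pick the concat axis per
        # parameter (row- vs column-parallel weights differ).
        merged[key] = torch.cat(parts, dim=0)
    del parts
    gc.collect()

torch.save(merged, "65B/consolidated.merged.pth")
```

Even so, the merged state dict is one full copy of the weights (roughly 130 GB for 65B in fp16), so on a 96 GB machine the final save will still lean on SSD swap; the mmap trick only avoids paying for the shards on top of that.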