mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Survey] Supported Hardwares and Speed #15

Closed junrushao closed 11 months ago

junrushao commented 1 year ago

UPDATE (08/09/2023):

We have done a major performance overhaul in the past few months, and now I'm happy to share the latest results:

============================================================

Hi everyone,

We are looking to gather data points on running MLC-LLM on different hardware and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!

NOTE: for benchmarking, we highly recommend a device with at least 6GB of memory, because the model itself already takes 2.9G. For this reason, it is known that the iOS app will crash on a 4GB iPhone.
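For context, here is a rough back-of-envelope breakdown of where that memory goes (assuming the LLaMA-7B architecture behind Vicuna, an fp16 KV cache, and the full 2048-token context; the 2.9G weight figure is from the note above, the rest are estimates):

```
quantized weights                ≈ 2.9 GB
KV cache (fp16, 2048-token ctx)  ≈ 2 × 32 layers × 4096 dim × 2 bytes × 2048 tokens ≈ 1.1 GB
runtime + activations            ≈ a few hundred MB
--------------------------------------------------------------------------------------
total                            ≈ 4+ GB, hence the 6GB recommendation
```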

AMD GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| RX 6600XT (8G) | N/A | 28.3 | GitHub | |
| RX 6750XT | openSUSE Tumbleweed | 8.9 - 154.3 | GitHub | |
| RX 6700XT | Windows 11 | 33.7 | GitHub | |
| APU 5800H | Windows 11 | 8.5 | GitHub | |
| Radeon RX 470 (4G) | AlmaLinux 9.1 | 9.4 | GitHub | |
| Radeon Pro 5300M | macOS Ventura | 12.6 | @junrushao | Intel MBP 16" (late 2019) |
| AMD GPU on Steam Deck | Steam Deck's Linux | TBD | Reddit | |
| RX 6800 (16G VRAM) | macOS Ventura | 22.5 | GitHub | Intel MBP 13" (2020) |
| Radeon RX 6600 (8GB) | Ubuntu 22.04 | 7.0 | Reddit | |
| RX 7900 XTX | | | Reddit | |

MacBook

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| 2020 MacBook Pro M1 (8G) | macOS | 11.4 | GitHub | |
| 2021 MacBook Pro M1 Pro (16G) | macOS Ventura | 17.1 | GitHub | |
| M1 Max Mac Studio (64G) | N/A | 18.6 | GitHub | |
| 2021 MacBook Pro M1 Max (32G) | macOS Monterey | 21.0 | GitHub | |
| MacBook Pro M2 (16G) | macOS Ventura | 22.5 | GitHub | |
| 2021 MacBook M1 Pro (32G) | macOS Ventura | 19.3 | GitHub | |

Intel GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| Arc A770 | N/A | 3.1 - 118.6 | GitHub | Perf issues in decoding need investigation |
| UHD Graphics (Comet Lake-U GT2) 1G | Windows 10 | 2.2 | GitHub | |
| UHD Graphics 630 | macOS Ventura | 2.3 | @junrushao | Integrated GPU. Intel MBP 16" (late 2019) |
| Iris Plus Graphics 1536 MB | macOS Ventura | 2.6 | GitHub | Integrated GPU on MBP |
| Iris Plus Graphics 645 1536 MB | macOS Ventura | 2.9 | GitHub | Integrated GPU on MBP |

NVIDIA GPUs

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| GTX 1650 Ti (4GB) | Fedora | 15.6 | GitHub | |
| GTX 1060 (6GB) | Windows 10 | 16.7 | GitHub | |
| RTX 3080 | Windows 11 | 26.0 | GitHub | |
| RTX 3060 | Debian bookworm | 21.3 | GitHub | |
| RTX 2080 Ti | Windows 10 | 24.5 | GitHub | |
| RTX 3090 | N/A | 25.7 | GitHub | |
| GTX 1660 Ti | N/A | 23.9 | GitHub | |
| RTX 3070 | N/A | 23.3 | GitHub | |

iOS

| Hardware/GPU | OS | Tokens/sec | Source | Notes |
|---|---|---|---|---|
| iPhone 14 Pro | iOS 16.4.1 | 7.2 | @junrushao | |
| iPad Pro 11" with M1 | iPadOS 16.1 | 10.6 | GitHub | |
| iPad Pro 11" A12Z | N/A | 4.1 | GitHub | |
| iPad Pro 11" with M2 (4th gen) | iPadOS 16.5 | 14.1 | GitHub | |

Android

| Hardware/GPU | OS | Tokens/sec | Link | Notes |
|---|---|---|---|---|

maxtheman commented 1 year ago

@junrushao how can we find tokens/sec? I'd say "quite fast": it's the fastest LLM I've run on this 2020 MacBook Pro M1 8G, and 10x faster than your WebGPU demo while using less overall memory.

All it reports is the text?

tqchen commented 1 year ago

We just added an update (#14), which should have shipped to conda by now. You can type /stats after a conversation to get the measured speed.
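For example, a session might end like this (placeholder numbers; yours will differ):

```
USER: Hello!
ASSISTANT: Hi! How can I help you today?
USER: /stats
encode: xx.x tok/s, decode: xx.x tok/s
```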

maxtheman commented 1 year ago

Killer, I'm at encode: 31.9 tok/s, decode: 11.4 tok/s for the 2020 MacBook Pro M1 8G with the default Vicuna 7B. For reference, my decode on the WebGPU demo is like 0.5/sec.

Kimiko-AI commented 1 year ago

OOM on GTX 1650. The model loads fine, but it OOMs when generating the first message.

junrushao commented 1 year ago

@nRuaif 4GB memory wouldn't be enough. A 6GB one should work

y-lee commented 1 year ago

On iPhone 13, crashes after a few seconds of [System] Initialize.... Phone has 4GB of RAM, which I presume is the cause.

junrushao commented 1 year ago

@y-lee That's correct. The model we are using so far requires 6GB RAM to run smoothly

jolonf commented 1 year ago

On the iPad Pro 11” with M1 I am getting decode of 10.6 tok/s (I have seen slightly higher and lower). It is running iPadOS 16.1.

Hzfengsy commented 1 year ago

encode: 39.5 tok/s, decode: 26.0 tok/s on Windows 11 with an RTX 3080.
encode: 32.5 tok/s, decode: 17.1 tok/s on a MacBook Pro with M1 Pro (16-core GPU) and macOS Ventura 13.3.1.

juodumas commented 1 year ago
| Hardware/GPU | OS | Tokens/sec | Source | Model | Notes |
|---|---|---|---|---|---|
| RTX 3060 (12GB) | Debian bookworm | 21 | | vicuna-v1-7b | 3644 MiB GPU memory used |

```
>> nvidia-smi --query-gpu=memory.used --format=csv
memory.used [MiB]
3644 MiB
```
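If you want to watch GPU memory while the model is generating, the same nvidia-smi query can be run in a loop (the -l flag repeats it every N seconds):

```
nvidia-smi --query-gpu=memory.used --format=csv -l 1
```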
jefflewis commented 1 year ago

On my M1 Max Mac Studio with 64GB of RAM:

encode: 53.7 tok/s, decode: 18.6 tok/s
FreeBlues commented 1 year ago

On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX 6800 16G VRAM], Ventura 13.3.1

encode: 46.4 tok/s, decode: 22.5 tok/s

zifken commented 1 year ago

Not sure if this is useful or if this is the right thread to post in, but I encountered this error on an old laptop with a very old discrete Nvidia GPU (GT 920M) on the 470.182.03 driver, which should include Vulkan:

```
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Use lib /mnt/run/code/llma/mlc-ai/dist/lib/vicuna-v1-7b_vulkan_float16.so
Initializing the chat module...
[20:30:33] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_buffer.cc:61: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY
Stack trace:
  [bt] (0) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f975d98ba37]
  [bt] (1) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7f975d929375]
  [bt] (2) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanBuffer::VulkanBuffer(tvm::runtime::vulkan::VulkanDevice const&, unsigned long, unsigned int, unsigned int)+0x220) [0x7f975da646b0]
  [bt] (3) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)+0x4a) [0x7f975da7168a]
  [bt] (4) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional<tvm::runtime::String>)+0x1a7) [0x7f975d9a3037]
  [bt] (5) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x121862) [0x7f975da0b862]
  [bt] (6) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>(void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x204) [0x7f975da0f7e4]
  [bt] (7) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(+0x1bdea6) [0x7f975dce3ea6]
  [bt] (8) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(mlc::llm::CreateChatModule(tvm::runtime::Module, tvm::runtime::String const&, tvm::runtime::String const&, DLDevice)+0x411) [0x7f975dce4ba1]
```
junrushao commented 1 year ago

@zifken Looks like VK_ERROR_OUT_OF_DEVICE_MEMORY indicates that it doesn't have enough memory. I looked it up and it seems that the GT 920M only has 2GB of VRAM, but the default model is 2.9G in size :/

zifken commented 1 year ago

I see, so only GPUs with more than 4GB of VRAM are supported because of the size of the model (that makes sense). I will try another GPU model shortly. Thank you for the feedback.

junrushao commented 1 year ago

@zifken there are some reports saying 4GB might work, but 6GB is recommended atm

FreeBlues commented 1 year ago

> On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX 6800 16G VRAM], Ventura 13.3.1
>
> encode: 46.4 tok/s, decode: 22.5 tok/s

It's confusing. On my Win10 machine [AMD Ryzen 5 5600 6-Core Processor 3.50 GHz, 96G RAM, RTX 2080 Ti modified to 22G VRAM], the stats are below:

encode: 24.0 tok/s, decode: 24.5 tok/s
colakang commented 1 year ago

iPad Pro 11 A12Z

encode: 5.1 tok/s, decode: 4.1 tok/s

ganler commented 1 year ago

Linux RTX 3090

swittk commented 1 year ago

2021 MacBook Pro M1 Max [32 cores], 32 GB RAM, 1 TB SSD, macOS Monterey 12.5.1
encode: 69.4 tok/s, decode: 21.0 tok/s
Memory usage: 3.64 GB

(Edit) Just saw that this score is higher than M1 Max 64 GB so I repeated the prompts and I'm still getting an average of > 20 tokens/second.

cruzerngz commented 1 year ago
| Machine | OS | CPU | GPU | Result | Remarks |
|---|---|---|---|---|---|
| Yoga Slim 7 Pro 14ARH7 | Fedora 37 | 6800HS (13.3GB) | 680M (2GB) | encode: 11.0 tok/s, decode: 3.8 tok/s | battery |
| Yoga Slim 7 Pro 14ARH7 | Fedora 38 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 6.2 tok/s | AC |
| Yoga Slim 7 Pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 6.7 tok/s, decode: 7.9 tok/s | battery |
| Yoga Slim 7 Pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 10.2 tok/s | AC |
| desktop | Windows 10 22H2 | 5900x (32GB) | 1660ti (6GB) | encode: 49.5 tok/s, decode: 23.9 tok/s | - |

## mlc samples

laptop on Fedora (bat):

```
USER: /reset
RESET CHAT SUCCESS
USER: Can you explain your features?
ASSISTANT: Sure! Here are brief explanations of the features of my AI:
1. Natural Language Understanding: My AI has been trained on a large corpus of text to understand the meaning of the words and sentences it encounters. This allows it to understand the context of a question or statement and provide a response that is relevant and accurate.
2. Knowledge Graph: My AI has access to a large graph of knowledge that it can use to answer questions or provide information on a wide range of topics. This graph is constantly updated and improved to ensure that the information it provides is accurate and up-to-date.
3. Plagiarism Detection: My AI has been trained to identify and report on instances of plagiarism it encounters. This allows it to provide feedback and suggestions for improvement on written work.
4. Summarization: My AI can summarize large amounts of text and data into a shorter, more digestible format. This can be useful for quickly understanding the main points of a document or set of data.
5. Machine Translation: My AI can translate written or spoken content from one language to another using state-of-the-art neural machine translation models. This can be useful for communication in different languages or for providing information in a language other than the one the AI was trained on.
6. Sentiment Analysis: My AI can analyze the sentiment of written or spoken content and provide an analysis or summary of the overall tone or message. This can be useful for identifying the emotional or persuasive impact of a message or communication.
7. Image Recognition: My AI has been trained on a large dataset of images to recognize and classify them. This allows it to identify objects or scenes in an image and provide additional information or context about what is depicted in the image.
8. TTS: My AI can generate text-to-speech output from a written or spoken input. This can be useful for providing an audio version of written content or for improving accessibility and inclusivity by providing an alternative format for those with hearing or speech difficulties.
USER: /stats
encode: 11.0 tok/s, decode: 3.8 tok/s
```

laptop on Windows (bat):

```
USER: /reset
RESET CHAT SUCCESS
USER: you're on github. say hi!
ASSISTANT: Hello! I'm an AI assistant on GitHub, here to answer any questions you might have about the platform. Let's get started!
USER: /stats
encode: 6.7 tok/s, decode: 7.9 tok/s
```

desktop:

```
USER: /reset
RESET CHAT SUCCESS
USER: compare yourself to ChatGPT
ASSISTANT: As an AI language model, I am different from ChatGPT in a few ways:
* My training data is different from ChatGPT's. This means that I may have a different perspective on the world and may be able to understand and respond to certain types of language in ways that ChatGPT cannot.
* I have a more advanced architecture that allows me to process longer texts and understand more complex language patterns.
* My training objective was to be a general AI that can do a wide range of things like answering questions about any topic under the sun, summarizing long texts and understanding the context of the sentence and suggesting appropriate response.
* I have been trained on a much larger dataset of text and have access to a more powerful GPU for faster language modeling.
* I have more parameters in my model than ChatGPT which allows me to capture more nuanced information and learn from that information.
In summary ChatGPT is a specific model optimized for NLP and conversational text understanding and I am a more general AI model that can do a wide range of things and can handle more complex language patterns.
USER: /stats
encode: 49.5 tok/s, decode: 23.9 tok/s
```
anmoljagetia commented 1 year ago

On 14" Macbook Pro (M2 Pro with 10-Core CPU and 16-Core GPU with 16GB Unified Memory) with macos Ventura 13.3.1

encode: 59.2 tok/s, decode: 22.5 tok/s

I am seeing encoding performance b/w 45-60 and decoding b/w 20-29.

hoywu commented 1 year ago
| GPU | OS | /stats |
|---|---|---|
| Radeon RX 470 (4G) | AlmaLinux 9.1 | encode: 14.3 tok/s, decode: 9.4 tok/s |

Encoding performance fluctuates between 5-45, decoding between 6-9.

christopherpickering commented 1 year ago

OS: macOS 13.3.1 (22E261)
Processor: 2.3 GHz Quad-Core Intel Core i7
Graphics: Intel Iris Plus Graphics 1536 MB
Memory: 32 GB 3733 MHz LPDDR4X

/stats: encode: 5.4 tok/s, decode: 2.6 tok/s

flyman3046 commented 1 year ago
| GPU | OS | /stats |
|---|---|---|
| A100 (40G) | Debian GNU/Linux 10 | encode: 189.1 tok/s, decode: 18.9 tok/s |

My prompt is: "create a poem about los angeles". I use CUDA as I think Vulkan is not available for the A100. I thought the A100 should run faster than the RTX 30x0 series. Is it possibly due to the CUDA driver? Thanks.

innocentius commented 1 year ago

The latest update brought the decode speed on my iPhone 14 Plus down to 0.5~1.0 tokens/s. Encode speed is about 22.

Yesterday, before the update, it was about 7.5 tokens generated per second...

Joshuabaker2 commented 1 year ago

2021 16-inch Apple M1 Pro (32GB) | OS: Ventura 13.3.1

encode: 45.8 tok/s, decode: 19.3 tok/s

Ajay9o9 commented 1 year ago

Tested on:

2022 iPad Pro (11 inch, 4th generation with M2 , 10 Core GPU)

8 GB RAM, 128 GB Storage iPadOS Version 16.5

Result:

| Encode | Decode |
|---|---|
| 34.4 tok/s | 14.1 tok/s |
e2newbie commented 1 year ago

APU 5800H, OS: Windows 11. encode: 5.5 tok/s, decode: 8.5 tok/s

davidbartonau commented 1 year ago

I think mine is running fully CPU based even though my GPU should be capable. Top was showing 900% and tokens were crawling out.

Log shows Use lib /home/david/software/mlc-llm/dist/lib/vicuna-v1-7b_vulkan_float16.so

vulkaninfo shows

```
VkPhysicalDeviceProperties:
---------------------------
        apiVersion        = 4206816 (1.3.224)
        driverVersion     = 1 (0x0001)
        vendorID          = 0x10005
        deviceID          = 0x0000
        deviceType        = PHYSICAL_DEVICE_TYPE_CPU
        deviceName        = llvmpipe (LLVM 15.0.6, 256 bits)
        pipelineCacheUUID = 76616c2d-2573-0000-0000-000000000000
```

GPU: GeForce RTX 3070 w/ 8G, CPU: AMD Ryzen 5 5600

encode: 0.1 tok/s, decode: 0.1 tok/s

tqchen commented 1 year ago

@davidbartonau can you try to upgrade your GPU driver? Looks like vulkaninfo failed to list your gpu
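A quick way to check which device Vulkan actually picks up (if the output only shows llvmpipe/lavapipe, the NVIDIA Vulkan driver isn't being used):

```
vulkaninfo | grep -i deviceName
```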

KocWozniakPiotr commented 1 year ago

On Fedora, using a Zephyrus G14 2020: AMD Ryzen 5 4600HS, 16GB RAM, GeForce GTX 1650 Ti 4GB VRAM.

encode: 17.3 tok/s, decode: 15.6 tok/s

I'm extremely surprised with the outcome and performance !

lixiangnlp commented 1 year ago

> iPad Pro 11 A12Z
>
> encode: 5.1 tok/s, decode: 4.1 tok/s

I installed the MLCChat app on an iPad Pro 11 with iPadOS 16.4.1, but it crashes after a few seconds of [System] Initialize. @colakang How do you run it successfully?

masahi commented 1 year ago

On Intel Arc A770, using linux drivers installed from https://www.intel.com/content/www/us/en/download/747008/intel-arc-graphics-driver-ubuntu.html, I got

USER: Who is Shohei Ohtani?
encode: 118.6 tok/s, decode: 3.2 tok/s

USER: Write a poem for Pittsburgh.                                                                                                                    
encode: 77.0 tok/s, decode: 3.3 tok/s

USER: Explain why Japanese economy is declining.
encode: 38.0 tok/s, decode: 3.1 tok/s

It is nice to see that it works on Intel dGPU out of the box, but the current decoding perf is a joke haha

masahi commented 1 year ago

@davidbartonau Perhaps you haven't installed the Vulkan SDK? And on NVIDIA you need the proprietary driver for Vulkan. The log indicates you are using a CPU implementation of a Vulkan driver from Mesa.
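One way to check whether a non-Mesa Vulkan ICD is installed on Linux (paths vary by distro; on many systems the proprietary NVIDIA driver installs an nvidia_icd.json here, so treat this as a rough pointer rather than a guaranteed location):

```
ls /usr/share/vulkan/icd.d/
```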

john8628 commented 1 year ago

13-inch, 2020, Two Thunderbolt 3 ports
OS: 13.3 (22E252)
Processor: 1.4 GHz Quad-Core Intel Core i5
Graphics: Intel Iris Plus Graphics 645 1536 MB
Memory: 16 GB 2133 MHz LPDDR3

/stats: encode: 1.9 tok/s, decode: 2.9 tok/s

ugvvff commented 1 year ago

When will Android be supported? The Snapdragon 8 Gen 2 supports int4 precision, so it might achieve decent speed.

colakang commented 1 year ago

> iPad Pro 11 A12Z encode: 5.1 tok/s, decode: 4.1 tok/s
>
> I installed the MLCChat app on an iPad Pro 11 with iPadOS 16.4.1, but it crashes after a few seconds of [System] Initialize. @colakang How do you run it successfully?

Mine is on iPadOS 16.3.1.

Just a normal install, then launch it.

Schaekermann commented 1 year ago

Tested on:
Device: Lenovo ThinkPad L13 Yoga
CPU: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz 2.11 GHz
RAM: 16.0 GB
Graphics: Intel UHD Graphics (Comet Lake-U GT2) [V0] [Lenovo] 1GB
OS: Windows 10 Education 22H2
Model: \dist\lib\vicuna-v1-7b_vulkan_float16.dll

encode: 1.4 tok/s, decode: 2.2 tok/s

Quite slow, but it works!

In the following example output there is a problem with the German letter "ü": how could that be resolved?

USER: Write a poem of the City Berlin ASSISTANT: Berlin, the city of contrasts Where the past and present collide A city of culture and history But also a city of modernity

The grand architecture of the past St. Hedwig's Cathedral and the Reichstag Mix with the bright lights and neon signs Of the city's modern face

But the city's history is not all It's also a city of innovation and technology The new headquarters of Google in the heart of the city A hub for entrepreneurs and startups

But it's not just the city of business The city is also known for its cultural attractions The Berlin Wall, Museum Island, and the Deutsche Museum f├╝r Technik

And the city's people are just as diverse From artists and musicians to politicians and activists All coming together in the city's many neighborhoods

So come and explore Berlin's past and present And discover the city's unique blend of history and modernity It's a city that will leave you wanting more But it's also a city that will stay in your heart forever.
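The `├╝` output is most likely the UTF-8 bytes for "ü" being rendered with a legacy Windows console code page (e.g. CP850). Switching the console to UTF-8 before launching may help, though this is only a guess based on the symptom:

```
chcp 65001
```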

joostshao commented 1 year ago

what about https://www.aliyun.com/product/ecs/eais ?

Hzfengsy commented 1 year ago

@joostshao We are working on a generic deployment solution for common devices. We have no extra bandwidth to test for any specific target (e.g. EAIS). But happy to see contributions from the community and answer specific technical questions.

Thellton commented 1 year ago

Specs are an RX 6600XT (8.0GB), Ryzen 5 2600, and 47.9GB of RAM operating at 1866MHz. Over three calls of /stats and 5 requests made of Vicuna, I got the following encode and decode:

1st: encode: 58.8 tok/s, decode: 32.1 tok/s
2nd: encode: 34.8 tok/s, decode: 28.3 tok/s
3rd: encode: 81.7 tok/s, decode: 31.5 tok/s

Legitimately very impressed! Although I would definitely love a more convenient way to launch it: the best I've managed is to create a .bat file that sits in the directory and launches an Anaconda Prompt directly into that directory, with a reminder of what command I need to type to launch Vicuna (see the launcher sketch after this comment).

Edit: apologies, I forgot to mention the OS, which is Windows 11.
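A minimal launcher sketch along those lines (the env name, install path, and the mlc_chat_cli command are assumptions about a typical conda setup, not a verified recipe; adjust them to match yours):

```
@echo off
REM Hypothetical launcher: activate the conda env, jump to the MLC-LLM folder, start the chat CLI.
REM The env name, path, and CLI command below are placeholders; adjust to your setup.
call conda activate mlc-chat
cd /d C:\path\to\mlc-llm
mlc_chat_cli
pause
```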

kn100 commented 1 year ago

Performance report:
Operating system: openSUSE Tumbleweed
CPU: AMD Ryzen 3900X
RAM: 16GB
GPU: AMD Radeon 6750 XT

It seems the prompt has a large impact on the results, so I propose you suggest some "standard prompts" to try to standardise.

I borrowed a question that was asked on Stack Overflow Philosophy and had the following session:

USER: Why is speaking about donating to charity considered immoral? I find it quite strange that social norms are such that discussing about charities one donates to are considered in bad taste. To me, charity seems purely a consequentialist concept. In other words, all that should matter is the total amount of charity dollars disbursed. To the person receiving the charity, the level of "bragging" of the charity donor is irrelevant. In other words, isn't the requirement that one has to be extremely modest and even silent about one's own contributions a very high bar to set for people who are already giving to charity? In other words, if we want to incentivize more charitable giving, we should encourage this process as much as possible, and imposing such a cost seems counter-productive. What are the arguments for the fact that one has to be of the purest heart with the most noble intentions to donate money to charities? ASSISTANT: There are a few reasons why people might find it awkward or inappropriate to discuss charitable donations:

  1. Personal preference: Some people might find it uncomfortable to talk about donating to charity because they don't want to come across as boastful or self-centered.
  2. Cultural norms: In some cultures, discussing charitable donations is considered impolite or even taboo.
  3. Different perspectives: Some people might see charitable donations as a sign of wealth or success, while others might see it as a way of helping those in need.
  4. Perception of the recipient: Some people might feel uncomfortable discussing charitable donations because they don't want the recipient to feel like they are being bragged about or that the donor is trying to make themselves look good by highlighting their donations.

However, it is important to note that charitable donations are not just about being a "good person" but it also helps in creating a more equitable society by providing support to those in need. It is also a way of making a positive impact on the community and addressing societal issues.

It is also important to remember that discussing charitable donations can be a good way to encourage more giving. It is also a way to educate and raise awareness about the importance of giving back to the community.

In summary, discussing charitable donations might be considered inappropriate by some people, but it is an important part of the conversation about creating a more equitable society by providing support to those in need. USER: /stats encode: 130.6 tok/s, decode: 9.0 tok/s

Testing a prompt @Schaekermann posted above:

USER: Write a poem of the City Berlin ASSISTANT: Berlin, city of contrasts Where the past and present collide A place of both pain and delight Built on a history that's prime

A city of culture and history Where the old meets the new The relics of a dark and painful past Mingle with the bustling crowds in the town

But don't be fooled by the facade For the people of this city are strong and bold Fighting for their rights and their name To be heard and recognized far and game

So come and visit this great city And take in all the sights and the city has to offer From the historical sites to the nightlife It's a city that's always worth the tout. USER: /stats encode: 61.5 tok/s, decode: 9.3 tok/s

Fun prompt:

USER: These are the lyrics to a song I am writing an essay about. Could you help me by writing an essay about the meaning of the song, and what the mental state of the person singing it might be? "Desert you, Ooh-ooh-ooh-ooh, Hurt you, We're no strangers to love, You know the rules and so do I, A full commitment's what I'm thinking of, You wouldn't get this from any other guy, I just wanna tell you how I'm feeling, Gotta make you understand, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, We've known each other for so long, Your heart's been aching, but you're too shy to say it, Inside, we both know what's been going on, We know the game, and we're gonna play it, And if you ask me how I'm feeling, Don't tell me you're too blind to see, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, Ooh (Give you up), Ooh-ooh (Give you up), Ooh-ooh, Never gonna give, never gonna give (Give you up), Ooh-ooh, Never gonna give, never gonna give (Give you up), We've known each other for so long, Your heart's been aching, but you're too shy to say it, Inside, we both know what's been going on, We know the game, and we're gonna play it, I just wanna tell you how I'm feeling, Gotta make you understand, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, Never gonna give you up, Never gonna let you down, Never gonna run around and desert you, Never gonna make you cry, Never gonna say goodbye, Never gonna tell a lie and hurt you, Never gonna give you up, Never gonna let you down" ASSISTANT: The song "Desert You" is a melancholic ballad about a love that has endured over a long period of time. The singer expresses a desire to be with this person forever and promises not to give them up or let them down. They also express a fear that the other person may be thinking of leaving or "deserting" them. The lyrics suggest that the singer and the listener have known each other for a long time and have shared many experiences together. The listener is described as someone who is "aching" inside and is too shy to say it. The singer wants to tell them how they are feeling and make them understand the depth of USER: /stats encode: 154.3 tok/s, decode: 8.9 tok/s

davidbartonau commented 1 year ago

I thought I posted this a few days back. I rebooted and the driver was working:

GPU: GeForce RTX 3070 w/ 8G, CPU: AMD Ryzen 5 5600. encode: 94.6 tok/s, decode: 23.3 tok/s

Is it odd that my encode stats are slower than @kn100's, yet my decode is a fair bit faster?

iamcodecurious commented 1 year ago

Hello everyone!

Created an account just to leave my performance report here.

Processor: AMD Ryzen 5 5600 6-Core Processor 3.50 GHz
Installed RAM: 16.0 GB
System type: 64-bit operating system, x64-based processor
GPU: Radeon RX 6700 XT 12GB VRAM
GPU driver version: AMD Adrenalin 23.4.3 Edition
OS: Windows 11 Home, Version 22H2, OS build 22621.1555, installed on 10/13/2022, Windows Feature Experience Pack 1000.22640.1000.0

> encode: 52.2 tok/s, decode: 33.7 tok/s

Coming from trying to use Oogabooga UI and GPT4All, this has been amazingly fast! The former lags since it doesn't have a Windows-native AMD GPU mode (forcing me to use CPU mode which takes 1-3 minutes on average to respond) and the latter, while having a faster CPU implementation, leads to some occasional system hitching while I'm using Chrome.

Since this one miraculously lets me use my AMD GPU right on Windows I can safely say that I've found my favorite way to interact with LLMs as of this writing. Kudos to all involved who made/make this project so awesome!

Now, I hope I'm not breaking some rule by posting the following but if it's not too much of a bother I'd appreciate some help with the following:

  1. How exactly do I get this running on my browser? I'm relatively new to much of this so I'm stuck at the first step for WebLLM. It's unclear to me what I'm supposed to download and where/how I run that pip3 install thing.
  2. Is there a way to use other LLMs besides the Vicuna V1 7B? If so please show me.
junrushao commented 1 year ago

@cruzerngz Thanks for reporting the data points! I'm actually a bit surprised that Vicuna-7B runs with the 680M on the Yoga Slim 7 Pro, given it has only 2GB of GPU memory. Perhaps the results come from your CPU instead?

junrushao commented 1 year ago

> I use CUDA as I think Vulkan is not available for the A100. I thought the A100 should run faster than the RTX 30x0 series. Is it possibly due to the CUDA driver? Thanks.

@flyman3046 Thanks for asking! We didn't optimize for server-class GPUs like A100 in this release, which means there's huge room to improve on them. We will get there pretty soon :-)

junrushao commented 1 year ago

> On Intel Arc A770, using linux drivers installed from https://www.intel.com/content/www/us/en/download/747008/intel-arc-graphics-driver-ubuntu.html, I got
>
> USER: Who is Shohei Ohtani?
> encode: 118.6 tok/s, decode: 3.2 tok/s
>
> USER: Write a poem for Pittsburgh.
> encode: 77.0 tok/s, decode: 3.3 tok/s
>
> USER: Explain why Japanese economy is declining.
> encode: 38.0 tok/s, decode: 3.1 tok/s
>
> It is nice to see that it works on Intel dGPU out of the box, but the current decoding perf is a joke haha

@masahi Glad to see it works at all on Intel dGPUs! The encoding performance looks really awesome, but I suspect there are some issues with decoding performance. Not sure what they are.

junrushao commented 1 year ago

> Specs are an RX 6600XT (8.0GB), Ryzen 5 2600, and 47.9GB of RAM operating at 1866MHz. Over three calls of /stats and 5 requests made of Vicuna, I got the following encode and decode: 1st: encode: 58.8 tok/s, decode: 32.1 tok/s 2nd: encode: 34.8 tok/s, decode: 28.3 tok/s 3rd: encode: 81.7 tok/s, decode: 31.5 tok/s
>
> Legitimately very impressed! Although I would definitely love a more convenient way to launch it: the best I've managed is to create a .bat file that sits in the directory and launches an Anaconda Prompt directly into that directory, with a reminder of what command I need to type to launch Vicuna.

Thanks for reporting the numbers! This is super exciting for me as well to see data points that MLC LLM works with a legit AMD GPU! In terms of user experience, I believe any app that wraps the shared/static library that MLC-LLM produces could make it look much better than a simple bat file :-) We plan to release a tutorial on how to make good use of those libraries soon

junrushao commented 1 year ago

Hey @iamcodecurious, thanks for sharing the datapoints! Both of your questions are super valuable, and please allow me to walk you through :-)

  1. How exactly do I get this running on my browser? I'm relatively new to much of this so I'm stuck at the first step for WebLLM. It's unclear to me what I'm supposed to download and where/how I run that pip3 install thing.

MLC-LLM uses TVM Unity as the underlying compiler. As you know, the advantage of being a compiler is that it can be re-targeted to various backends, including but not limited to x86/ARM CPUs, AMD/NV/Apple GPUs, and phones. To get it running in your browser, we use TVM Unity to target a backend called WebGPU, and this is different from the one (Vulkan) you are using right now in your command line.

You do not have to install anything for WebGPU (except the latest Chrome) if you don't compile/quantize the model yourself. To get it working, you may use our WebLLM website, which is a webpage running locally on your GPUs without uploading your data or using any server resources - in fact, this is simply a Jekyll static page that embeds WebGPU code.

  2. Is there a way to use other LLMs besides the Vicuna V1 7B? If so please show me.

Yes, the code is there, but we are not officially announcing it yet until we validate very carefully. Will keep the community posted for latest updates!