ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

What is the minimum requirement for a significant improvement in performance? #1358

Closed oliverbob closed 11 months ago

oliverbob commented 11 months ago

Hi everyone, I have been trying Ollama across multiple servers with various specs. I also tested it on the highest RAM/CPU package at DigitalOcean. I tested the same on my desktop as well as on my HPE DL380 Gen9 server, which has 64 GB of RAM and the following specs:

lscpu
Architecture:           x86_64
CPU op-mode(s):         32-bit, 64-bit
Byte Order:             Little Endian
Address sizes:          46 bits physical, 48 bits virtual
CPU(s):                 12
On-line CPU(s) list:    0-11
Thread(s) per core:     2
Core(s) per socket:     6
Socket(s):              1
NUMA node(s):           1
Vendor ID:              GenuineIntel
CPU family:             6
Model:                  63
Model name:             Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping:               2
CPU MHz:                1400.000
CPU max MHz:            3200.0000
CPU min MHz:            1200.0000
BogoMIPS:               4794.74
Virtualization:         VT-x
L1d cache:              192 KiB
L1i cache:              192 KiB
L2 cache:               1.5 MiB
L3 cache:               15 MiB
NUMA node0 CPU(s):      0-11
Vulnerability Gather data sampling:  Not affected
Vulnerability Itlb multihit:         KVM: Mitigation: VMX disabled
Vulnerability L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:              Mitigation; PTI
Vulnerability Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:              Not affected
Vulnerability Spec rstack overflow:  Not affected
Vulnerability Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                 Not affected
Vulnerability Tsx async abort:       Not affected
Flags:                  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

However, I don't see any significant improvement over the performance of the Intel i5 desktop:

lscpu
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           39 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  4
On-line CPU(s) list:     0-3
Vendor ID:               GenuineIntel
Model name:              Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
CPU family:              6
Model:                   158
Thread(s) per core:      1
Core(s) per socket:      4
Socket(s):               1
Stepping:                9
CPU(s) scaling MHz:      91%
CPU max MHz:             3500.0000
CPU min MHz:             800.0000
BogoMIPS:                6000.00
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    6 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
  Mds:                   Mitigation; Clear CPU buffers; SMT disabled
  Meltdown:              Mitigation; PTI
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected

All results showed very poor performance on 7B-parameter models, so I began to conclude that I cannot use this in a production environment unless I find a solution. This concerns me a lot because I have production-level implementations built on its API that are mission-critical for my clients. My company is also purchasing servers in the hope that Ollama will reach the required speed, or at least maybe half the speed of ChatGPT (for free users). We have several Modelfile implementations for each client, but all of them frustrate our clients; they are losing hope and are very, very upset about the situation. I hope it will not end with them suing my company. So I came here hoping to save face.

My question:

How can we possibly improve the performance of Ollama, and what is the minimum required hardware? Would an upgrade to the latest HPE DL380a Gen11 bring a big enough increase to reach half the performance of OpenAI's ChatGPT? For instance, if I fill all its memory, processor, and GPU slots to maximum capacity, will that solve this issue? If it will not, what SPECIFIC HARDWARE is PROVEN COMPATIBLE without performance issues?

I like Ollama's simplicity of interfacing with the API. Are there any "live" samples we can check performance against? Is there any Ollama API HOSTED online with PROVEN PERFORMANCE that I could use, even if it's PAID (just as a temporary fix), while we're looking for solutions?

Will running it on a GPU-based cloud solution like AWS, GCP, or Azure be worth the investment, given that our clients demand at least half the speed of ChatGPT?

Or to simplify my question, what is the minimum required and TESTED hardware configuration to compete with the response speed of ChatGPT?

Any help from anyone on this active community will be appreciated.

Thank you very much.

easp commented 11 months ago

I don't know what "1/2 ChatGPT performance" is, especially lately. You should be working with specific numbers rather than your clients' feels.

There is a thread on the Ollama Discord with some benchmarks on various physical and cloud hosts.

LLM performance is heavily dependent on, and scales roughly linearly with, available memory bandwidth. That bandwidth is usually acquired in the form of NVIDIA GPU cards, not via motherboard and CPU RAM channels.
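To put rough numbers on that: generating each token streams essentially the whole set of weights through memory, so bandwidth divided by model size gives an optimistic upper bound on tokens per second. Here's a back-of-envelope sketch in Python; the bandwidth figures and the ~4 GB size for a 4-bit-quantized 7B model are ballpark assumptions, not measurements:

# Back-of-envelope: tokens/sec is roughly bounded by
# memory_bandwidth / bytes_read_per_token, and each token reads
# approximately the full set of model weights.
# All figures below are assumed ballpark values, not measurements.

MODEL_SIZE_GB = 3.8  # ~7B parameters at 4-bit quantization (approximate)

systems = {
    "dual-channel DDR4 desktop (~35 GB/s)": 35,
    "quad-channel DDR4 server (~60 GB/s)": 60,
    "NVIDIA RTX 4080 GDDR6X (~700 GB/s)": 700,
}

for name, bandwidth_gb_s in systems.items():
    upper_bound_tps = bandwidth_gb_s / MODEL_SIZE_GB
    print(f"{name}: ~{upper_bound_tps:.0f} tokens/s upper bound")

Real throughput lands well below those bounds, but the ratio between CPU RAM bandwidth and GPU VRAM bandwidth is what explains the gap between your servers and a GPU box.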

oliverbob commented 11 months ago

Hi easp, thanks for your quick response. By 1/2 the speed of OpenAI's ChatGPT, I mean the response time it takes the physical server hosting Ollama to answer a JSON request on, say, 0.0.0.0:11434.

This is because Ollama's responses across my servers are very, very slow, and I'm looking for solutions. I'll try to log into Discord to look for the discussion you referenced.
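For reference, here is roughly how I'm measuring it on my side, as a minimal sketch against the local API (the model name and prompt are just placeholders I picked). The final /api/generate response includes eval_count and eval_duration, so tokens per second can be computed directly:

# Minimal sketch: time one generation on a local Ollama server and report
# tokens/second from the metrics in the /api/generate response.
# Model name and prompt are placeholders; durations are in nanoseconds.
import json
import urllib.request

payload = {
    "model": "mistral",   # placeholder: any 7B model that has been pulled
    "prompt": "Why is the sky blue?",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

eval_seconds = result["eval_duration"] / 1e9
tokens_per_second = result["eval_count"] / eval_seconds
print(f"{result['eval_count']} tokens in {eval_seconds:.2f}s "
      f"({tokens_per_second:.1f} tokens/s)")

Running this a few times gives concrete tokens-per-second numbers to compare across machines instead of arguing about "half of ChatGPT".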

technovangelist commented 11 months ago

Thanks for submitting the issue. As mentioned by @easp, your best bet will be to add a recent NVIDIA GPU to your setup. The work a model does to generate output is often orders of magnitude faster when performed by a supported GPU. For instance, it's not unusual to see 3 or 4 tokens per second without a GPU, and as much as 90 with an NVIDIA 4080.

I will go ahead and close this now. If you think there is anything we left out, reopen the issue and we can address it. Thanks for being part of this great community.