ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

GPU Usage Never Exceeds 70% When Using LLaMA 3:8B with Ollama #6163

Closed drspam1991 closed 1 week ago

drspam1991 commented 3 months ago

What is the issue?

When using Ollama with the LLaMA 3:8B model and all 33 layers offloaded to the GPU, GPU usage never goes above 70%. This seems suboptimal and may indicate an issue with how resources are being utilized.

Screenshot from 2024-08-04 17-24-02

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.2.8

rick-github commented 3 months ago

What's the prompt? What's the CPU usage? What's the output of nvidia-smi -q -d POWER,TEMPERATURE? What's the output of nvidia-smi -q | grep -A9 'Clocks Throttle Reasons'?

drspam1991 commented 3 months ago

Thank you for the reply.

I use the 'hey' load-testing tool to send parallel requests to Ollama with this prompt: "Consider the following Text and list of Topics: \n Text: 'مقایسه آماری ۲ دوره متوالی حراج هنر مدرن و معاصر/ چرا فروش کلی کاهش یافت؟\n' \n Topics: ( Painting|Sculpture|Photography|Drawing|Digital Art|Visual Arts|Theater|Dance|Music|Opera|Performing Arts|Poetry|Fiction|Non-Fiction|Play Script|Literary Arts|Art Movements|Art Styles|Censorship|Art Sales|Book|Events and Festivals|Graphic Design|Interior Design|Industrial Design|Fashion Design|Crafts|Architecture|Applied Arts|International Trade|Unemployment|Learning|Financial Crime|Book|Tax|Inflation|Import/Export|Currency|Banking|Accounting|Blockchain|Real Estate|Valuable Metals|Ministry of Economy|Ministry of Industry|Exchange|Brokerage|Marketing|Labor Market|Labor Migration|Wages and Benefits|Insurance|Investing|Saving|Retirement|Personal Finance|Stock Market|Forex Market|Crypto Currency|Gold|Financial Markets ). \n Choose most related topic(s) to the Text from the list. Your output must be topics name from the list with format [t1, ...] and say nothing else."
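For reference, such a test might look like the sketch below. The exact invocation was not given in the thread, so the hey flags, the request concurrency, and the /api/generate endpoint are all assumptions; the prompt is shortened here (the full one is quoted above).

```shell
# Build a JSON request body for Ollama's /api/generate endpoint
# (prompt shortened; real prompt is quoted in the thread).
cat > /tmp/payload.json <<'EOF'
{"model": "llama3:8b", "prompt": "Consider the following Text and list of Topics ...", "stream": false}
EOF

# Fire 100 requests at concurrency 5 with hey (server must be running,
# so the command itself is left commented out here):
# hey -n 100 -c 5 -m POST -T application/json -D /tmp/payload.json http://localhost:11434/api/generate
```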

This is my CPU usage during test: Screenshot from 2024-08-04 21-19-24

Output of nvidia-smi -q -d POWER,TEMPERATURE:

==============NVSMI LOG==============

Timestamp                         : Sun Aug 4 21:13:32 2024
Driver Version                    : 545.29.06
CUDA Version                      : 12.3

Attached GPUs                     : 1
GPU 00000000:0B:00.0
    Temperature
        GPU Current Temp          : 54 C
        GPU T.Limit Temp          : N/A
        GPU Shutdown Temp         : 100 C
        GPU Slowdown Temp         : 97 C
        GPU Max Operating Temp    : 88 C
        GPU Target Temperature    : 83 C
        Memory Current Temp       : N/A
        Memory Max Operating Temp : N/A
    GPU Power Readings
        Power Draw                : 2.93 W
        Current Power Limit       : 225.00 W
        Requested Power Limit     : 225.00 W
        Default Power Limit       : 225.00 W
        Min Power Limit           : 125.00 W
        Max Power Limit           : 280.00 W
    Power Samples
        Duration                  : 52.45 sec
        Number of Samples         : 119
        Max                       : 42.78 W
        Min                       : 2.88 W
        Avg                       : 4.44 W
    GPU Memory Power Readings
        Power Draw                : N/A
    Module Power Readings
        Power Draw                : N/A
        Current Power Limit       : N/A
        Requested Power Limit     : N/A
        Default Power Limit       : N/A
        Min Power Limit           : N/A
        Max Power Limit           : N/A

and nvidia-smi -q | grep -A9 'Clocks Throttle Reasons' produces no output.

rick-github commented 3 months ago

Output of nvidia-smi -q -d POWER,TEMPERATURE,PERFORMANCE while the test is running? What is OLLAMA_NUM_PARALLEL set to? Can you include some server logs?

igorschlum commented 3 months ago

@drspam1991 did you try a model other than LLama3:8B? What were the results?

drspam1991 commented 3 months ago

Output of nvidia-smi -q -d POWER,TEMPERATURE,PERFORMANCE while the test is running? What is OLLAMA_NUM_PARALLEL set to? Can you include some server logs?

This is the result of nvidia-smi -q -d POWER,TEMPERATURE,PERFORMANCE while the test is running:

==============NVSMI LOG==============

Timestamp                         : Mon Aug 5 12:57:09 2024
Driver Version                    : 545.29.06
CUDA Version                      : 12.3

Attached GPUs                     : 1
GPU 00000000:0B:00.0
    Performance State             : P2
    Clocks Event Reasons
        Idle                      : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap              : Active
        HW Slowdown               : Not Active
        HW Thermal Slowdown       : Not Active
        HW Power Brake Slowdown   : Not Active
        Sync Boost                : Not Active
        SW Thermal Slowdown       : Not Active
        Display Clock Setting     : Not Active
    Temperature
        GPU Current Temp          : 77 C
        GPU T.Limit Temp          : N/A
        GPU Shutdown Temp         : 100 C
        GPU Slowdown Temp         : 97 C
        GPU Max Operating Temp    : 88 C
        GPU Target Temperature    : 83 C
        Memory Current Temp       : N/A
        Memory Max Operating Temp : N/A
    GPU Power Readings
        Power Draw                : 157.68 W
        Current Power Limit       : 225.00 W
        Requested Power Limit     : 225.00 W
        Default Power Limit       : 225.00 W
        Min Power Limit           : 125.00 W
        Max Power Limit           : 280.00 W
    Power Samples
        Duration                  : 2.37 sec
        Number of Samples         : 119
        Max                       : 246.08 W
        Min                       : 60.98 W
        Avg                       : 159.74 W
    GPU Memory Power Readings
        Power Draw                : N/A
    Module Power Readings
        Power Draw                : N/A
        Current Power Limit       : N/A
        Requested Power Limit     : N/A
        Default Power Limit       : N/A
        Min Power Limit           : N/A
        Max Power Limit           : N/A

OLLAMA_NUM_PARALLEL is set to 5. When I increase it above 5, throughput drops.
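For reference, on a systemd-managed Linux install this variable is typically set through a service override; the unit name and path below are assumptions based on the default Ollama install:

```ini
# Assumed location: /etc/systemd/system/ollama.service.d/override.conf
# (created via: sudo systemctl edit ollama.service)
[Service]
Environment="OLLAMA_NUM_PARALLEL=5"
```

followed by systemctl daemon-reload and systemctl restart ollama to apply it.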

And this is the ollama log: ollama.log

rick-github commented 2 months ago

It looks like there is some clock throttling happening: SW Power Cap : Active. I'm not sure this fully explains the low GPU usage, but it's worth looking into.

From the nvidia-smi manual page:

SW Power Cap SW Power Scaling algorithm is reducing the clocks below
 requested clocks because the GPU is consuming too much
 power. E.g. SW power cap limit can be changed with
 nvidia-smi --power-limit=

But you may not be able to do anything about this by adjusting the power-limit:

   -pl, --power-limit=POWER_LIMIT
       Specifies maximum power limit in watts.  Accepts integer and floating point numbers.
       Only on supported devices from Kepler family.  Requires administrator privileges.
       Value  needs  to be between Min and Max Power Limit as reported by nvidia-smi.

From Wikipedia:

Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012,[1] as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured in 28 nm. Kepler found use in the GK20A, the GPU component of the Tegra K1 SoC, and in the Quadro Kxxx series, the Quadro NVS 510, and Tesla computing modules.

From GEFORCE_RTX_2080_User_Guide.pdf:

The GeForce® RTX 2080 is powered by the all-new NVIDIA Turing™ architecture to give you incredible new levels of gaming realism, speed, power efficiency, and immersion. This is graphics reinvented.

I also noticed in your first screencap of nvtop that your temperature was at 81°C, and your GPU Target Temperature is 83°C, so you may also be experiencing SW Thermal Slowdown.

You can monitor the GPU temperature, power draw and clock rate in nvtop by adjusting the settings in Setup > Chart > Displayed all GPUs.
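For anyone checking this on their own machine, the active throttle flags can also be pulled out of a saved dump; a minimal sketch, where the sample lines stand in for a real nvidia-smi -q -d PERFORMANCE capture in the format shown above:

```shell
# In real use, capture a dump while the load test runs:
#   nvidia-smi -q -d PERFORMANCE > /tmp/nvsmi.txt
# Sample content standing in for a real dump:
cat > /tmp/nvsmi.txt <<'EOF'
    Clocks Event Reasons
        Idle                      : Not Active
        SW Power Cap              : Active
        HW Thermal Slowdown       : Not Active
        SW Thermal Slowdown       : Not Active
EOF

# Print only the reasons currently limiting clocks.
# The anchored pattern ': Active$' does not match 'Not Active' lines.
grep ': Active$' /tmp/nvsmi.txt
```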

pdevine commented 1 month ago

@drspam1991 did you end up sorting this out?