turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Bad output for 2080 ti #254

Open · filipemesquita opened this issue 10 months ago

filipemesquita commented 10 months ago

I am seeing suboptimal output when running on a 2080 Ti compared to an A100.

1) When running python example_basic.py with Neko-Institute-of-Science/LLaMA-7B-4bit-128g I get this:

Using a 2080 ti: Once upon a time, to 14:25(7) Views (DMAB83A6/F90-990091090091099111111110000099001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111↵111

Using an A100: Once upon a time, there was an American woman who loved her country and believed in its greatness. She understood that America is the greatest nation on earth because of what it stands for: liberty, equality, opportunity and fair play...
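
For context, example_basic.py boils down to roughly the following (paraphrased from memory of the script, so exact defaults may differ; the model directory is a placeholder). The point is that there is no prompt formatting or sampler tuning involved beyond the script's defaults:

```python
# Rough sketch of what example_basic.py does (paraphrased from memory;
# exact defaults in the repo may differ). Model path is a placeholder.
import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/models/LLaMA-7B-4bit-128g/"  # placeholder path
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)   # config from the model's config.json
config.model_path = model_path              # path to the quantized weights

model = ExLlama(config)                     # load the weights onto the GPU
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)                 # KV cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)

output = generator.generate_simple("Once upon a time,", max_new_tokens=200)
print(output)
```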

2) When running python webui/app.py with TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ I see much better results with the A100 than with the 2080 ti.

I am testing with the following instruction and parameters:

Write a 3-sentence summary about the text below. Write these 3 sentences as a numbered list starting at 1. Each number should be followed by a single sentence only.

The Diffbot Master Plan (Part One)

Our mission at Diffbot is to build the world’s first comprehensive map of human knowledge, which we call the Diffbot Knowledge Graph. We believe that the only approach that can scale and make use of all of human knowledge is an autonomous system that can read and understand all of the documents on the public web.

However, as a small startup, we couldn’t crawl the web on day one. Crawling the web is capital intensive stuff, and many a well-funded startup and large company have gone bust trying to do so. Many of those startups in the late-2000s all raised large amounts of money with no more than an idea and a team to try to build a better Google. However they were never able to build technology that is 10X better before resources ran out. Even Yahoo eventually got out of the web crawling business, effectively outsourcing their crawl to Bing. Bing was spending upwards of $1B per quarter to maintain a fast-follower position.
Parameters:

Temperature: 0.01
Top-K: off
Top-P: 0.75
Min-P: off
Typical: 0.50
Max tokens: 512
Chunk tokens: 128
End of new line: unchecked
Penalty: 1.15
Sustain: 2048
Decay: 512
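
For reference, these webui settings should map onto the generator's sampler fields roughly as below. This is a sketch based on my reading of generator.py; the exact attribute names, and the convention that 0 means "off", are assumptions on my part:

```python
# Hypothetical mapping of the webui settings above onto ExLlamaGenerator's
# sampler fields (names from exllama's generator.py as I recall them).
generator.settings.temperature = 0.01
generator.settings.top_k = 0        # assuming 0 disables top-k ("off")
generator.settings.top_p = 0.75
generator.settings.min_p = 0.0      # "off"
generator.settings.typical = 0.50
generator.settings.token_repetition_penalty_max = 1.15      # Penalty
generator.settings.token_repetition_penalty_sustain = 2048  # Sustain
generator.settings.token_repetition_penalty_decay = 512     # Decay
```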

The output on the 2080 Ti changes a lot between runs even with temperature = 0.01, while the output on the A100 barely changes. The 2080 Ti also often fails to follow the instruction (e.g., it outputs only 1-2 sentences) and produces degenerate text ("a team to try to try to try to build a better Google").
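
A quick way to make "changes a lot" concrete would be a repeatability check like this (a sketch reusing `generator` from the snippet above; `instruction` stands in for the prompt). At a near-zero temperature, sampling is nearly greedy, so repeated runs should be almost identical, and the count of distinct outputs per GPU is a useful signal:

```python
# Hypothetical repeatability check: near-greedy sampling (temperature
# 0.01) should produce almost identical text on every run, so counting
# distinct outputs per GPU quantifies the instability.
instruction = "Write a 3-sentence summary about the text below. ..."  # prompt from above

generator.settings.temperature = 0.01
outputs = set()
for _ in range(10):
    outputs.add(generator.generate_simple(instruction, max_new_tokens=128))
print(f"{len(outputs)} distinct outputs in 10 runs")
```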

Typical output of A100:

1. The article discusses the mission of Diffbot, which is to create a comprehensive map of human knowledge called the Diffbot Knowledge Graph.
2. To achieve this goal, Diffbot plans to develop an autonomous system that can read and understand all of the documents on the public web.
3. However, the author acknowledges that it's a difficult task due to the high cost involved in crawling the web, and mentions how many other companies failed despite having ample funding.

Some bad outputs from the 2080 Ti:

1. The article discusses the mission of Diffbot, which is to create a comprehensive map of human knowledge. They plan to achieve this through an autonomous system that can read and understand all of the documents on the public web.
1. The article discusses the author's goal to create a "world's first comprehensive map of human knowledge" using an autonomous system called Diffbot Knowledgeddge Graph. 2. 2. The author explains that this task is difficult due to the high cost of crawling the web, and many other companies have failed in attempting it. 3. The author mentions that Diffbot has been able to overcome this challenge through its autonomous system that can read and understand all of the documents on the public web.
1. The Diffbot Master Plan is a plan to create a comprehensive map of human knowledge through an autonomous system that can read and understand all of the documents on the public web.
2. However, as a small startup, they could not crawl the web on day one. Crawling the web is capital intensive stuff, and many a well-funded startup and large company have went bust trying to do so.
3. However, even Yahoo eventually got out of the web crawling business, effectively outsouring their crawl their crawl to Bing. Bing was spending upwards of $1B of $1B per quartquarter quarter.
mqudsi commented 7 months ago

FWIW, I just loaded an LLM locally for the first time, on an RTX 2080, and got perfect results with the latest exllamav2 and CUDA/torch, using an exl2 quantization at 5bpw. (But it's horrendously slow.)
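
For anyone wanting to try the same comparison, loading an exl2 quant looks roughly like this with exllamav2 (a sketch based on the v2 repo's examples as I remember them; the model path and 5bpw quant are placeholders):

```python
# Sketch of loading an exl2 quant with exllamav2 (API per the repo's
# examples as I recall them; the model directory is a placeholder).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/llama-7b-exl2-5bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)   # load, splitting across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.01

print(generator.generate_simple("Once upon a time,", settings, 128))
```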