oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Implement NBCE, a recent trick to extend any LLM's context length via Naive Bayes #2396

Closed kabachuha closed 1 year ago

kabachuha commented 1 year ago

Description

Naive Bayes-based Context Extension (NBCE) is a method that uses the idea of naive Bayes to extend the context length that large language models can handle, limited only by the available computing power. It can be applied to any model without fine-tuning and without depending on the model architecture, and it has linear efficiency and decent performance. It's quite fresh: the repo was created just 5 days ago.

Additional Context

You can read more about it and see its implementation here: https://github.com/bojone/NBCE/blob/main/README_en.md.

I think it would be great if you could implement NBCE in your webui, so that users can generate text based on long context from multiple sources. For example, users could input several paragraphs from different web pages or documents, and the webui would generate a summary or a continuation based on all of them. This would enable more creative and diverse text generation scenarios.

toast22a commented 1 year ago

This looks interesting. Are there any evaluation results for this method? I checked the GitHub page you linked, but I can't find anything about model performance other than this:

> Latest test results: Under 8*A800, the 7B model can handle 50k context and perform reading comprehension accurately.

jllllll commented 1 year ago

> This looks interesting. Are there any evaluation results for this method? I checked the GitHub page you linked, but I can't find anything about model performance other than this:
>
> Latest test results: Under 8*A800, the 7B model can handle 50k context and perform reading comprehension accurately.

320GB of VRAM for 50k context just isn't good enough to be worth it, in my opinion.

tensiondriven commented 1 year ago

@jllllll This would likely scale down, though; if the ratio is truly linear, then one could run 4K context with LLaMA models on commodity hardware. That alone would make it worth it, as context length is probably the #1 limiting factor we're facing right now (at least IMO).

The repo and the blog post are light on details, as far as I can tell, so it's probably too early to integrate into text-generation-webui, but we know how quickly that can change :)

If this worked with the latest bitsandbytes 4-bit breakthrough, that'd make it a no-brainer. I think it's at least worth looking into the feasibility of it.

bojone commented 1 year ago

Hello everyone, I am the author of NBCE, and I apologize if my previous statement caused some confusion. I mentioned that I used 8 * A800 to handle a 50k-length Context, but in fact it did not use all of the available memory; only about 160GB was used.

NBCE works by inputting "Query", "Context1+Query", "Context2+Query", ..., "Context_n+Query" as a batch into the LLM. Therefore, its efficiency with respect to the number of Contexts is clearly linear. Moreover, it is model-agnostic, and we can apply it to LLaMA, RWKV, or any other language model (LM).
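
Concretely, the naive Bayes derivation (writing T for the text to be generated and S_1, ..., S_n for the Contexts) reduces, up to a constant, to

$$\log p(T \mid S_1, \dots, S_n) \approx (\beta + 1)\,\overline{\log p(T \mid S_k)} \;-\; \beta \log p(T),$$

where the overline denotes a pooling over the per-Context predictions (a plain average with β = n − 1 in the pure naive Bayes case; in practice, picking the lowest-entropy, i.e. most confident, prediction works better), and β is treated as a hyperparameter.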

The test.py provided in the GitHub repo should also be easy to understand. You only need to make simple modifications to the sampling code to implement NBCE.
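
For illustration, here is a minimal sketch of what that modified sampling step can look like. It is a simplified paraphrase rather than the exact test.py code, and it assumes a Hugging Face-style causal LM whose output exposes `.logits`, with greedy decoding:

```python
import torch

@torch.no_grad()
def nbce_next_token(model, batch_input_ids, attention_mask, beta=0.25):
    """One NBCE decoding step.

    Row 0 of the batch is the bare Query; rows 1..n are Context_k + Query.
    """
    out = model(input_ids=batch_input_ids, attention_mask=attention_mask)
    logp = out.logits[:, -1].log_softmax(dim=-1)          # (n+1, vocab)

    # Pick the Context whose next-token distribution is most confident
    # (lowest entropy), then contrast it against the context-free row 0.
    entropy = -(logp.exp() * logp.clamp(min=-100)).sum(dim=-1)
    k = entropy[1:].argmin() + 1
    merged = (1 + beta) * logp[k] - beta * logp[0]

    return merged.argmax(dim=-1)                          # greedy next token id
```

The generation loop then appends the chosen token to every row of the batch and repeats, so each Context_k only ever has to fit into the model's native window together with the Query and the generated tokens.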

Hope you will enjoy it.

tensiondriven commented 1 year ago

It's worth noting that "about 160GB of VRAM is used", which still puts this out of reach of most consumer/prosumer users.

I am wondering how much VRAM would be used if the context length was limited to:

  • 4K
  • 8K
  • 12K

Perhaps given a small enough context length, this would still be feasible?

Also, how reliable is this at retrieving details in the extended context? (I was not able to read the blog post)

bojone commented 1 year ago

> It's worth noting that "about 160GB of VRAM is used", which still puts this out of reach of most consumer/prosumer users.
>
> I am wondering how much VRAM would be used if the context length was limited to:
>
>   • 4K
>   • 8K
>   • 12K
>
> Perhaps given a small enough context length, this would still be feasible?
>
> Also, how reliable is this at retrieving details in the extended context? (I was not able to read the blog post)

Since it was just for experimentation, I used bf16 and did not further quantize to save GPU memory. Based on my previous tests, a single A800 card can handle around 12k Context.

I don't have any experience with quantization acceleration, so I can't give a precise minimum GPU memory requirement. However, NBCE's efficiency with respect to the number of Contexts is indeed linear, so you can estimate the required GPU memory based on this characteristic, i.e., by estimating the GPU memory needed to run "Query", "Context1+Query", "Context2+Query", ..., "Context_n+Query" as a batch through the LLM.

Additionally, we can trade time for space, i.e., sequentially input "Query", "Context1+Query", "Context2+Query", ..., "Context_n+Query" into the LLM for prediction (batch_size=1), and then aggregate the results. In this way, there is almost no increase in GPU memory consumption, but the speed will be significantly reduced.
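
For illustration, a rough sketch of this time-for-space variant (names and prompt handling are illustrative, assuming a Hugging Face-style tokenizer and causal LM; the merging rule is the same as in the sketch above):

```python
import torch

@torch.no_grad()
def nbce_next_token_sequential(model, tokenizer, query, contexts, beta=0.25, device="cuda"):
    """One NBCE decoding step with batch_size=1: run each prompt separately,
    then aggregate the final-position predictions."""
    logps = []
    for text in [query] + [c + query for c in contexts]:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        logits = model(input_ids=ids).logits[:, -1]        # (1, vocab)
        logps.append(logits.log_softmax(dim=-1).squeeze(0))
    logp = torch.stack(logps)                              # (n+1, vocab)

    entropy = -(logp.exp() * logp.clamp(min=-100)).sum(dim=-1)
    k = entropy[1:].argmin() + 1                           # most confident Context
    merged = (1 + beta) * logp[k] - beta * logp[0]
    return merged.argmax(dim=-1)
```

Only one "Context_k + Query" sequence is resident on the GPU at a time, which is where the memory saving comes from; the cost is n+1 forward passes per generated token.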

bojone commented 1 year ago

> Also, how reliable is this at retrieving details in the extended context? (I was not able to read the blog post)

As I mainly work on Chinese NLP, I have not conducted a systematic comparison. In our limited test cases, the general conclusion is: NBCE performs better than PCW (Parallel Context Window), especially when the total Context length exceeds three times the training length. In such cases, PCW tends to produce random and imprecise results. However, NBCE can still provide correct answers until the total Context length reaches 50k.

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

amurtadha commented 10 months ago

@toast22a I have conducted a comparative evaluation and the results look promising. I will share them with you guys soon.