run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: Incorporate Xorbits Inference for In-Production Distributed Deployment #6845

Closed: Bojun-Feng closed this issue 11 months ago

Bojun-Feng commented 1 year ago

Feature Description

We should incorporate a distributed deployment framework like Xorbits Inference so we can scale deployments with custom models. From what I've heard, someone is working on incorporating it into LangChain, but I thought it would be nice to have a separate plugin directly for this repository.

Value of Feature

This would be beneficial to companies that want to improve the performance and scalability of their applications without relying solely on cloud-based solutions.

jon-chuang commented 1 year ago

Hi @Bojun-Feng , thanks for the suggestion. May I know how this might compare to using Ray serve?

Further, LangChain and LlamaIndex do not just provide custom models; they also provide a pipeline built around various inference endpoints and models. May I know what Xorbits is optimized for and whether it could support this use case?

Definitely open to contributions and to hearing more details.

Bojun-Feng commented 1 year ago

Thank you for your response! Regarding your first question, Ray Serve is primarily focused on deployment, whereas we focus not only on deployment but also on the generative aspect itself. Our framework can be seen as a GGML counterpart to the llama_index repository's huggingface.py file, offering similar functionality.

We are excited to introduce our distributed GGML framework, which enables parallel inference of multiple instances of GGML models. With GGML models, we can achieve satisfactory results with reduced computing power, allowing the models to run on consumer devices and increasing the number of potential users. We are also actively working on supporting PyTorch to get the best of both worlds.

Regarding LLMs, we provide an API similar to OpenAI's, ensuring compatibility with existing models. Our framework supports both local and distributed inference, providing flexibility in deployment options. All of these features make it easy to transition to Xorbits Inference.
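
As a rough illustration only (the route, port, and model UID below are placeholders rather than documented values), pointing the standard `openai` Python client at such an endpoint could look like this:

```
# Illustrative sketch: the base URL, route, and model UID are placeholders,
# not values taken from the Xorbits documentation.
import openai

openai.api_key = "not-needed-for-a-local-endpoint"
openai.api_base = "http://localhost:9937/v1"  # assumed OpenAI-compatible route

completion = openai.Completion.create(
    model="my-local-model-uid",  # hypothetical model UID
    prompt="Explain Euclid's theorem in one short paragraph:",
    max_tokens=256,
)
print(completion["choices"][0]["text"])
```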

Thank you again for your response. I hope this helps answer your questions. If anything is unclear, please reach out and I will respond as soon as possible.

jon-chuang commented 1 year ago

Hello @Bojun-Feng, thanks again for the explanation.

I am interested in this idea of a distributed GGML framework. To my understanding, GGML only needs to allocate incremental scratch-space memory and can benefit from shared memory for the weights. In our experience, models at the 13B scale and below do not produce high-quality outputs.

May I know if you have managed to serve better models such as Falcon 40B? Further, are you able to test models like Orca 30B or MPT-30B? To my understanding, these models are not officially supported by GGML. However, I think it is worth looking into 4-bit quantized models via bitsandbytes, which should work with any Hugging Face transformer.
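
For reference, here is a minimal sketch of what 4-bit loading through bitsandbytes and transformers could look like; the model ID and generation settings are placeholders, not something benchmarked here:

```
# Minimal sketch of 4-bit quantized loading via bitsandbytes + transformers.
# Model ID and generation settings are illustrative placeholders only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b-instruct"  # any causal LM on the Hugging Face Hub

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",       # spread layers across available GPUs
    trust_remote_code=True,  # Falcon ships custom modeling code
)

inputs = tokenizer("Summarize Euclid's theorem:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```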

It would be great if you could open up one of your endpoints so we can test the inference speed of a llama_index pipeline on something like 4-bit quantized Falcon 40B. If it is fast and easy to deploy on a high-end local machine (e.g. someone's desktop), it could be a good setup for development, or even for production at a later time.

Here are some examples of 4-bit 40B models:

  1. https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ
  2. https://huggingface.co/TheBloke/falcon-40b-instruct-GGML

To my understanding these are actually impractical at the moment. For instance, I am seeing 0.7 tokens/s. However, I am definitely keen to see how these local models will improve in performance and resource requirements over time.

Bojun-Feng commented 1 year ago

Hello @jon-chuang, thank you for the feedback!

We have tested Vicuna 33B v1.3 on a single RTX 3090 Ti with cuBLAS using Xorbits Inference and measured at least 20 tokens per second in every inference trial. Vicuna 33B is a powerful model that ranks highly on LLM leaderboards and is capable of handling complex tasks, so we were very happy to see a model of this size running on a consumer graphics card at this rate.

The mean inference speed was 22.19 tokens per second, with a standard deviation of 1.11. We are also actively working on other large models, such as Falcon 40B and the Llama 2 series.

Here is an example trial:

Prompt:

This is a classic proof of why there exists an infinite number of prime numbers named Euclid's theorem, explained by an expert math professor in a step-by-step fashion:

Response:

Euclid's Theorem states that there are infinitely many primes. To prove this, we will use a proof by contradiction. Suppose that there are only finitely many primes, p1, p2, ..., pm. Now, consider the number N = p1 * p2 * ... * pm + 1.

Since all prime factors of N have been used (p1, p2, ..., pm), and N is just one larger than their product, N cannot be divisible by any of these primes. Thus, N is a prime number that is not among the first m primes. This contradicts our assumption that there are only finitely many primes. Therefore, there must be infinitely many prime numbers.

To Reproduce

1. Install [Xorbits inference](https://github.com/xorbitsai/inference) and [Llama-cpp-python](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal) with cuBLAS
2. Run `xinference --host "localhost" --port 9937` in a terminal
3. Run the Python code provided below using IPython in a new terminal

If you chose to use a different port, remember to also change the port number in the Python code.

Python Code:

```
import time

from xinference.client import Client

client = Client("http://localhost:9937")
model_uid = client.launch_model(
    model_name="vicuna-v1.3",
    model_size_in_billions=33,
    quantization="q4_0",
)
model = client.get_model(model_uid)

rounds = 5
times = []
tokens = []
speeds = []

for i in range(rounds):
    start = time.time()
    output = model.generate(
        prompt=(
            "This is a classic proof of why there exists an infinite number "
            "of prime numbers named Euclid's theorem, explained by an expert "
            "math professor in a step-by-step fashion:"
        ),
        generate_config={'max_tokens': 512, 'stream': False},
    )
    end = time.time()
    times.append(end - start)
    tokens.append(output['usage']['completion_tokens'])
    speeds.append(tokens[i] / times[i])
    print(f"{i+1}/{rounds}")
    print(output['choices'][0]['text'])
    print("Time taken:", times[i])
    print("Tokens generated:", tokens[i])
    print("Tokens per second:", speeds[i])
    print("-" * 20)

avg = sum(speeds) / len(speeds)
var = sum([(speeds[i] - avg) ** 2 for i in range(rounds)]) / len(speeds)
std = var ** 0.5

print("\n\n")
print("Average speed (tokens per second):", avg)
print("Variance:", var)
print("Standard deviation:", std)
print("\n\n")
print("Times: ", times)
print("Tokens: ", tokens)
print("Speeds: ", speeds)
```

More Data & Statistics

| Trial | Time (s) | Tokens | Speed (tokens/s) |
| --- | --- | --- | --- |
| 1 | 7.36 | 164 | 22.29 |
| 2 | 23.83 | 499 | 20.94 |
| 3 | 24.52 | 512 | 20.88 |
| 4 | 12.19 | 278 | 22.81 |
| 5 | 22.03 | 467 | 21.20 |
| 6 | 8.04 | 164 | 20.40 |
| 7 | 6.21 | 150 | 24.17 |
| 8 | 10.22 | 237 | 23.19 |
| 9 | 11.12 | 256 | 23.02 |
| 10 | 16.45 | 363 | 22.07 |
| 11 | 4.94 | 121 | 24.48 |
| 12 | 15.78 | 350 | 22.18 |
| 13 | 12.13 | 277 | 22.84 |
| 14 | 18.10 | 394 | 21.77 |
| 15 | 11.27 | 259 | 22.98 |
| 16 | 24.56 | 512 | 20.85 |
| 17 | 18.25 | 397 | 21.76 |
| 18 | 24.56 | 512 | 20.85 |
| 19 | 10.57 | 244 | 23.09 |
| 20 | 16.14 | 356 | 22.06 |
| Average | 14.91 | 325.6 | 22.19 |
| Standard deviation | 6.32 | 126.56 | 1.11 |

Bojun-Feng commented 1 year ago

Hi @jon-chuang , just wanted to follow up regarding our previous correspondence. Was the information helpful in answering your questions? Please let me know if you have any other questions or concerns.

jon-chuang commented 1 year ago

Hello @Bojun-Feng, we are especially excited about the Llama 2 models. Anyway, feel free to make a PR to create an LLM from the Xorbits inference API.

https://github.com/jerryjliu/llama_index/blob/main/llama_index/llms/base.py
https://github.com/jerryjliu/llama_index/blob/main/llama_index/llms/replicate.py

Following the latter (replicate.py) could be a good start.
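
As a very rough starting point (the class name, context-window value, and base-class details below are assumptions; please check them against base.py and replicate.py linked above), a wrapper around the xinference client might look something like:

```
# Rough sketch only: class name, context window, and base-class signatures
# are assumptions; verify against llama_index/llms/base.py and replicate.py.
from llama_index.llms.base import CompletionResponse, LLMMetadata
from llama_index.llms.custom import CustomLLM
from xinference.client import Client


class XinferenceLLM(CustomLLM):  # hypothetical name, not an official class
    def __init__(self, endpoint: str, model_uid: str, max_tokens: int = 512):
        self._model = Client(endpoint).get_model(model_uid)
        self._max_tokens = max_tokens

    @property
    def metadata(self) -> LLMMetadata:
        # Placeholder values; should reflect the launched model.
        return LLMMetadata(context_window=2048, num_output=self._max_tokens)

    def complete(self, prompt: str, **kwargs) -> CompletionResponse:
        output = self._model.generate(
            prompt=prompt,
            generate_config={"max_tokens": self._max_tokens, "stream": False},
        )
        return CompletionResponse(text=output["choices"][0]["text"])

    def stream_complete(self, prompt: str, **kwargs):
        raise NotImplementedError("Streaming is omitted from this sketch.")
```

The key pattern, mirroring replicate.py, is that `complete()` delegates generation to the remote endpoint and wraps the returned text in a `CompletionResponse`, so the rest of a llama_index pipeline does not need to know where inference actually runs.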

Bojun-Feng commented 1 year ago

Hi @jon-chuang, thank you so much for this information! I am glad to share that Xorbits inference is already compatible with the 7B and 13B Llama 2 models, and we are working on the 70B model right now. Looking forward to seeing what llama_index and Llama 2 70B can do together!

dosubot[bot] commented 1 year ago

Hi, @Bojun-Feng! I'm Dosu, and I'm helping the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on the discussion between you and jon-chuang, you covered the incorporation of Xorbits Inference for in-production distributed deployment. You mentioned that Xorbits focuses on both deployment and the generative aspect, providing functionality similar to LlamaIndex's. You also mentioned successful tests with Vicuna 33B and ongoing work on other large models. jon-chuang expressed interest in the distributed GGML framework and suggested making a PR to create an LLM from the Xorbits inference API. You responded positively and mentioned compatibility with Llama 2 models.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contributions to the LlamaIndex repository!