turboderp / exllamav2


Integration with txtai for RAG #444

Open edwardsmith999 opened 1 month ago

edwardsmith999 commented 1 month ago

Using the tutorial here, it seems that creating a general class to wrap ExLlamaV2 allows it to be used as an LLM for RAG in txtai. I could add this as a file in the examples folder (pull request) if useful. Currently, the code below works for me for looking up the most relevant data item.


    import torch
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    from txtai.pipeline import Generation, Extractor, LLM
    from txtai.embeddings import Embeddings

    class EXL2_Generation(Generation):
        def __init__(self, path, template=None, **kwargs):
            super().__init__(path, template, **kwargs)

            # Load the EXL2 model, letting exllamav2 split the weights across available GPUs
            self.config = ExLlamaV2Config(path)
            self.model = ExLlamaV2(self.config)
            self.cache = ExLlamaV2Cache(self.model, lazy=True)
            self.model.load_autosplit(self.cache)

            # Tokenizer, generator and default sampler settings
            self.tokenizer = ExLlamaV2Tokenizer(self.config)
            self.tokenizer.eos_token = "<|endoftext|>"
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.generator = ExLlamaV2BaseGenerator(self.model, self.cache, self.tokenizer)
            self.settings = ExLlamaV2Sampler.Settings()

        def execute(self, texts, maxlength, **kwargs):
            results = []
            for text in texts:
                # Run inference
                output = self.generator.generate_simple(text, self.settings, maxlength)

                # Decode results
                output = output[0].split("<|assistant|>\n")[-1].replace("<|endoftext|>", "").strip()
                results.append(output)

            return results

    path = "Meta-Llama-3-8B-Instruct-8.0bpw-h8-exl2/"
    llm = LLM(path, method="__main__.EXL2_Generation")

    template = """<|system|>You are a friendly assistant. You answer questions from users.
    <|user|>
    Find the best matching text in the context for the question. The response should be the text from the context only.

    Question:
    {question}

    Context:
    {context}

    <|assistant|>
    """

    data = [
      "US tops 5 million confirmed virus cases",
      "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
      "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
      "The National Park Service warns against sacrificing slower friends in a bear attack",
      "Maine man wins 25 lottery ticket",
      "Make huge profits without work, earn up to $100,000 a day"
    ]

    # Create embeddings
    embeddings = Embeddings(content=True, autoid="uuid5")

    # Create an index for the list of text
    embeddings.index(data)

    # Create and run extractor instance
    extractor = Extractor(embeddings, llm, output="reference", separator="\n", template=template)

    result = extractor("Tell me something about about wildlife")
    print("REFERENCE:", embeddings.search("select id, text from txtai where id = :id", parameters={"id": result["reference"]}))

    result = extractor("Tell me something about about Canada")
    print("REFERENCE:", embeddings.search("select id, text from txtai where id = :id", parameters={"id": result["reference"]}))

which gives

    REFERENCE: [{'id': '7224f159-658b-5891-b06c-9a96cfa6a54d', 'text': 'The National Park Service warns against sacrificing slower friends in a bear attack'}]
    REFERENCE: [{'id': 'da633124-33ff-58d6-8ecb-14f7a44c042a', 'text': "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg"}]
turboderp commented 1 month ago

I'm not familiar with txtai, but I'm pretty sure there would be some more boilerplate required for this to work reliably. Also, I'm a little confused by the prompt format. It doesn't seem to be correct for Llama3-instruct?
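For reference, Llama3-instruct wraps each turn in header tokens and ends it with `<|eot_id|>`, roughly along these lines (illustrative only, with placeholder names):

    # Llama3-instruct chat format for comparison; {system_prompt} and {user_message}
    # are just illustrative placeholders
    llama3_format = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

    {system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

    {user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

    """

whereas the `<|system|>`/`<|user|>`/`<|assistant|>` tags in the snippet above come from a different model's chat template.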

edwardsmith999 commented 1 month ago

Thanks @turboderp. txtai came up as a RAG alternative to llamachain (discussed in issue #261), but llamachain seems more complex to get working.

The prompt is probably not correct for Llama3. I took the example from the txtai website for a different model and found this form at least returned the expected result. Leaving this here as a code snippet for anyone interested might be best for now.
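One way to avoid hand-writing the prompt format would be to build the template from the model's own chat template via transformers, assuming the exl2 quant directory still contains the original tokenizer_config.json with a chat_template entry (it usually does). A rough, untested sketch:

    from transformers import AutoTokenizer

    # Assumption: the quantized model directory still ships the original
    # tokenizer_config.json, so the model's own chat template can be reused
    hf_tokenizer = AutoTokenizer.from_pretrained(path)

    messages = [
        {"role": "system", "content": "You are a friendly assistant. You answer questions from users."},
        {"role": "user", "content": "Find the best matching text in the context for the question. "
                                     "The response should be the text from the context only.\n\n"
                                     "Question:\n{question}\n\nContext:\n{context}"},
    ]

    # Produces a plain string with the model's special tokens already in place,
    # leaving {question} and {context} for txtai to fill in
    template = hf_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

The post-processing in execute would then also need to strip on the model's end-of-turn token (`<|eot_id|>` for Llama3) rather than `<|endoftext|>`.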