parthsarthi03 / raptor

The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
https://arxiv.org/abs/2401.18059
MIT License

Want to use RAPTOR for legal research. How to add legislation citation? #11

Closed 3CE8D2BAC65BDD6AA9 closed 3 months ago

3CE8D2BAC65BDD6AA9 commented 3 months ago

First of all, thanks for publishing the paper and the Python code. Both are easy to follow. I am trying to use RAPTOR to build a backend for legal research. I fed in the legislation with its section numbers, but the section number information is lost in the summarization step. Should I amend the ChatGPT prompt to keep the section numbers? What are your recommendations for adapting RAPTOR to legal research that requires citing section numbers and legislation names?

parthsarthi03 commented 3 months ago

Yes, I would recommend defining your own summarization model with a custom prompt, for example:

from raptor import RetrievalAugmentation, RetrievalAugmentationConfig, BaseSummarizationModel
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential

class LegalGPT3TurboSummarizationModel(BaseSummarizationModel):
    def __init__(self, model="gpt-3.5-turbo"):
        self.model = model

    @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
    def summarize(self, context, max_tokens=500, stop_sequence=None):
        try:
            client = OpenAI()
            response = client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a helpful legal research assistant."},
                    {
                        "role": "user",
                        "content": f"Write a summary of the following legal text, making sure to include all relevant section numbers and legislation names in the summary: {context}",
                    },
                ],
                max_tokens=max_tokens,
            )
            return response.choices[0].message.content
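        except Exception as e:
            # Log and re-raise so the @retry decorator can actually retry the call.
            # Returning the exception would store it in the tree as if it were a summary.
            print(e)
            raise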

# `text` must already hold the legislation as a single string, section numbers included.
RAC = RetrievalAugmentationConfig(summarization_model=LegalGPT3TurboSummarizationModel())
RA = RetrievalAugmentation(config=RAC)
RA.add_documents(text)

You can experiment with different user and system prompts, or try stronger models such as GPT-4, Claude, Gemini, or Mistral, which may perform better on legal texts. If you have a sample dataset of legal questions and answers, you could also use DSPy (https://github.com/stanfordnlp/dspy) to automatically optimize the prompt for your specific use case.
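When comparing prompts or models, it may help to check mechanically whether the section numbers in a chunk survive into its summary. The following is only a rough sketch and not part of RAPTOR: the regex and the helper names `extract_sections` and `missing_sections` are assumptions made for illustration, and the pattern only covers simple citation styles such as "Section 12", "s. 4(2)" or "§ 7A".

```python
import re

# Hypothetical pattern (an assumption, not part of RAPTOR): matches citations
# such as "Section 12", "sec. 3", "s. 4(2)" or "§ 7A". Adjust it to the
# citation style used in your jurisdiction.
SECTION_PATTERN = re.compile(
    r"(?:\bsections?|\bsec\.|\bs\.|§)\s*(\d+[A-Za-z]?(?:\(\w+\))*)",
    re.IGNORECASE,
)


def extract_sections(text):
    """Return the set of section identifiers cited in `text`."""
    return {match.group(1) for match in SECTION_PATTERN.finditer(text)}


def missing_sections(source, summary):
    """Return, sorted, the section identifiers cited in `source` but not in `summary`."""
    return sorted(extract_sections(source) - extract_sections(summary))


if __name__ == "__main__":
    source = "Section 12 of the Data Act applies. See also s. 4(2) and § 7A."
    summary = "The Data Act applies under Section 12 and § 7A."
    # Lists the citations that the summary dropped.
    print(missing_sections(source, summary))
```

A check like this could be run on each (chunk, summary) pair produced by your summarization model, so that prompts which drop citations can be spotted before the tree is used for retrieval.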

parthsarthi03 commented 3 months ago

I am closing this issue for now. If you have any more questions or have other issues, please feel free to reopen it.