Yup, I would recommend defining your own summarization model with a custom prompt, something like the following:
from raptor import RetrievalAugmentation, RetrievalAugmentationConfig, BaseSummarizationModel
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential


class LegalGPT3TurboSummarizationModel(BaseSummarizationModel):
    def __init__(self, model="gpt-3.5-turbo"):
        self.model = model

    # Retry with randomized exponential backoff, giving up after 6 attempts.
    @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
    def summarize(self, context, max_tokens=500, stop_sequence=None):
        try:
            client = OpenAI()  # reads OPENAI_API_KEY from the environment
            response = client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a helpful legal research assistant."},
                    {
                        "role": "user",
                        "content": f"Write a summary of the following legal text, making sure to include all relevant section numbers and legislation names in the summary: {context}",
                    },
                ],
                max_tokens=max_tokens,
            )
            return response.choices[0].message.content
        except Exception as e:
            print(e)
            # Re-raise so that @retry can retry the call; returning the
            # exception would store it in the tree as if it were a summary.
            raise


RAC = RetrievalAugmentationConfig(summarization_model=LegalGPT3TurboSummarizationModel())
RA = RetrievalAugmentation(config=RAC)
RA.add_documents(text)  # text: the legislation to index, as a single string
You can try experimenting with different content and system prompts, or even try stronger models like GPT-4, Claude, Gemini, or Mistral, which may have better performance on legal texts. If you have a sample dataset of legal questions and answers, you could also use DSPy (https://github.com/stanfordnlp/dspy) to automatically optimize the prompt for your specific use case.
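When comparing prompts or models, it may help to measure how many section references actually survive summarization. The snippet below is only a rough sketch of that idea and is not part of RAPTOR: the `SECTION_REF` pattern and the `section_refs` and `citation_retention` helpers are hypothetical names, and the pattern assumes citations written like "section 12", "s. 4A", or "Section 3(1)(a)". It only catches the first number in a list such as "sections 3 and 4", so you would need to adapt it to the citation style of your legislation.

```python
import re

# Hypothetical pattern (an assumption, adjust to your jurisdiction's style):
# matches "section 12", "sections 3", "s. 4A", "ss. 7", "Section 3(1)(a)".
SECTION_REF = re.compile(
    r"\b(?:sections?|ss?\.)\s*(\d+[A-Z]*(?:\(\w+\))*)",
    re.IGNORECASE,
)


def section_refs(text):
    """Return the set of section identifiers cited in text, e.g. {"12", "4A"}."""
    return {m.group(1) for m in SECTION_REF.finditer(text)}


def citation_retention(source, summary):
    """Fraction of section identifiers in source that also appear in summary."""
    expected = section_refs(source)
    if not expected:
        return 1.0  # nothing to preserve
    return len(expected & section_refs(summary)) / len(expected)


if __name__ == "__main__":
    source = "Section 3(1)(a) applies. See also s. 4A and section 12 of the Act."
    summary = "The Act's section 12 and s. 4A set out the duties."
    print(citation_retention(source, summary))
```

You could run `citation_retention` on each chunk and the summary your model returns for it, then keep whichever prompt scores highest on a sample of your documents.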
I am closing this issue for now. If you have any more questions or have other issues, please feel free to reopen it.
First of all, thanks for publishing the paper and the Python code. Both are easy to follow. I am trying to use RAPTOR to build a backend for legal research. I input the legislation with its section numbers, but in the summarization step the section number information is lost. Should I amend the ChatGPT prompt to keep the section numbers? What are your recommendations for adapting RAPTOR to legal research that requires citations to section numbers and legislation names?