qhjqhj00 / MemoRAG

Empowering RAG with a memory-based data interface for all-purpose applications!
Apache License 2.0

Inconsistent/Poor memorag Performance on Medium/Large-Sized Documents #30

Closed jvel07 closed 1 month ago

jvel07 commented 1 month ago

@qhjqhj00 again, nice repo!

I am using a Spanish-language legal PDF of around 150 pages (bge-m3 covers Spanish as well); even when using only the small_part of it as shown below, qa performs better than memorag (although still not decently).

Using the latest memorag version and Llama 3.1 Instruct:

import requests
import tiktoken
from PyPDF2 import PdfReader
from memorag import MemoRAG

pipe = MemoRAG(
    mem_model_name_or_path="meta-llama/Meta-Llama-3.1-8B-Instruct",
    ret_model_name_or_path="BAAI/bge-m3",
    beacon_ratio=None,
    load_in_4bit=True,
    enable_flash_attn=False, # T4 GPU does not support flash attention
    access_token='token'
)

encoding = tiktoken.get_encoding("cl100k_base")

# url = 'https://raw.githubusercontent.com/qhjqhj00/MemoRAG/main/examples/harry_potter.txt'
# response = requests.get(url)
# content = response.text

pdf_file_path = 'corpora/legal.pdf' 

reader = PdfReader(pdf_file_path)
content = ""
for page in reader.pages:
    # extract_text() may return None for image-only pages
    content += page.extract_text() or ""

print(f"The raw database has {len(encoding.encode(content))} tokens...")

small_part = " ".join(content.split()[:50000])
print(f"Using part of the database: with {len(encoding.encode(small_part))} tokens...")

pipe.memorize(small_part, save_dir="constitution/", print_stats=True)
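One thing worth checking with scanned legal PDFs: extract_text() can return None or empty strings for image-only pages, which silently shrinks the corpus before it ever reaches memorize(). A minimal stdlib sketch of a defensive join (join_pages is a hypothetical helper, not part of MemoRAG):

```python
# Hypothetical helper (not part of MemoRAG): join per-page extractions,
# skipping pages where PyPDF2's extract_text() produced None or whitespace.
def join_pages(page_texts):
    """Join extracted page texts with newlines, skipping empty pages."""
    return "\n".join(t for t in (p or "" for p in page_texts) if t.strip())
```

If many pages come back empty, the PDF is likely scanned and needs OCR before MemoRAG can do anything useful with it.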

For a simple question in Spanish like "what is article 8 about?", the answers are poor.

May I ask: am I missing something, or is this in line with the performance reported in the paper?
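For context, the comparison above can be sketched as follows; the task_type values ("qa" and "memorag") follow the repo README, but the exact pipe(...) signature may differ across versions, so treat this as a sketch rather than the canonical API:

```python
def compare_modes(pipe, context: str, query: str, max_new_tokens: int = 256):
    """Run the same query in plain-QA and MemoRAG modes and return both answers.

    The task_type values follow the MemoRAG README; verify the pipe(...)
    signature against your installed version.
    """
    res_qa = pipe(context=context, query=query,
                  task_type="qa", max_new_tokens=max_new_tokens)
    res_memorag = pipe(context=context, query=query,
                       task_type="memorag", max_new_tokens=max_new_tokens)
    return res_qa, res_memorag

# Example (after pipe.memorize(...) above):
# qa_ans, mem_ans = compare_modes(pipe, small_part, "¿De qué trata el artículo 8?")
```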

qhjqhj00 commented 1 month ago

We have not yet tested MemoRAG on Spanish, and the built-in prompts are currently available only in English and Chinese. This may be a contributing factor to the observed performance degradation. In future development versions, I plan to test additional languages; however, I cannot guarantee optimal performance across multiple languages at this stage.
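Until multilingual prompts land, one possible workaround is to supply a Spanish prompt template yourself, assuming your MemoRAG version accepts a custom template; both the ability to override the template and the {context}/{input} placeholder names below are assumptions to verify against the built-in English/Chinese templates in your install:

```python
# Assumption: the {context} and {input} placeholder names are illustrative;
# check the built-in templates in your MemoRAG version for the names it
# actually expects before passing a custom template.
ES_QA_TEMPLATE = (
    "Eres un asistente útil. Responde a la pregunta usando solo el contexto.\n\n"
    "Contexto:\n{context}\n\n"
    "Pregunta: {input}\n"
    "Respuesta:"
)

def render_prompt(template: str, context: str, question: str) -> str:
    """Fill a QA prompt template (placeholder names are assumptions)."""
    return template.format(context=context, input=question)
```

Keeping the instruction language aligned with the corpus language may close part of the gap, though the memory model itself was also not trained on Spanish.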