Inconsistent/Poor memorag Performance on Medium/Large-Sized Documents #30

@qhjqhj00 again, nice repo!

I am testing with a Spanish-language legal PDF of around 150 pages (bge-m3 covers Spanish as well). Even when I memorize only the `small_part` of it shown below, `qa` gives better answers than `memorag`, although they are still not good.

Using the latest memorag version and Llama 3.1 Instruct:

import requests
import tiktoken
from PyPDF2 import PdfReader
from memorag import MemoRAG

pipe = MemoRAG(
    mem_model_name_or_path="meta-llama/Meta-Llama-3.1-8B-Instruct",
    ret_model_name_or_path="BAAI/bge-m3",
    beacon_ratio=None,
    load_in_4bit=True,
    enable_flash_attn=False, # T4 GPU does not support flash attention
    access_token='token'
)

encoding = tiktoken.get_encoding("cl100k_base")

# url = 'https://raw.githubusercontent.com/qhjqhj00/MemoRAG/main/examples/harry_potter.txt'
# response = requests.get(url)
# content = response.text

pdf_file_path = 'corpora/legal.pdf' 

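reader = PdfReader(pdf_file_path)
content = ""
for page in reader.pages:
    # Guard against pages with no extractable text, and add a newline
    # so the last word of one page does not fuse with the first of the next.
    content += (page.extract_text() or "") + "\n"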

print(f"The raw database has {len(encoding.encode(content))} tokens...")

small_part = " ".join(content.split()[:50000])
print(f"Using part of the database: with {len(encoding.encode(small_part))} tokens...")

pipe.memorize(small_part, save_dir="constitution/", print_stats=True)

For a simple question in Spanish such as "What is Article 8 about?":

`qa` --> outputs an accurate, though not very detailed, answer (in Spanish), but
`memorag` --> outputs (in Spanish) "no information on article 8 is given in the text provided"
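
One cause that can be ruled out cheaply is the article heading being lost to the 50,000-word truncation or to PDF extraction. Below is a minimal sketch of such a check, assuming headings look like "Artículo 8" or "Art. 8"; `find_article` and the sample string are hypothetical helpers for illustration and are not part of MemoRAG.

```python
import re
import unicodedata


def find_article(text, number, window=200):
    """Return a snippet starting at the first 'Artículo <number>' heading, or None.

    Assumes headings are written as 'Artículo N', 'Articulo N' or 'Art. N'.
    """
    # PDF extraction sometimes yields decomposed accents (i + combining mark);
    # NFC normalization folds them back into single characters such as 'í'.
    text = unicodedata.normalize("NFC", text)
    # (?!\d) keeps article 8 from matching article 80, 81, ...
    pattern = re.compile(rf"\bart(?:[ií]culo|\.)\s*{number}(?!\d)", re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    return text[match.start():match.start() + window]


# Hypothetical sample; in the script above, pass `small_part` instead.
sample = "Artículo 7. Uno. Artículo 80. Ochenta. Artículo 8. Ocho."
snippet = find_article(sample, 8)
print(snippet if snippet is not None else "Article 8 heading not found")
```

If the heading is missing from `small_part`, the "no information" answer would be expected from any pipeline; if it is present, the problem lies further downstream.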

May I ask, am I missing something, or is this the expected performance reported in the paper?
