microsoft / LLMLingua

To speed up LLM inference and enhance the model's perception of key information, LLMLingua compresses the prompt and KV-cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Bug]: Compression truncates words and sentences #104

Open younes-io opened 4 months ago

younes-io commented 4 months ago

Describe the bug

I used the code from the README and from the notebook. See the code below.

Steps to reproduce


from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Neo4jVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

# embeddings, url, username, and password are defined elsewhere
documents = TextLoader(
    "./docs/long_legal_text.txt",
    encoding="utf-8",
).load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)

hybrid_db = Neo4jVector.from_documents(
                texts,
                embeddings,
                url=url,
                username=username,
                password=password,            
                search_type="hybrid",
                pre_delete_collection=True,
                index_name="index_name_llm_lingua",
                keyword_index_name="keyword_name_llm_lingua"
            )

retriever = hybrid_db.as_retriever(search_kwargs={'k': 8})

# Query (in French): "a mining company, CHINAHCC, which obtained status A in 2020
# and made a net profit of 1.5 million euros in 2023, wants to know how much
# tax it will pay for 2023"
query = "une société minière CHINAHCC qui avait obtenu le statut A en 2020 et ayant realisé un benefice net de 1.5 million d'euros en 2023, souhaite savoir combien d'impôts elle va payer en 2023 ?"

docs = retriever.get_relevant_documents(query)
pretty_print_docs(docs)  # I get a list of docs with the right answer (tricky, because the docs also contain other tax rates that do not apply to the company's situation)

## Compression code starts here...

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(
    query
)
pretty_print_docs(compressed_docs)  # I get weird characters and even truncated words/sentences...

I get this for example:

-20%, tit�erc ouvert à comp duvier 20.- taux spc de 15%é aux soci installesélérationrielle » et à celles ayant le stat A » esté comme :-,25 tit�ercuvert à janvier 2023 ; 17,50%, au titre de l�exercice ouvert à du 2024 ; -,%, aure de l’ice ou àtervier 2025#2>

Expected Behavior

My original Neo4j retriever does return the data correctly in UTF-8 (important, since I work with French-language text), but after compression it's a mess, unfortunately...

For example, after compression I get "à comp duvier 20." (meaningless), which was originally "à compter du 1 janvier 2024".
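One plausible explanation for the "�" characters (an editorial sketch, not from the original report): GPT-2 tokenizes at the byte level, so a compressor that drops individual tokens can cut a multi-byte UTF-8 character in half. A minimal stdlib demonstration of that failure mode:

```python
# "à" encodes to two UTF-8 bytes: 0xC3 0xA0
word = "à"
data = word.encode("utf-8")

# Keeping only the first byte, as a byte-level token dropper might,
# leaves an incomplete sequence that decodes to U+FFFD ("�").
broken = data[:1].decode("utf-8", errors="replace")
print(broken)  # �
```

This is why accented French text is especially prone to this kind of corruption, while plain-ASCII English survives token dropping with words merely missing.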

Logs

No response

Additional Information

LLMLingua Version: 0.1.6
Operating System: WSL2 (in Docker)
Python Version:

iofu728 commented 4 months ago

Hi @younes-io, thank you for your support and the detailed issue information.

Although prompts compressed by LLMLingua can contain garbled text and be hard for humans to read, I acknowledge that "à comp duvier 20." has indeed lost crucial information.

However, I suspect this is due to the weaker semantic capability of GPT-2. You might consider using LLaMA or another small language model (SLM) as the compressor, e.g. compressor = LLMLinguaCompressor(model_name="NousResearch/Llama-2-7b-hf", device_map="cpu").

younes-io commented 4 months ago

@iofu728: Thanks for the feedback. I tried "NousResearch/Llama-2-7b-hf", but it's too heavy for my testing purposes; I'd have to allocate more resources. Is there anything more lightweight I could use?

iofu728 commented 4 months ago

Hi @younes-io, maybe you can try the "microsoft/phi-2" model.