run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Why does RagDatasetGenerator add all metadata keys to the chunk #10311

Closed Kirushikesh closed 10 months ago

Kirushikesh commented 10 months ago

Question Validation

Question

My llama_index version is 0.9.11.post1.

I was using RagDatasetGenerator to generate training data for fine-tuning my LLM. When I inspected the prompt that generates questions from a given chunk, I noticed that my original chunk was split into two chunks, and each chunk was prefixed with all of the metadata keys, which I believe is not the intended behavior. I have added code to reproduce the issue below.

from llama_index import ServiceContext, SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser

# embed_model and llm are defined elsewhere
node_parser = SimpleNodeParser.from_defaults(chunk_size=300, chunk_overlap=10)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model, node_parser=node_parser, llm=llm
)

data = SimpleDirectoryReader(input_files=["./sample.txt"]).load_data()
nodes = node_parser.get_nodes_from_documents(data)
node = nodes[0]
node

Output: TextNode(id='87d95eba-81c4-4ebb-b73d-00dd03f77bc4', embedding=None, metadata={'file_path': 'sample.txt', 'file_name': 'sample.txt', 'file_type': 'text/plain', 'file_size': 47058, 'creation_date': '2024-01-01', 'last_modified_date': '2024-01-01', 'last_accessed_date': '2024-01-27'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f69c0e46-9294-415c-b400-1993fee3bf81', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'sample.txt', 'file_name': 'sample.txt', 'file_type': 'text/plain', 'file_size': 47058, 'creation_date': '2024-01-01', 'last_modified_date': '2024-01-01', 'last_accessed_date': '2024-01-27'}, hash='51172ad137f0e05f7d03d9e07ce25398e740d862cc4a822811273e65e97d5521'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='9ece568f-6c35-436c-baea-655b87edf9a8', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='195e7c3927f9f0d6ad754873aada2b4e1fe453467917a5d06449c16897263a2b')}, hash='7a45dce34513818f97617551781e7c0d24d4b42fbe1dec0ebdd1e14ee1bd12c8', text="Shivaji Rao Gaikwad[a] (born 12 December 1950), known professionally as Rajinikanth (/rəˌdʒɪnɪˈkɑːnt/ rə-JIN-ih-KAHNTH),[b] is an Indian actor who works mainly in Tamil cinema.[5] In a career spanning over five decades, he has done 169 films that includes films in Tamil, Hindi, Telugu, Kannada, Bengali, and Malayalam. He is widely regarded to be one of the most successful and popular actors in the history of Indian cinema.[6][7] Known for his uniquely styled lines and idiosyncrasies in films, he has a huge fan base internationally and has a cult following. 
The Government of India honoured him with Padma Bhushan in 2000, Padma Vibhushan in 2016, India's third and second highest civilian honours, and highest award in the field of cinema Dadasaheb Phalke Award in 2019 for his contributions to Indian cinema.[8][9]\n\nFollowing his debut in K. Balachander's 1975 Tamil drama Apoorva Raagangal, Rajinikanth's acting career commenced with a brief phase of portraying antagonistic characters in Tamil films.", start_char_idx=0, end_char_idx=1009, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

from llama_index.llama_dataset.generator import RagDatasetGenerator

dataset_generator = RagDatasetGenerator(
    [node],
    service_context=service_context,
    show_progress=True,
    num_questions_per_chunk=2,
)
eval_dataset = dataset_generator.generate_dataset_from_nodes()

When I debugged with the logging module, I found that the LLM input was the following:

Context information is below.
---------------------
file_path: sample.txt
file_name: sample.txt
file_type: text/plain
file_size: 47058
creation_date: 2024-01-01
last_modified_date: 2024-01-01
last_accessed_date: 2024-01-27

Shivaji Rao Gaikwad[a] (born 12 December 1950), known professionally as Rajinikanth (/rəˌdʒɪnɪˈkɑːnt/ rə-JIN-ih-KAHNTH),[b] is an Indian actor who works mainly in Tamil cinema.[5] In a career spanning over five decades, he has done 169 films that includes films in Tamil, Hindi, Telugu, Kannada, Bengali, and Malayalam. He is widely regarded to be one of the most successful and popular actors in the history of Indian cinema.[6][7] Known for his uniquely styled lines and idiosyncrasies in films, he has a huge fan base internationally and has a cult following. The Government of India honoured him with Padma Bhushan in 2000, Padma Vibhushan in 2016, India's third and second highest civilian honours, and highest award in the field of cinema Dadasaheb Phalke Award in 2019 for his contributions to Indian cinema.

file_path: sample.txt
file_name: sample.txt
file_type: text/plain
file_size: 47058
creation_date: 2024-01-01
last_modified_date: 2024-01-01
last_accessed_date: 2024-01-27

[8][9]

Following his debut in K. Balachander's 1975 Tamil drama Apoorva Raagangal, Rajinikanth's acting career commenced with a brief phase of portraying antagonistic characters in Tamil films.
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
You are a Teacher/Professor. Your task is to setup 2 questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided.

As you can see, the single node passed to RagDatasetGenerator is broken into two nodes, and the metadata is added to each of the two nodes as a prefix. I believe this is not the intended behavior, since the metadata_mode parameter of RagDatasetGenerator defaults to MetadataMode.NONE, which should not add any metadata to the context.
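For reference, the way a node's content string is assembled can be sketched from the templates visible in the node dump above (text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}'). The following is a simplified stand-in for that assembly, not llama_index's actual get_content implementation:

```python
# Simplified sketch of how a node's content is built from its templates and
# metadata; NOT the actual llama_index implementation.
def get_content(text, metadata, excluded_llm_keys, metadata_mode="none"):
    if metadata_mode == "none" or not metadata:
        # MetadataMode.NONE: metadata never reaches the prompt.
        return text
    # For LLM-facing content, drop keys listed in excluded_llm_metadata_keys.
    visible = {k: v for k, v in metadata.items() if k not in excluded_llm_keys}
    if not visible:
        return text
    # metadata_template='{key}: {value}' joined by metadata_seperator='\n'
    metadata_str = "\n".join(f"{k}: {v}" for k, v in visible.items())
    # text_template='{metadata_str}\n\n{content}'
    return f"{metadata_str}\n\n{text}"

print(get_content("chunk text", {"file_path": "sample.txt"}, [], "none"))
# -> chunk text
```

Under this model, a prompt like the one logged above can only occur if the mode passed down is not NONE, or if the node's text was re-wrapped with the metadata already baked in.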

dosubot[bot] commented 10 months ago

🤖

Thank you for bringing this to our attention. It seems like you've encountered an unexpected behavior with the RagDatasetGenerator class.

The RagDatasetGenerator class is designed to generate a dataset for the RAG (Retrieval-Augmented Generation) model from a list of nodes. It uses the ServiceContext to generate questions from the nodes and creates a LabelledRagDataset from the generated questions.

The metadata_mode parameter in the RagDatasetGenerator class determines how the metadata of the nodes is used. The default value is MetadataMode.NONE, which means that the metadata should not be included in the context. However, from your description, it seems like the metadata is being included in the context even when metadata_mode is set to MetadataMode.NONE.

This could be a bug in the RagDatasetGenerator class: the metadata should only be included in the context when metadata_mode is set to MetadataMode.ALL, MetadataMode.LLM, or MetadataMode.EMBED, never when it is MetadataMode.NONE.

I would recommend checking the version of LlamaIndex you are using. The latest version is v0.9.39. If you are on an older version, updating might solve the issue; if you are already on the latest version, this may be a bug that needs to be fixed.

Unfortunately, as a technical support rep, I don't have the ability to create pull requests or issues. However, I would recommend reporting this issue to the LlamaIndex community so that it can be addressed. You can do this by creating a new issue in the LlamaIndex repository and providing a detailed description of the problem, including the code and the unexpected behavior.

I hope this helps! If you have any other questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


logan-markewich commented 10 months ago

Metadata is included by default in the content sent to the LLM.

You can disable it per node by setting:

node.excluded_llm_metadata_keys = [...]
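A minimal illustration of that workaround, applied to every node at once. A bare stand-in class is used here instead of the real llama_index TextNode so the snippet is self-contained:

```python
# Stand-in for llama_index's TextNode, for illustration only.
class Node:
    def __init__(self, text, metadata):
        self.text = text
        self.metadata = metadata
        self.excluded_llm_metadata_keys = []

nodes = [Node("chunk text", {"file_path": "sample.txt", "file_size": 47058})]

# Exclude every metadata key from LLM-facing content, as suggested above.
for node in nodes:
    node.excluded_llm_metadata_keys = list(node.metadata.keys())

print(nodes[0].excluded_llm_metadata_keys)  # -> ['file_path', 'file_size']
```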

logan-markewich commented 10 months ago

I think the culprit is somewhere around here: https://github.com/run-llama/llama_index/blob/852dcac86386c4585156ad08e7f7547f0fafedb3/llama_index/llama_dataset/generator.py#L142

The fix I suggested above would work around it.

Kirushikesh commented 10 months ago

@logan-markewich, when I inspect the metadata on my node:

print(node.metadata.keys())
# dict_keys(['file_path', 'file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'])
print(node.excluded_llm_metadata_keys)
# ['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date']

If I understood your reply correctly, you are asking me to add the metadata keys to excluded_llm_metadata_keys so that the model ignores those keys when querying, right? But as you can see, most of the keys are already included in the excluded_llm_metadata_keys field.
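Comparing the two lists directly makes the point: only file_path is not excluded, so even if the exclusions were honored, file_path alone should appear in the prompt; yet the logged prompt above contains every key.

```python
# The two lists printed above, compared directly.
metadata_keys = ['file_path', 'file_name', 'file_type', 'file_size',
                 'creation_date', 'last_modified_date', 'last_accessed_date']
excluded_llm_metadata_keys = ['file_name', 'file_type', 'file_size',
                              'creation_date', 'last_modified_date',
                              'last_accessed_date']

# Keys that should still be visible to the LLM if exclusions were applied.
remaining = sorted(set(metadata_keys) - set(excluded_llm_metadata_keys))
print(remaining)  # -> ['file_path']
```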

Kirushikesh commented 10 months ago

Also, as you already know, RagDatasetGenerator internally uses a SummaryIndex to store the nodes and generate questions from them. I tried running it manually, and this is the response I got:

from llama_index import Document, ServiceContext, SummaryIndex
from llama_index.schema import MetadataMode

index = SummaryIndex.from_documents(
    [
        Document(
            text=nodes[0].get_content(metadata_mode=MetadataMode.NONE),
            excluded_llm_metadata_keys=node.excluded_llm_metadata_keys,
            excluded_embed_metadata_keys=node.excluded_embed_metadata_keys,
            relationships=node.relationships,
        ),
    ],
    service_context=service_context,
)

query_engine = index.as_query_engine(
    service_context=service_context,
    text_qa_template=dataset_generator.get_prompts()['text_question_template'],
    use_async=True,
)
response = query_engine.query(
    "You are a Teacher/Professor. Your task is to setup 2 questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided.",
)
print(response)

When I log the prompt going to the LLM, this is what I get:

Context information is below.
---------------------
Shivaji Rao Gaikwad[a] (born 12 December 1950), known professionally as Rajinikanth (/rəˌdʒɪnɪˈkɑːnt/ rə-JIN-ih-KAHNTH),[b] is an Indian actor who works mainly in Tamil cinema.[5] In a career spanning over five decades, he has done 169 films that includes films in Tamil, Hindi, Telugu, Kannada, Bengali, and Malayalam. He is widely regarded to be one of the most successful and popular actors in the history of Indian cinema.[6][7] Known for his uniquely styled lines and idiosyncrasies in films, he has a huge fan base internationally and has a cult following. The Government of India honoured him with Padma Bhushan in 2000, Padma Vibhushan in 2016, India's third and second highest civilian honours, and highest award in the field of cinema Dadasaheb Phalke Award in 2019 for his contributions to Indian cinema.[8][9]

Following his debut in K. Balachander's 1975 Tamil drama Apoorva Raagangal, Rajinikanth's acting career commenced with a brief phase of portraying antagonistic characters in Tamil films.
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
You are a Teacher/Professor. Your task is to setup 2 questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided.

Kirushikesh commented 10 months ago

Sorry, this issue occurs on an older version of llama_index; upgrading the library to a newer version resolves it.