run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.22k stars 5.16k forks

[Question]: Which is the original text after using TopicNodeParse? #16500

Open 912100012 opened 1 week ago

912100012 commented 1 week ago

Question Validation

Question

After using the TopicNodeParser, I found that the `text` property of each node (chunk) is an LLM-generated summary. But I want to retrieve against the original text from the document. Where is it saved? Please tell me.

dosubot[bot] commented 1 week ago

The TopicNodeParser creates separate nodes for the summary and the original text. The summary is stored in an IndexNode, while the original text is stored in a TextNode. The TextNode includes both the summary and the original content serialized as markdown, allowing you to access the original text from there [1].

To continue talking to Dosu, mention @dosu.

912100012 commented 1 week ago

Look at my code! The `nodes` I got is a list of TextNode objects; each node only contains the post-summary LLM text, with no original text. So where is the "summary is stored in an IndexNode, while the original text is stored in a TextNode" that you mentioned?

import os

from llama_index.core import Document
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.node_parser.topic import TopicNodeParser

from demo.TopicNodeParse.CustomEntityExtractor import CustomEntityExtractor
from pro_settings.my_settings import MySetting

os.environ["https_proxy"] = "http://localhost:60960"
os.environ["OPENAI_API_KEY"] = "sk-proj-..."  # key redacted; never post real API keys
embed_model = OpenAIEmbedding()
llm = OpenAI(model="gpt-4o-mini")
MySetting.open_log()

documents = MySetting.get_test_doc(file='../../data/绿毛水怪mini.txt')  # note: overwritten by the inline test document below
node_parser = TopicNodeParser.from_defaults(
    llm=llm,
    max_chunk_size=1000,
    similarity_method="llm",  # can be "llm" or "embedding"
    window_size=5,  # paper suggests window_size=5
)
text = '''
  好,在小学的一班学生之中,有了一个“怪物”就够了吧,但是事情偏不如此。班上还有个女生,也是一样的精灵古怪,因为她太精,她妈管她叫“人妖”。
  这个称呼就被同学当作她的外号了。当然了,一般来说,叫一个女生的外号是很下流的。因此她的外号就变成了一个不算难听的昵称“妖妖”。这样就被叫开了,她自己也不很反感。
  喂,你不要笑,我知道你现在一定猜出了她就是那个水怪杨素瑶。你千万不要以为我会给你讲一个杜撰的故事,说她天天夜里骑着笤帚上天。
  这样事情是不会有的,而我给你讲的是一件真事呢。我记得有那么一天,班上来了一位新老师,原来我们的班主任孙老师升了教导主任了,我们都在感谢上苍:老天有眼,把我们从一位阎王爷手底下救出来了。
'''
documents = [Document(text=text)]

entity_extractor = CustomEntityExtractor()  # defined but not added to the transformations list

transformations = [node_parser]

docstore = SimpleDocumentStore()

from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=transformations, docstore=docstore)

nodes = pipeline.run(documents=documents)

print(nodes)

@dosu

dosubot[bot] commented 1 week ago

The TopicNodeParser processes documents into topic-based nodes, and the resulting nodes typically contain summarized text. If you're only seeing summaries in your TextNode instances, it might be due to how the TopicNodeParser is configured or how the nodes are being processed in your pipeline.

To access the original text, ensure that the TextNode includes both the summary and the original content. The TextNode class has methods to handle content and metadata, which might be useful for retrieving the original text. Check if the get_content method or similar methods are being used to extract the text, as they might be configured to return only the summary.

If the original text is not being stored as expected, you might need to adjust the configuration or processing logic to ensure both the summary and original text are retained in the nodes.

To continue talking to Dosu, mention @dosu.