Open 912100012 opened 1 week ago
The TopicNodeParser creates separate nodes for the summary and the original text. The summary is stored in an IndexNode, while the original text is stored in a TextNode. The TextNode includes both the summary and the original content serialized as markdown, allowing you to access the original text from there [1].
To continue talking to Dosu, mention @dosu.
Look at my code! The "nodes" I get back are a list of TextNode objects, and each node only contains the text produced by the LLM summary, with no original text. So where is the "The summary is stored in an IndexNode, while the original text is stored in a TextNode" that you described?
import os
from llama_index.core import Document
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.node_parser.topic import TopicNodeParser
from demo.TopicNodeParse.CustomEntityExtractor import CustomEntityExtractor
from pro_settings.my_settings import MySetting
os.environ["https_proxy"] = "http://localhost:60960"
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; never paste a real API key into a public issue
embed_model = OpenAIEmbedding()
llm = OpenAI(model="gpt-4o-mini")
MySetting.open_log()
documents = MySetting.get_test_doc(file='../../data/绿毛水怪mini.txt')
node_parser = TopicNodeParser.from_defaults(
    llm=llm,
    max_chunk_size=1000,
    similarity_method="llm",  # can be "llm" or "embedding"
    window_size=5,  # paper suggests window_size=5
)
text = '''
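Well, one "monster" among a class of primary-school pupils ought to have been enough, but things did not turn out that way. There was also a girl in the class who was just as clever and odd, and because she was too sharp, her mother called her "renyao" (a freak).
The name was taken up by our classmates as her nickname. Of course, generally speaking, calling a girl by a nickname is rather vulgar. So her nickname turned into a pet name that did not sound too bad, "Yaoyao". It stuck, and she did not much mind it herself.
Hey, do not laugh. I know you must have guessed by now that she is the water monster, Yang Suyao. Whatever you do, do not think I am going to tell you a made-up story about her riding a broomstick into the sky every night.
Such things do not happen, and what I am telling you is a true story. I remember that one day a new teacher came to our class. Our former homeroom teacher, Mr. Sun, had been promoted to dean of studies, and we were all thanking heaven: the heavens have eyes and have rescued us from the hands of a King of Hell.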
'''
documents = [Document(text=text)]
entity_extractor = CustomEntityExtractor()
transformations = [node_parser]
docstore = SimpleDocumentStore()
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(transformations=transformations, docstore=docstore)
nodes = pipeline.run(documents=documents)
print(nodes)
@dosu
The TopicNodeParser processes documents into topic-based nodes, and the resulting nodes typically contain summarized text. If you're only seeing summaries in your TextNode instances, it might be due to how the TopicNodeParser is configured or how the nodes are being processed in your pipeline.
To access the original text, ensure that the TextNode includes both the summary and the original content. The TextNode class has methods to handle content and metadata, which might be useful for retrieving the original text. Check if the get_content method or similar methods are being used to extract the text, as they might be configured to return only the summary.
If the original text is not being stored as expected, you might need to adjust the configuration or processing logic to ensure both the summary and original text are retained in the nodes.
Question Validation
Question
After using the TopicNodeParser, I found that the text property of each node (chunk) is the summary generated by the LLM. But I want to retrieve the original text from the document. Where is it saved? Please tell me.