parthsarthi03 / raptor

The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
https://arxiv.org/abs/2401.18059
MIT License

TypeError: expected string or buffer #43

Open LeonMing30 opened 4 months ago

LeonMing30 commented 4 months ago

I tried to run the demo code for testing, but I got the following error.

```python
from raptor import RetrievalAugmentation

RA = RetrievalAugmentation()

with open('demo/sample.txt', 'r') as file:
    text = file.read()
RA.add_documents(text)
question = "How did Cinderella reach her happy ending?"
answer = RA.answer_question(question=question)
print("Answer: ", answer)
```
Traceback (most recent call last):
  File "D:\Code\Python\20240531\RAPTOR\raptor\demotest.py", line 13, in <module>
    RA.add_documents(text)
  File "D:\Code\Python\20240531\RAPTOR\raptor\raptor\RetrievalAugmentation.py", line 219, in add_documents
    self.tree = self.tree_builder.build_from_text(text=docs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Code\Python\20240531\RAPTOR\raptor\raptor\tree_builder.py", line 291, in build_from_text
    root_nodes = self.construct_tree(all_nodes, all_nodes, layer_to_nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Code\Python\20240531\RAPTOR\raptor\raptor\cluster_tree_builder.py", line 130, in construct_tree
    process_cluster(
  File "D:\Code\Python\20240531\RAPTOR\raptor\raptor\cluster_tree_builder.py", line 77, in process_cluster
    f"Node Texts Length: {len(self.tokenizer.encode(node_texts))}, Summarized Text Length: {len(self.tokenizer.encode(summarized_text))}"
                                                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Code\Python\20240531\RAPTOR\venv\Lib\site-packages\tiktoken\core.py", line 116, in encode
    if match := _special_token_regex(disallowed_special).search(text):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or buffer

How can I fix it?

parthsarthi03 commented 4 months ago

Hey! I am not able to reproduce the above bug. Can you print out the text before calling RA.add_documents(), and also print out RA.tree_builder.summarization_model, to make sure that the models are set correctly?

theta-lin commented 3 months ago

@LeonMing30 Hi, I encountered the same issue before I realized that the mistake was on my side. I used a custom summarization model whose output is not a simple string but a dictionary containing both the output string and some other metadata. I therefore suggest you call the summarize() method of the model you are using and check whether the return value is actually the LLM's chat output as a plain string.
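As a rough illustration of that check, here is a minimal sketch that does not import raptor. `DictSummarizer`, `StringSummarizer`, and the `"text"` key are hypothetical names made up for this example; only the `summarize()` method name comes from the discussion above, and its `(context, max_tokens)` parameters are an assumption about how a custom model might be written.

```python
class DictSummarizer:
    """Hypothetical custom model whose summarize() returns a dict, not a str."""

    def summarize(self, context, max_tokens=150):
        # Stand-in for an LLM call that returns text plus metadata.
        return {"text": context[:40], "usage": {"tokens": 12}}


class StringSummarizer:
    """Hypothetical wrapper that unwraps the dict so callers get a plain str."""

    def __init__(self, inner, key="text"):
        self.inner = inner
        self.key = key  # assumed name of the field holding the summary text

    def summarize(self, context, max_tokens=150):
        result = self.inner.summarize(context, max_tokens)
        if isinstance(result, dict):
            result = result[self.key]
        if not isinstance(result, str):
            # Fail here with a clear message instead of deep inside the tokenizer.
            raise TypeError(
                f"summarize() returned {type(result).__name__}, expected str"
            )
        return result


sample = "Cinderella went to the ball."
raw = DictSummarizer().summarize(sample)
fixed = StringSummarizer(DictSummarizer()).summarize(sample)
print(type(raw).__name__, type(fixed).__name__)
```

If the first type printed for your own model is anything other than `str`, the tokenizer call in the traceback is probably receiving that object instead of text.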

yyyf-g commented 2 months ago

I encountered the same problem.