parthsarthi03 / raptor

The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
https://arxiv.org/abs/2401.18059
MIT License

ValueError: Input contains NaN. #3

Open · chenyujiang11 opened 7 months ago

chenyujiang11 commented 7 months ago

I encountered this error while adding text. I hope to get a solution for it. Thank you very much.

Traceback (most recent call last):
  File "/home/jyc23/raptor-master/demo/newdemo.py", line 123, in <module>
    RA.add_documents(text)
  File "/home/jyc23/raptor-master/raptor/RetrievalAugmentation.py", line 217, in add_documents
    self.tree = self.tree_builder.build_from_text(text=docs)
  File "/home/jyc23/raptor-master/raptor/tree_builder.py", line 280, in build_from_text
    root_nodes = self.construct_tree(all_nodes, all_nodes, layer_to_nodes)
  File "/home/jyc23/raptor-master/raptor/cluster_tree_builder.py", line 102, in construct_tree
    clusters = self.clustering_algorithm.perform_clustering(
  File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 194, in perform_clustering
    clusters = perform_clustering(
  File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 120, in perform_clustering
    reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
  File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 32, in global_cluster_embeddings
    reduced_embeddings = umap.UMAP(
  File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/umap/umap_.py", line 2887, in fit_transform
    self.fit(X, y, force_all_finite)
  File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/umap/umap_.py", line 2354, in fit
    X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C", force_all_finite=force_all_finite)
  File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 957, in check_array
    _assert_all_finite(
  File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 122, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 171, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains NaN.
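
For reference, the failure means at least one embedding row that reaches UMAP contains NaN. A minimal sketch of a check that could be dropped in just before the umap.UMAP(...).fit_transform call in cluster_utils.py (report_nan_embeddings is a hypothetical helper, not part of the repo):

import numpy as np

def report_nan_embeddings(embeddings):
    # Locate rows of the embedding matrix that contain NaN values.
    arr = np.asarray(embeddings, dtype=np.float32)
    nan_rows = np.where(np.isnan(arr).any(axis=1))[0]
    print(f"{len(nan_rows)} of {len(arr)} embeddings contain NaN; first offenders: {nan_rows[:10]}")
    return nan_rows

If the count is nonzero, the NaNs originate in the embedding step rather than in UMAP itself.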

parthsarthi03 commented 7 months ago

Hey! Can you provide some more details about the text you are adding? How many tokens is it?

chenyujiang11 commented 7 months ago

> Hey! Can you provide some more details about the text you are adding? How many tokens is it?

I have encountered this problem several times. The document being read is sample.txt from the demo. The LLM currently used is Qwen/Qwen-1_8B-Chat-Int4, and the embedding model is BAAI/bge-small-zh-v1.5. This bug also occurred with the demo's default embedding model, multi-qa-mpnet-base-cos-v1, and the error is still raised at the same place.
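
One quick way to narrow this down is to check whether the embedding model itself emits NaNs on the input, independent of the tree building. A minimal sketch, assuming sentence-transformers is installed and using the models named above:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-small-zh-v1.5")  # or "sentence-transformers/multi-qa-mpnet-base-cos-v1"
vec = model.encode("A short test sentence from sample.txt.")
print(np.isnan(vec).any())  # True means the model itself produces NaN embeddings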

ExtReMLapin commented 7 months ago

Reproducing example:

import os

import torch
from raptor import BaseSummarizationModel, BaseQAModel, BaseEmbeddingModel, RetrievalAugmentationConfig
from transformers import AutoTokenizer, pipeline

from huggingface_hub import login
login()
class GEMMASummarizationModel(BaseSummarizationModel):
    def __init__(self, model_name="google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the GEMMA model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.summarization_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),  # Use "cpu" if CUDA is not available
        )

    def summarize(self, context, max_tokens=150):
        # Format the prompt for summarization
        messages = [
            {"role": "user", "content": f"Write a summary of the following, including as many key details as possible: {context}:"}
        ]

        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        # Generate the summary using the pipeline
        outputs = self.summarization_pipeline(
            prompt,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )

        # Extract the generated summary; the text-generation pipeline
        # returns the prompt followed by the completion, so drop the prompt
        summary = outputs[0]["generated_text"][len(prompt):].strip()
        return summary

class GEMMAQAModel(BaseQAModel):
    def __init__(self, model_name="google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.qa_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
        )

    def answer_question(self, context, question):
        # Apply the chat template for the context and question
        messages = [
            {"role": "user", "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}"}
        ]
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        # Generate the answer using the pipeline
        outputs = self.qa_pipeline(
            prompt,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )

        # Extracting and returning the generated answer
        answer = outputs[0]["generated_text"][len(prompt):]
        return answer

from sentence_transformers import SentenceTransformer
class SBertEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        return self.model.encode(text)

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

from raptor import RetrievalAugmentation

RAC = RetrievalAugmentationConfig(
    summarization_model=GEMMASummarizationModel(),
    qa_model=GEMMAQAModel(),
    embedding_model=SBertEmbeddingModel(),
)
RA = RetrievalAugmentation(config=RAC)

with open('harry.txt', 'r', encoding="utf8") as file:
    text = file.read()
RA.add_documents(text)

SAVE_PATH = "demo/cinderella"
RA.save(SAVE_PATH)

#extract text from harry-potter-3-le-prisonnier-dazkaban.pdf

The txt data is linked in this message:

harry.txt

daniyal214 commented 6 months ago

@parthsarthi03 I'm facing the same issue. Any update on this? @chenyujiang11 @ExtReMLapin were you able to resolve this?

ExtReMLapin commented 6 months ago

Didn’t retry

Amr-Hegazy1 commented 5 months ago

I had a similar issue, and when I ran pip install -U sentence-transformers it worked fine.

ATP-BME commented 5 months ago

It seems that the error is caused by using multiprocessing when generating embeddings. Setting multiprocess=False made it work fine.
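
For reference, if an embedding wrapper batches chunks through SentenceTransformer's multi-process pool, the serial in-process path (which this workaround amounts to) would look like the sketch below; the model name is just the demo default:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")
texts = ["chunk one", "chunk two"]

# Multi-process variant (the path suspected of producing NaNs here):
# pool = model.start_multi_process_pool()
# embeddings = model.encode_multi_process(texts, pool)
# model.stop_multi_process_pool(pool)

# Serial, in-process variant:
embeddings = model.encode(texts, show_progress_bar=False)
print(embeddings.shape)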

theta-lin commented 3 months ago

> I had a similar issue, and when I ran pip install -U sentence-transformers it worked fine.

Installing sentence-transformers==2.2.2 as specified in requirements.txt gave me this issue. I solved it by upgrading to sentence-transformers==3.0.1.
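
To confirm which version is actually active in your environment:

import sentence_transformers
print(sentence_transformers.__version__)  # 2.2.2 reproduced the error here; 3.0.1 did not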