run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: #11348

Closed Ray0907 closed 5 months ago

Ray0907 commented 8 months ago

Bug Description

I tried to use the sample code from the docs, but I encountered the error below:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
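A minimal sketch of the debugging step the error message suggests: CUDA_LAUNCH_BLOCKING is a standard PyTorch environment variable, and it has to be set before CUDA is initialized, so set it before importing torch.

import os

# Make CUDA kernel launches synchronous so the stack trace points at the
# kernel that actually triggered the device-side assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch only after the variable is set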

Version

0.10.12

Steps to Reproduce

Here's my code.

import torch
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.25, "do_sample": False},
     tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # load the model weights in float16 to reduce GPU memory usage
    model_kwargs={"torch_dtype": torch.float16}
)

Settings.chunk_size = 512
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-mpnet-base-v2")
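
The error is then raised when building the index over a local ./data folder, as shown in the traceback below:

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)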

Relevant Logs/Tracebacks

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 3
      1 torch.backends.cudnn.enable =True
      2 documents = SimpleDirectoryReader("./data").load_data()
----> 3 index = VectorStoreIndex.from_documents(documents)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\indices\base.py:142, in BaseIndex.from_documents(cls, documents, storage_context, show_progress, callback_manager, transformations, service_context, **kwargs)
    133     docstore.set_document_hash(doc.get_doc_id(), doc.hash)
    135 nodes = run_transformations(
    136     documents,  # type: ignore
    137     transformations,
    138     show_progress=show_progress,
    139     **kwargs,
    140 )
--> 142 return cls(
    143     nodes=nodes,
    144     storage_context=storage_context,
    145     callback_manager=callback_manager,
    146     show_progress=show_progress,
    147     transformations=transformations,
    148     service_context=service_context,
    149     **kwargs,
    150 )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\indices\vector_store\base.py:74, in VectorStoreIndex.__init__(self, nodes, use_async, store_nodes_override, embed_model, insert_batch_size, objects, index_struct, storage_context, callback_manager, transformations, show_progress, service_context, **kwargs)
     67 self._embed_model = (
     68     resolve_embed_model(embed_model, callback_manager=callback_manager)
     69     if embed_model
     70     else embed_model_from_settings_or_context(Settings, service_context)
     71 )
     73 self._insert_batch_size = insert_batch_size
---> 74 super().__init__(
     75     nodes=nodes,
     76     index_struct=index_struct,
     77     service_context=service_context,
     78     storage_context=storage_context,
     79     show_progress=show_progress,
     80     objects=objects,
     81     callback_manager=callback_manager,
     82     transformations=transformations,
     83     **kwargs,
     84 )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\indices\base.py:91, in BaseIndex.__init__(self, nodes, objects, index_struct, storage_context, callback_manager, transformations, show_progress, service_context, **kwargs)
     89 if index_struct is None:
     90     nodes = nodes or []
---> 91     index_struct = self.build_index_from_nodes(
     92         nodes + objects  # type: ignore
     93     )
     94 self._index_struct = index_struct
     95 self._storage_context.index_store.add_index_struct(self._index_struct)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\indices\vector_store\base.py:307, in VectorStoreIndex.build_index_from_nodes(self, nodes, **insert_kwargs)
    299 if any(
    300     node.get_content(metadata_mode=MetadataMode.EMBED) == "" for node in nodes
    301 ):
    302     raise ValueError(
    303         "Cannot build index from nodes with no content. "
    304         "Please ensure all nodes have content."
    305     )
--> 307 return self._build_index_from_nodes(nodes, **insert_kwargs)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\indices\vector_store\base.py:279, in VectorStoreIndex._build_index_from_nodes(self, nodes, **insert_kwargs)
    277     run_async_tasks(tasks)
    278 else:
--> 279     self._add_nodes_to_index(
    280         index_struct,
    281         nodes,
    282         show_progress=self._show_progress,
    283         **insert_kwargs,
    284     )
    285 return index_struct

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\indices\vector_store\base.py:232, in VectorStoreIndex._add_nodes_to_index(self, index_struct, nodes, show_progress, **insert_kwargs)
    229     return
    231 for nodes_batch in iter_batch(nodes, self._insert_batch_size):
--> 232     nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
    233     new_ids = self._vector_store.add(nodes_batch, **insert_kwargs)
    235     if not self._vector_store.stores_text or self._store_nodes_override:
    236         # NOTE: if the vector store doesn't store text,
    237         # we need to add the nodes to the index struct and document store

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\indices\vector_store\base.py:140, in VectorStoreIndex._get_node_with_embedding(self, nodes, show_progress)
    129 def _get_node_with_embedding(
    130     self,
    131     nodes: Sequence[BaseNode],
    132     show_progress: bool = False,
    133 ) -> List[BaseNode]:
    134     """Get tuples of id, node, and embedding.
    135 
    136     Allows us to store these nodes in a vector store.
    137     Embeddings are called in batches.
    138 
    139     """
--> 140     id_to_embed_map = embed_nodes(
    141         nodes, self._embed_model, show_progress=show_progress
    142     )
    144     results = []
    145     for node in nodes:

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\indices\utils.py:138, in embed_nodes(nodes, embed_model, show_progress)
    135     else:
    136         id_to_embed_map[node.node_id] = node.embedding
--> 138 new_embeddings = embed_model.get_text_embedding_batch(
    139     texts_to_embed, show_progress=show_progress
    140 )
    142 for new_id, text_embedding in zip(ids_to_embed, new_embeddings):
    143     id_to_embed_map[new_id] = text_embedding

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\core\base\embeddings\base.py:255, in BaseEmbedding.get_text_embedding_batch(self, texts, show_progress, **kwargs)
    249 if idx == len(texts) - 1 or len(cur_batch) == self.embed_batch_size:
    250     # flush
    251     with self.callback_manager.event(
    252         CBEventType.EMBEDDING,
    253         payload={EventPayload.SERIALIZED: self.to_dict()},
    254     ) as event:
--> 255         embeddings = self._get_text_embeddings(cur_batch)
    256         result_embeddings.extend(embeddings)
    257         event.on_end(
    258             payload={
    259                 EventPayload.CHUNKS: cur_batch,
    260                 EventPayload.EMBEDDINGS: embeddings,
    261             },
    262         )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\embeddings\huggingface\base.py:202, in HuggingFaceEmbedding._get_text_embeddings(self, texts)
    198 """Get text embeddings."""
    199 texts = [
    200     format_text(text, self.model_name, self.text_instruction) for text in texts
    201 ]
--> 202 return self._embed(texts)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\embeddings\huggingface\base.py:161, in HuggingFaceEmbedding._embed(self, sentences)
    156 # move tokenizer inputs to device
    157 encoded_input = {
    158     key: val.to(self._device) for key, val in encoded_input.items()
    159 }
--> 161 model_output = self._model(**encoded_input)
    163 if self.pooling == Pooling.CLS:
    164     context_layer: "torch.Tensor" = model_output[0]

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\mpnet\modeling_mpnet.py:550, in MPNetModel.forward(self, input_ids, attention_mask, position_ids, head_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict, **kwargs)
    547 extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)
    549 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
--> 550 embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, inputs_embeds=inputs_embeds)
    551 encoder_outputs = self.encoder(
    552     embedding_output,
    553     attention_mask=extended_attention_mask,
   (...)
    557     return_dict=return_dict,
    558 )
    559 sequence_output = encoder_outputs[0]

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\mpnet\modeling_mpnet.py:112, in MPNetEmbeddings.forward(self, input_ids, position_ids, inputs_embeds, **kwargs)
    109 position_embeddings = self.position_embeddings(position_ids)
    111 embeddings = inputs_embeds + position_embeddings
--> 112 embeddings = self.LayerNorm(embeddings)
    113 embeddings = self.dropout(embeddings)
    114 return embeddings

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\normalization.py:201, in LayerNorm.forward(self, input)
    200 def forward(self, input: Tensor) -> Tensor:
--> 201     return F.layer_norm(
    202         input, self.normalized_shape, self.weight, self.bias, self.eps)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py:2546, in layer_norm(input, normalized_shape, weight, bias, eps)
   2542 if has_torch_function_variadic(input, weight, bias):
   2543     return handle_torch_function(
   2544         layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
   2545     )
-> 2546 return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
dosubot[bot] commented 8 months ago

🤖

Hey @Ray0907, great to see you back! Hope you're doing well amidst all the coding adventures. Let's tackle this new challenge together.

Based on the information you've provided and the similar issue I found in the LlamaIndex repository, it seems that the CUDA error you're encountering might be due to the max_length parameter in the HuggingFaceEmbedding model constructor.

In the issue I found, the user was able to resolve a similar CUDA error by setting max_length=256 in the constructor of the embedding model. This adjustment might help in your case as well.

The max_length parameter in the HuggingFaceEmbedding model constructor is used to define the maximum length of the input that will be passed to the tokenizer. If the input exceeds this length, it will be truncated to fit within the specified max_length. This is important as it ensures that the input fits within the model's limitations and doesn't cause any errors during the tokenization or model inference process.

Here's how you can adjust your code:

Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-mpnet-base-v2", max_length=256)

As for the device_map parameter in the HuggingFaceLLM constructor, it's used to specify the device where the model will be loaded. If it's set to "auto", the model will be loaded on the GPU if one is available, otherwise it will be loaded on the CPU. This parameter interacts with CUDA by determining whether the model will be loaded on the GPU (which uses CUDA for computations) or the CPU.
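
If you want to rule the GPU out entirely while debugging, one option (a sketch; it assumes HuggingFaceEmbedding accepts a device argument, which the traceback's use of self._device suggests) is to keep the embedding model on the CPU:

Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2",
    max_length=256,
    device="cpu",  # assumed kwarg; keeps the embedding forward pass off CUDA
)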

I hope this helps! If you continue to encounter issues, please provide more details about your environment (e.g., CUDA version, PyTorch version, GPU model) so we can better assist you.

logan-markewich commented 8 months ago

The error is in the embeddings. Probably some max length issue

Try this:

Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-mpnet-base-v2", max_length=512)

Ray0907 commented 8 months ago

@logan-markewich It worked, thanks for the help. A follow-up question, though: how can I get a longer response? Is there a parameter setting for that in the docs?
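
For reference, the response length in the snippet above is capped by max_new_tokens on HuggingFaceLLM. A sketch of raising it (the value is illustrative, and prompt plus output must still fit inside the 2048-token context_window):

llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=512,  # raised from 256 to allow longer generations
    generate_kwargs={"temperature": 0.25, "do_sample": False},
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    model_kwargs={"torch_dtype": torch.float16},  # float16 to reduce GPU memory
)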