run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Getting ValueError Invalid literal for base 10 while vectorizing #14535

Open SiddharthMurjani opened 2 weeks ago

SiddharthMurjani commented 2 weeks ago

Question

I am vectorizing using the code below:


import os

from llama_index.core import Document, Settings, VectorStoreIndex

# llm_class, model_type, and kwargs come from my own setup
embed_model = llm_class(api_key=os.getenv("OPENAI_API_KEY"), deployment=model_type, model=model_type, **kwargs)
Settings.embed_model = embed_model

def save_llama(content, metadata, storage_context):
    document = Document(text=content, metadata=metadata)
    VectorStoreIndex.from_documents(
        documents=[document],
        storage_context=storage_context,
        show_progress=True,
    )
dosubot[bot] commented 2 weeks ago

To resolve the ValueError: Invalid literal for base 10 error while vectorizing, ensure that the vector store is properly initialized and passed to the query. The error message indicates that a vector store is required for the vector store query.

Here is the corrected code snippet:

embed_model = llm_class(api_key=os.getenv("OPENAI_API_KEY"), deployment=model_type, model=model_type, **kwargs)
Settings.embed_model = embed_model

def save_llama(content, metadata, storage_context):
    document = Document(text=content, metadata=metadata)

    # Ensure the vector store is properly loaded or initialized
    vector_store = SomeVectorStore.load_from_disk('vector_store.json')  # Replace with actual vector store loading method

    VectorStoreIndex.from_documents(
        documents=[document],
        storage_context=storage_context,
        vector_store=vector_store,  # Pass the vector store to the query
        show_progress=True
    )

Make sure to replace SomeVectorStore.load_from_disk('vector_store.json') with the actual method to load or initialize your vector store. This should resolve the ValueError by ensuring that the vector store is provided for the vector store query [1].

To continue talking to Dosu, mention @dosu.

SiddharthMurjani commented 2 weeks ago

After debugging I found that no index_struct is being returned.

        with self._callback_manager.as_trace("index_construction"):
            if index_struct is None:
                nodes = nodes or []
                index_struct = self.build_index_from_nodes(
                    nodes + objects  # type: ignore
                )

And that's why VectorStoreIndex is not being initialized.

class VectorStoreIndex(BaseIndex[IndexDict]):
    """Vector Store Index.

    Args:
        use_async (bool): Whether to use asynchronous calls. Defaults to False.
        show_progress (bool): Whether to show tqdm progress bars. Defaults to False.
        store_nodes_override (bool): set to True to always store Node objects in index
            store and document store even if vector store keeps text. Defaults to False
    """

    index_struct_cls = IndexDict

    def __init__(
        self,
        nodes: Optional[Sequence[BaseNode]] = None,
        # vector store index params
        use_async: bool = False,
        store_nodes_override: bool = False,
        embed_model: Optional[EmbedType] = None,
        insert_batch_size: int = 2048,
        # parent class params
        objects: Optional[Sequence[IndexNode]] = None,
        index_struct: Optional[IndexDict] = None,
        storage_context: Optional[StorageContext] = None,
        callback_manager: Optional[CallbackManager] = None,
        transformations: Optional[List[TransformComponent]] = None,
        show_progress: bool = False,
        # deprecated
        service_context: Optional[ServiceContext] = None,
        **kwargs: Any,
    ) -> None:
        """Initialize params."""
        self._use_async = use_async
        self._store_nodes_override = store_nodes_override
        self._embed_model = (
            resolve_embed_model(embed_model, callback_manager=callback_manager)
            if embed_model
            else embed_model_from_settings_or_context(Settings, service_context)
        )

        self._insert_batch_size = insert_batch_size
        super().__init__(
            nodes=nodes,
            index_struct=index_struct,
            service_context=service_context,
            storage_context=storage_context,
            show_progress=show_progress,
            objects=objects,
            callback_manager=callback_manager,
            transformations=transformations,
            **kwargs,
        )
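
For reference, the "invalid literal for ... base 10" wording in the error comes from Python's built-in int() parser, which raises ValueError for any string that is not a plain base-10 integer. A minimal illustration (not LlamaIndex code, just the underlying Python behavior):

```python
# int() only accepts base-10 integer strings; anything else, such as a
# float-formatted string, raises the ValueError seen in this issue's title.
try:
    int("0.123")
    message = ""
except ValueError as exc:
    message = str(exc)

assert "invalid literal for int() with base 10" in message
```

So somewhere in the pipeline a non-integer string is being fed to int(), which is why the traceback points inside index construction rather than at the embedding call itself.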
logan-markewich commented 2 weeks ago

I don't think this is related to anything with VectorStoreIndex -- I think your embedding model is not returning raw floats for embedding values.

I don't know what llm_class is doing, but that is likely the issue.

You can easily check with embeddings = embed_model.get_text_embedding("Hello world") and ensure the returned type is a list of floats. It seems like it might be returning a numpy array or something else.
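
That check can be sketched as below; the is_raw_float_list helper is made up for illustration (only get_text_embedding is part of the LlamaIndex embedding API, and it is stubbed out here):

```python
# Hypothetical helper: verify an embedding is a plain Python list of floats.
# In practice you would pass it the result of
# embed_model.get_text_embedding("Hello world").
def is_raw_float_list(embedding) -> bool:
    return isinstance(embedding, list) and all(
        isinstance(v, float) for v in embedding
    )

assert is_raw_float_list([0.1, 0.2, 0.3])      # plain Python floats: fine
assert not is_raw_float_list((0.1, 0.2, 0.3))  # tuple, not a list
assert not is_raw_float_list(["0.1", "0.2"])   # strings of digits would hit int()/float() parsing downstream
```

If the model does return a numpy array, converting it with its .tolist() method before handing it to LlamaIndex is usually enough.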