run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.28k stars 5.18k forks

[Bug]: EntityExtractor returns list rather than expected string #11088

Closed mike4lyf closed 2 months ago

mike4lyf commented 8 months ago

Bug Description

VectorStoreIndex hits a metadata type error when EntityExtractor is used as a pipeline transformation. EntityExtractor returns a list for the entities metadata key, but vector stores that require flat metadata (e.g. Chroma) only accept values of type str, int, float, or None.

Other transformers (e.g. QuestionsAnsweredExtractor, KeywordExtractor) return strings.

Version

0.10.10

Steps to Reproduce

pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=OpenAIEmbedding(),
        ),
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
        EntityExtractor(prediction_threshold=0.5),
        KeywordExtractor(keywords=10, llm=llm),
    ]
)
nrma_home_nodes = pipeline.run(documents)

index = VectorStoreIndex(nrma_home_nodes, storage_context=storage_context)

Relevant Logs/Tracebacks

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-81-d5bf8332a4b1> in <cell line: 7>()
      5 vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
      6 storage_context = StorageContext.from_defaults(vector_store=vector_store)
----> 7 index = VectorStoreIndex(nrma_home_nodes, storage_context=storage_context)


/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/base.py in __init__(self, nodes, use_async, store_nodes_override, embed_model, insert_batch_size, objects, index_struct, storage_context, callback_manager, transformations, show_progress, service_context, **kwargs)
     72 
     73         self._insert_batch_size = insert_batch_size
---> 74         super().__init__(
     75             nodes=nodes,
     76             index_struct=index_struct,

/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/base.py in __init__(self, nodes, objects, index_struct, storage_context, callback_manager, transformations, show_progress, service_context, **kwargs)
     89             if index_struct is None:
     90                 nodes = nodes or []
---> 91                 index_struct = self.build_index_from_nodes(
     92                     nodes + objects  # type: ignore
     93                 )

/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/base.py in build_index_from_nodes(self, nodes, **insert_kwargs)
    305             )
    306 
--> 307         return self._build_index_from_nodes(nodes, **insert_kwargs)
    308 
    309     def _insert(self, nodes: Sequence[BaseNode], **insert_kwargs: Any) -> None:

/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/base.py in _build_index_from_nodes(self, nodes, **insert_kwargs)
    277             run_async_tasks(tasks)
    278         else:
--> 279             self._add_nodes_to_index(
    280                 index_struct,
    281                 nodes,

/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/base.py in _add_nodes_to_index(self, index_struct, nodes, show_progress, **insert_kwargs)
    231         for nodes_batch in iter_batch(nodes, self._insert_batch_size):
    232             nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
--> 233             new_ids = self._vector_store.add(nodes_batch, **insert_kwargs)
    234 
    235             if not self._vector_store.stores_text or self._store_nodes_override:

/usr/local/lib/python3.10/dist-packages/llama_index/vector_stores/chroma/base.py in add(self, nodes, **add_kwargs)
    236             for node in node_chunk:
    237                 embeddings.append(node.get_embedding())
--> 238                 metadata_dict = node_to_metadata_dict(
    239                     node, remove_text=True, flat_metadata=self.flat_metadata
    240                 )

/usr/local/lib/python3.10/dist-packages/llama_index/core/vector_stores/utils.py in node_to_metadata_dict(node, remove_text, text_field, flat_metadata)
     41 
     42     if flat_metadata:
---> 43         _validate_is_flat_dict(metadata)
     44 
     45     # store entire node as json string - some minor text duplication

/usr/local/lib/python3.10/dist-packages/llama_index/core/vector_stores/utils.py in _validate_is_flat_dict(metadata_dict)
     25             raise ValueError("Metadata key must be str!")
     26         if not isinstance(val, (str, int, float, type(None))):
---> 27             raise ValueError(
     28                 f"Value for metadata {key} must be one of (str, int, float, None)"
     29             )

ValueError: Value for metadata entities must be one of (str, int, float, None)
dosubot[bot] commented 8 months ago

šŸ¤–

Hey @mike4lyf, back at it again with the tech puzzles, I see! Hope you're doing well.

Thank you for reporting this issue. It seems like the EntityExtractor is returning a list of entities, which is causing a ValueError when the VectorStoreIndex tries to process the metadata. This is because VectorStoreIndex expects metadata values to be of type string or int.

The EntityExtractor class in the LlamaIndex codebase is used to extract entities from text and insert them into node metadata. It uses a default model tomaarsen/span-marker-mbert-base-multinerd and the SpanMarker library for entity extraction. The entities are then returned as a list of dictionaries. Here is the relevant code:

class EntityExtractor(BaseExtractor):
    ...
    async def aextract(self, nodes: Sequence[BaseNode]) -> List[Dict]:
        # Extract node-level entity metadata
        metadata_list: List[Dict] = [{} for _ in nodes]
        metadata_queue: Iterable[int] = get_tqdm_iterable(
            range(len(nodes)), self.show_progress, "Extracting entities"
        )
        for i in metadata_queue:
            metadata = metadata_list[i]
            node_text = nodes[i].get_content(metadata_mode=self.metadata_mode)
            words = self._tokenizer(node_text)
            spans = self._model.predict(words)
            for span in spans:
                if span["score"] > self.prediction_threshold:
                    ent_label = self.entity_map.get(span["label"], span["label"])
                    metadata_label = ent_label if self.label_entities else "entities"

                    if metadata_label not in metadata:
                        metadata[metadata_label] = set()

                    metadata[metadata_label].add(self.span_joiner.join(span["span"]))

        # convert metadata from set to list
        for metadata in metadata_list:
            for key, val in metadata.items():
                metadata[key] = list(val)

        return metadata_list

Source: llama-index-integrations/extractors/llama-index-extractors-entity/llama_index/extractors/entity/base.py

To resolve this issue, you could modify the EntityExtractor to return a string representation of the entities instead of a list. This can be done by joining the entities with a separator. Here is an example of how you could modify the EntityExtractor:

# convert metadata from set to list
for metadata in metadata_list:
    for key, val in metadata.items():
        metadata[key] = ', '.join(val)  # join entities with a comma

This will ensure that the EntityExtractor returns a string, which can be processed by the VectorStoreIndex.

Please let me know if this resolves your issue or if you need further assistance.

erlebach commented 5 months ago

What is the current import on EntityExtractor? And what is the supporting module? Thanks.
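For reference, based on the source path cited in the bot's reply (llama-index-extractors-entity / llama_index/extractors/entity/base.py), a likely import in recent llama-index versions, where extractors live in separate integration packages, is:

```python
# Assumption based on the integration package path cited above; verify
# against your installed version:
#   pip install llama-index-extractors-entity
from llama_index.extractors.entity import EntityExtractor

extractor = EntityExtractor(prediction_threshold=0.5)
```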

logan-markewich commented 2 months ago

@mike4lyf some vector stores allow storing lists, others don't

If you need it as a string, then I recommend post-processing the metadata and converting

keeping as a list is helpful for metadata filtering in some vector dbs
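The post-processing suggested above can be sketched as a small helper that flattens list-valued metadata into comma-joined strings before building the index. This is a minimal sketch assuming nodes expose a mutable metadata dict, as LlamaIndex nodes do; the helper name is illustrative, not part of the library:

```python
def flatten_list_metadata(metadata: dict) -> dict:
    """Return a copy of the metadata dict with any list values
    joined into comma-separated strings, so flat-metadata vector
    stores (e.g. Chroma) accept them."""
    return {
        key: ", ".join(str(v) for v in val) if isinstance(val, list) else val
        for key, val in metadata.items()
    }

# Applied to the pipeline output before constructing the index:
# for node in nrma_home_nodes:
#     node.metadata = flatten_list_metadata(node.metadata)
# index = VectorStoreIndex(nrma_home_nodes, storage_context=storage_context)
```

Converting at this stage keeps the extractor's list output intact for vector stores that support list metadata, and only flattens it for the ones that don't.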