Closed mike4lyf closed 2 months ago
Hey @mike4lyf, back at it again with the tech puzzles, I see! Hope you're doing well.
Thank you for reporting this issue. It seems like the EntityExtractor is returning a list of entities, which is causing a ValueError when the VectorStoreIndex tries to process the metadata. This is because VectorStoreIndex expects metadata values to be of type string or int.
The EntityExtractor class in the LlamaIndex codebase is used to extract entities from text and insert them into node metadata. It uses a default model, tomaarsen/span-marker-mbert-base-multinerd, and the SpanMarker library for entity extraction. The entities are then returned as a list of dictionaries. Here is the relevant code:
class EntityExtractor(BaseExtractor):
    ...

    async def aextract(self, nodes: Sequence[BaseNode]) -> List[Dict]:
        # Extract node-level entity metadata
        metadata_list: List[Dict] = [{} for _ in nodes]
        metadata_queue: Iterable[int] = get_tqdm_iterable(
            range(len(nodes)), self.show_progress, "Extracting entities"
        )

        for i in metadata_queue:
            metadata = metadata_list[i]
            node_text = nodes[i].get_content(metadata_mode=self.metadata_mode)
            words = self._tokenizer(node_text)
            spans = self._model.predict(words)
            for span in spans:
                if span["score"] > self.prediction_threshold:
                    ent_label = self.entity_map.get(span["label"], span["label"])
                    metadata_label = ent_label if self.label_entities else "entities"

                    if metadata_label not in metadata:
                        metadata[metadata_label] = set()

                    metadata[metadata_label].add(self.span_joiner.join(span["span"]))

        # convert metadata from set to list
        for metadata in metadata_list:
            for key, val in metadata.items():
                metadata[key] = list(val)

        return metadata_list
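To make the shape of that return value concrete, here is a minimal standalone sketch of the same accumulation logic. It does not call SpanMarker or LlamaIndex: the spans are hand-written, hypothetical data, and collect_entities is an illustrative helper rather than part of the library. It follows the branch where label_entities is off, so everything lands under the "entities" key, and it assumes a single space as the joiner. It also uses sorted() where the code above uses list(), purely so the output order is deterministic.

```python
from typing import Any, Dict, List, Set

# Hypothetical model output (made up for illustration): each span carries
# a label, a confidence score, and the matched tokens.
spans: List[Dict[str, Any]] = [
    {"label": "PER", "score": 0.98, "span": ["Ada", "Lovelace"]},
    {"label": "LOC", "score": 0.91, "span": ["London"]},
    {"label": "ORG", "score": 0.30, "span": ["Acme"]},  # below the threshold
]


def collect_entities(
    spans: List[Dict[str, Any]],
    prediction_threshold: float = 0.5,
    span_joiner: str = " ",  # assumed joiner for this sketch
) -> Dict[str, List[str]]:
    """Group confident spans under one "entities" key, then convert the set to a list."""
    found: Dict[str, Set[str]] = {}
    for span in spans:
        if span["score"] > prediction_threshold:
            found.setdefault("entities", set()).add(span_joiner.join(span["span"]))
    # sorted() instead of list() only to make the order predictable here.
    return {key: sorted(val) for key, val in found.items()}


metadata = collect_entities(spans)
print(metadata)
```

The value stored under "entities" is a Python list, which is the type the vector store rejects in the reported error.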
To resolve this issue, you could modify the EntityExtractor to return a string representation of the entities instead of a list. This can be done by joining the entities with a separator. Here is an example of how you could modify the EntityExtractor:
        # convert metadata from set to a comma-separated string
        for metadata in metadata_list:
            for key, val in metadata.items():
                metadata[key] = ', '.join(val)  # join entities with a comma
This will ensure that the EntityExtractor returns a string, which can be processed by the VectorStoreIndex.
Please let me know if this resolves your issue or if you need further assistance.
What is the current import on EntityExtractor? And what is the supporting module? Thanks.
@mike4lyf some vector stores allow storing lists, others don't.
If you need it as a string, then I recommend post-processing the metadata and converting it.
Keeping it as a list is helpful for metadata filtering in some vector DBs.
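A rough sketch of what that post-processing step could look like, written against plain dicts so it runs without LlamaIndex installed. The helper name stringify_list_metadata and the ", " separator are assumptions made for illustration; neither comes from the library.

```python
from typing import Any, Dict


def stringify_list_metadata(
    metadata: Dict[str, Any], separator: str = ", "
) -> Dict[str, Any]:
    """Return a copy of metadata with list/tuple/set values joined into strings."""
    flat: Dict[str, Any] = {}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple, set)):
            # Sets have no stable order, so sort them; lists and tuples keep theirs.
            items = sorted(value, key=str) if isinstance(value, set) else value
            flat[key] = separator.join(str(item) for item in items)
        else:
            # Scalars (str, int, ...) pass through unchanged.
            flat[key] = value
    return flat


# Hypothetical metadata of the kind the extractors in this issue might produce.
example = {"entities": ["NRMA", "Sydney"], "page": 3, "title": "Home policy"}
flattened = stringify_list_metadata(example)
print(flattened)
```

Assuming each node's metadata is an ordinary dict, this could be applied per node (for example node.metadata = stringify_list_metadata(node.metadata)) after pipeline.run(...) and before building the index. Stores that do support list values could skip it and keep the lists for filtering.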
Bug Description
VectorStoreIndex has metadata type issue when using EntityExtractor as a pipeline transformer. EntityExtractor returns a list, however, VectorStoreIndex expects metadata values to be string/int.
Other transformers (e.g. QuestionsAnsweredExtractor, KeywordExtractor) return strings.
Version
0.10.10
Steps to Reproduce
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=OpenAIEmbedding(),
        ),
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
        EntityExtractor(prediction_threshold=0.5),
        KeywordExtractor(keywords=10, llm=llm),
    ]
)
nrma_home_nodes = pipeline.run(documents)
index = VectorStoreIndex(nrma_home_nodes, storage_context=storage_context)
Relevant Logs/Tracebacks