run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

EntityExtractor #13774

Open erlebach opened 5 months ago

erlebach commented 5 months ago

Question Validation

Question

Why isn't the EntityExtractor implemented inside metadata_extractors.py rather than in its current location: llama-index-integrations/extractors/llama-index-extractors-entity/llama_index/extractors? I am having issues importing the EntityExtractor using poetry.

logan-markewich commented 5 months ago

@erlebach Because it relies on rather heavy dependencies: span-marker and transformers.

You can install and import it

pip install llama-index-extractors-entity

from llama_index.extractors.entity import EntityExtractor
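Since the question mentions managing dependencies with poetry, the equivalent poetry command (a sketch, assuming a standard poetry-managed project) would be:

```shell
# Add the integration package to a poetry-managed project
# instead of pip-installing it into the environment directly.
poetry add llama-index-extractors-entity
```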

erlebach commented 5 months ago

Please explain the following error. Here is the code:

from llama_index.core.node_parser import SentenceSplitter
from llama_index.extractors.entity import EntityExtractor

from headers import (
    SimpleDirectoryReader,
    Ollama,
    Settings,
)

# Create an instance of Ollama with the specified parameters
llm = Ollama(model="phi3:latest", request_timeout=600.0, temperature=0.0)
Settings.llm = llm

reader = SimpleDirectoryReader('files')
documents = reader.load_data()
parser = SentenceSplitter(include_prev_next_rel=True)
nodes = parser.get_nodes_from_documents(documents)

entity_extractor = EntityExtractor(
    label_entities=True,
    device="cpu"
)

metadata_list = entity_extractor.extract(nodes) # ERROR
print(metadata_list)

and the error, which occurs on the line where entity_extractor.extract(nodes) is executed:

python sample_extractor_EntityExtractor.py
/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Extracting entities:   0%|                                                         | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/ch4/sample_extractor_EntityExtractor.py", line 37, in <module>
    metadata_list = entity_extractor.extract(nodes)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/llama_index/core/extractors/interface.py", line 96, in extract
    return asyncio_run(self.aextract(nodes))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/llama_index/core/async_utils.py", line 31, in asyncio_run
    return loop.run_until_complete(coro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/asyncio/base_events.py", line 684, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/llama_index/extractors/entity/base.py", line 136, in aextract
    spans = self._model.predict(words)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/span_marker/modeling.py", line 512, in predict
    output = self(**batch)
             ^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/span_marker/modeling.py", line 153, in forward
    outputs = self.encoder(
              ^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/transformers/models/bert/modeling_bert.py", line 1103, in forward
    extended_attention_mask = _prepare_4d_attention_mask_for_sdpa(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erlebach/src/2024/llama_index_gordon/Building-Data-Driven-Applications-with-LlamaIndex/.venv/lib/python3.12/site-packages/transformers/modeling_attn_mask_utils.py", line 439, in _prepare_4d_attention_mask_for_sdpa
    batch_size, key_value_length = mask.shape
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 2)

ghost commented 5 months ago

It appears that an update to the transformers library caused this issue. The version of transformers likely differs from the one used when llama-index-extractors-entity was released, as span_marker depends on transformers>=4.19.0.

I mitigated this issue by downgrading to transformers==4.40.2, as the problem occurs starting from version 4.41.0.

This error also occurred with llama-index-entity-example; I mitigated it the same way.
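Based on the observation above, a small stdlib-only helper (hypothetical, names are my own) can check the installed transformers version before running the extractor, treating anything at or above 4.41.0 as affected:

```python
from importlib.metadata import PackageNotFoundError, version


def transformers_is_safe(ver: str) -> bool:
    """Return True if this transformers version predates the
    attention-mask change that breaks span-marker (4.41.0+)."""
    major, minor = (int(part) for part in ver.split(".")[:2])
    return (major, minor) < (4, 41)


def check_installed() -> None:
    """Warn if the installed transformers version is known to fail."""
    try:
        ver = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
        return
    if not transformers_is_safe(ver):
        print(f"transformers {ver} is affected; pin transformers==4.40.2")
```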

zwei2016 commented 3 months ago

The problem still exists in transformers 4.43.3.

nathanw9722 commented 1 month ago

Still exists in transformers 4.44.2 as well. All metadata extractor samples seem to fail because of this issue.