run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U16')) -> None #1000

Closed: gisyrus closed this issue 1 year ago

gisyrus commented 1 year ago

Hi,

I am using llama-index version 0.5.3. When I try to query an index whose documents were generated with ElasticSearchReader, it raises the following error:

---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[65], line 3
      1 # from llama_index.indices.query.query_runner import base
      2 prompt = 'What is the description of the ioc_type?'
----> 3 response = index.query(prompt, use_async=True, mode="embedding")
      4 print(response)

File ~/Documents/gitlab-local/EmbeddingAI/env/lib/python3.8/site-packages/llama_index/indices/base.py:244, in BaseGPTIndex.query(self, query_str, mode, query_transform, use_async, **query_kwargs)
    230 query_config = QueryConfig(
    231     index_struct_type=self._index_struct.get_type(),
    232     query_mode=mode_enum,
    233     query_kwargs=query_kwargs,
    234 )
    235 query_runner = QueryRunner(
    236     index_struct=self._index_struct,
    237     service_context=self._service_context,
   (...)
    242     use_async=use_async,
    243 )
--> 244 return query_runner.query(query_str)

File ~/Documents/gitlab-local/EmbeddingAI/env/lib/python3.8/site-packages/llama_index/indices/query/query_runner.py:341, in QueryRunner.query(self, query_str_or_bundle, index_id, level)
    323 """Run query.
    324 
...
     45     return product / norm

File <__array_function__ internals>:200, in dot(*args, **kwargs)

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U16')) -> None

I used the following json as the raw data: example.json.zip

And the generated index: example_model.json.zip

I used GPTSimpleVectorIndex to generate the index.

bobbyng626 commented 1 year ago

A UFuncTypeError generally indicates a data-type mismatch in the input. The index used to answer the prompt may be malformed, or contain None values, which causes the recursive embedding lookup to keep receiving None. One likely cause is that the data loaded into the documents is configured incorrectly; check that ElasticSearchReader is loading the right fields with the right parameters. The following code works for me:

# Assumes `reader` is an ElasticSearchReader instance pointed at your cluster
# and index, and GPTSimpleVectorIndex is imported from llama_index (0.5.x API).

# Define the query criteria; this match_all query returns every item in the example data
query_dict = {'query': {'match_all': {}}}
# Load the Elasticsearch data into a document list
documents = reader.load_data(
    field="id", query=query_dict
)
# Build the index for querying; it can also be saved to disk for reuse
index = GPTSimpleVectorIndex(documents, chunk_size_limit=500)

# Query the prompt
prompt = 'What is the item id which first seen is on 2022-04-12 18:31:56 UTC'
response = index.query(prompt, use_async=True, mode="embedding")
print(response)

Each loaded document should look like the following, with embedding=None (embeddings are computed later, at index-build time):

Document(text='518944', doc_id='68e6a0dd-d913-42fa-bf1e-01d728903429', embedding=None, doc_hash='8c31a906be5a1910c2c76c09e09145fcf34184470d571f9837ccb8a75cb08052', extra_info={'id': '518944', 'ioc': '111.167.1.44:46171', 'threat_type': 'botnet_cc', 'threat_type_desc': 'Indicator that identifies a botnet command&control server (C&C)', 'ioc_type': 'url', 'ioc_type_desc': 'URL that is used for botnet Command&control (C&C)', 'malware': 'elf.mozi', 'malware_printable': 'Mozi', 'malware_alias': None, 'malware_malpedia': 'https://malpedia.caad.fkie.fraunhofer.de/details/elf.mozi', 'confidence_level': 100, 'first_seen': '2022-04-12 18:31:56 UTC', 'last_seen': None, 'reference': None, 'reporter': 'fish_illuminati', 'tags': ['elf', 'Mozi']})

During index construction you should also see a large number of tokens used for embedding, e.g. Total embedding token usage: 29133 tokens:

2023-04-12 17:09:18,474 P17208T20656 INFO <llama_index.token_counter.token_counter:token_counter.py/wrapper_logic L60> | > [build_index_from_nodes] Total LLM token usage: 0 tokens
2023-04-12 17:09:18,475 P17208T20656 INFO <llama_index.token_counter.token_counter:token_counter.py/wrapper_logic L63> | > [build_index_from_nodes] Total embedding token usage: 29133 tokens

These logs indicate that you have successfully called the OpenAI embedding API and generated the embedding index for your data, which is essential for querying. The output of my prompt looks like this:

What is the item id which first seen is on 2022-04-12 18:31:56 UTC
2023-04-12 17:22:35,995 P17208T20656 INFO <llama_index.token_counter.token_counter:token_counter.py/wrapper_logic L60> | > [query] Total LLM token usage: 251 tokens
2023-04-12 17:22:35,996 P17208T20656 INFO <llama_index.token_counter.token_counter:token_counter.py/wrapper_logic L63> | > [query] Total embedding token usage: 21 tokens

The item id is 518945.
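As an aside, if stored vectors do come back as strings, casting them to floats restores the similarity computation. A minimal NumPy sketch of the same product / norm step shown in the traceback (values here are made up for illustration):

```python
import numpy as np

# Hypothetical repair: cast string-typed vectors back to float before scoring.
a = np.array(["0.1", "0.2", "0.3"]).astype(np.float64)
b = np.array([0.3, 0.2, 0.1])

# The cosine-similarity formula from the traceback's helper: product / norm.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cosine), 4))
```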
dosubot[bot] commented 1 year ago

Hi, @gisyrus. I'm helping the LlamaIndex team manage their backlog and I wanted to let you know that we are marking this issue as stale.

Based on the information provided, it seems that you encountered a UFuncTypeError when querying an index using LlamaIndex version 0.5.3. User bobbyng626 suggested that the error may be caused by incorrectly configured data loading into the documents. They provided an example code snippet that worked for them and suggested checking that ElasticSearchReader loads the correct data.

Before we close this issue, we wanted to check with you if this issue is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LlamaIndex project.