jina-ai / serve

☁️ Build multimodal AI applications with cloud-native stack
https://jina.ai/serve
Apache License 2.0

Colab pdf_search_2.ipynb 'index_sentenizer' failed to start #5548

Closed themantalope closed 1 year ago

themantalope commented 1 year ago

Describe the bug: During execution of the Colab example notebook, I'm getting a cryptic error when trying to run the indexing flow.

Describe how you solve it: I am unable to solve it.


- jina 3.13.0
- docarray 0.20.1
- jcloud 0.1.6
- jina-hubble-sdk 0.29.0
- jina-proto 0.1.13
- protobuf 4.21.12
- proto-backend upb
- grpcio 1.47.2
- pyyaml 6.0
- python 3.8.16
- platform Linux
- platform-release 5.10.133+
- platform-version #1 SMP Fri Aug 26 08:44:51 UTC 2022
- architecture x86_64
- processor x86_64
- uid 2485378613260
- session-id 81299198-8129-11ed-9f4a-0242ac1c000c
- uptime 2022-12-21T12:17:56.796261
- ci-vendor (unset)
- internal False
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)
* JINA_LOCKS_ROOT (unset)
* JINA_K8S_ACCESS_MODES (unset)
* JINA_K8S_STORAGE_CLASS_NAME (unset)
* JINA_K8S_STORAGE_CAPACITY (unset)
* JINA_STREAMER_ARGS (unset)

JoanFM commented 1 year ago

Hello @themantalope ,

Can you please provide more context about the problem you are facing?

themantalope commented 1 year ago

@JoanFM

Thanks for following up. Initially there were no additional messages other than that index_sentenizer failed to start. I eventually saw some other warnings stating that spaCy was not installed. Running pip install -q spacy at the beginning of the notebook solved the issue.

themantalope commented 1 year ago

@JoanFM

This actually is still an issue. Even when explicitly running pip install spacy, I still get a warning when the flow is set up, stating: "ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy." Additionally, I am also (sometimes) getting an error regarding protobuf:

WARNI… JINA@108 Error getting the directory name from jinahub://PDFTableExtractor/latest.       [12/22/22 11:28:48]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          

🔐 You are not logged in to Jina AI. To log in, use jina auth login or set env variable JINA_AUTH_TOKEN.

WARNI… JINA@108 Error getting the directory name from jinahub://PDFSegmenter.                   [12/22/22 11:28:53]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          

WARNI… JINA@108 Error getting the directory name from jinahub://SpacySentencizer.               [12/22/22 11:29:20]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          

WARNI… JINA@108 Error getting the directory name from                                           [12/22/22 11:29:24]
       jinahub://ImagePreprocessor-skip-non-images. `--install-requirements` option is only                        
       valid when `uses` is a configuration file.                                                                  

WARNI… JINA@108 Error getting the directory name from                                           [12/22/22 11:29:25]
       jinahub://ImagePreprocessor-skip-non-images. `--install-requirements` option is only                        
       valid when `uses` is a configuration file.                                                                  

WARNI… JINA@108 Error getting the directory name from jinahub://CLIPEncoder/latest-gpu.                            
       `--install-requirements` option is only valid when `uses` is a configuration file.                          

WARNI… JINA@108 Error getting the directory name from jinahub://CLIPEncoder/latest-gpu.         [12/22/22 11:30:48]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          

WARNI… JINA@108 Error getting the directory name from jinahub://AnnLiteIndexer.                 [12/22/22 11:30:50]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          

ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.

Downloading: 100%
577M/577M [00:26<00:00, 24.2MB/s]

ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

/usr/local/lib/python3.8/dist-packages/pkg_resources/__init__.py in _dep_map(self)
   3015         try:
-> 3016             return self.__dep_map
   3017         except AttributeError:

28 frames

AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)

AttributeError: _pkg_info

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)

/usr/local/lib/python3.8/dist-packages/pkg_resources/__init__.py in _get(self, path)
   1609 
   1610     def _get(self, path):
-> 1611         with open(path, 'rb') as stream:
   1612             return stream.read()
   1613 

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.8/dist-packages/protobuf-3.19.6.dist-info/METADATA'
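
The final FileNotFoundError above points at a `protobuf-3.19.6.dist-info` directory that no longer contains its METADATA file, which can be left behind when a package is upgraded while the old version is still loaded. A minimal sketch for spotting such leftovers, assuming the standard `*.dist-info` layout; the helper name `find_broken_dist_info` is hypothetical and not part of Jina or pip:

```python
from pathlib import Path
import site


def find_broken_dist_info(site_dir):
    """Return names of *.dist-info directories under site_dir that lack a METADATA file."""
    broken = []
    for info_dir in sorted(Path(site_dir).glob("*.dist-info")):
        if info_dir.is_dir() and not (info_dir / "METADATA").is_file():
            broken.append(info_dir.name)
    return broken


# Scan every site-packages / dist-packages directory known to this interpreter.
for directory in site.getsitepackages():
    print(directory, find_broken_dist_info(directory))
```

If a stale directory is reported, reinstalling that package and restarting the runtime is one thing to try; this was not verified in this thread.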
JoanFM commented 1 year ago

Can you set the environment variable JINA_LOG_LEVEL to DEBUG, and share the exact cell that causes the error together with the exact traceback?
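
In a notebook, one way to do this is to set the variable from Python before the Flow is created. This is a minimal sketch, assuming Jina reads JINA_LOG_LEVEL from the process environment when the Flow and its executors start:

```python
import os

# Set this before constructing or starting the Flow, so that any
# executor processes started afterwards inherit the setting.
os.environ["JINA_LOG_LEVEL"] = "DEBUG"

print(os.environ["JINA_LOG_LEVEL"])
```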

themantalope commented 1 year ago

The flow takes a very long time to start up after turning the log level to debug. The process has been running for 10+ minutes, this is what I have so far:

WARNI… JINA@110763 Error getting the directory name from jinahub://PDFTableExtractor/latest.    [12/22/22 14:03:18]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@110763 Error getting the directory name from jinahub://PDFSegmenter.                                   
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@110763 Error getting the directory name from jinahub://SpacySentencizer.            [12/22/22 14:03:19]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@110763 Error getting the directory name from                                        [12/22/22 14:03:20]
       jinahub://ImagePreprocessor-skip-non-images. `--install-requirements` option is only                        
       valid when `uses` is a configuration file.                                                                  
WARNI… JINA@110763 Error getting the directory name from                                                           
       jinahub://ImagePreprocessor-skip-non-images. `--install-requirements` option is only                        
       valid when `uses` is a configuration file.                                                                  
WARNI… JINA@110763 Error getting the directory name from jinahub://CLIPEncoder/latest-gpu.                         
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@110763 Error getting the directory name from jinahub://CLIPEncoder/latest-gpu.      [12/22/22 14:03:26]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@110763 Error getting the directory name from jinahub://AnnLiteIndexer.                                 
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
  Waiting all_indexer... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/1 0:00:00
DEBUG  index_table_extractor/rep-0@110763 ready and listening                                   [12/22/22 14:03:29]
DEBUG  index_segmenter/rep-0@110763 ready and listening                                         [12/22/22 14:03:29]
DEBUG  index_tagger/rep-0@110763 ready and listening                                            [12/22/22 14:03:29]
DEBUG  index_sentencizer/rep-0@110763 waiting for ready or shutdown signal from runtime         [12/22/22 14:03:29]
DEBUG  index_sentencizer/rep-0@110763 shutdown is is already set. Runtime will end gracefully                      
       on its own                                                                                                  
DEBUG  index_sentencizer/rep-0@110763 terminating the runtime process                                              
DEBUG  index_tags_copier/rep-0@110763 ready and listening                                       [12/22/22 14:03:29]
DEBUG  index_sentencizer/rep-0@110763 terminated                                                                   
DEBUG  index_sentencizer/rep-0@110763 joining the process                                                          
DEBUG  index_sentencizer/rep-0@110763 successfully joined the process                                              
DEBUG  index_image_processor/rep-0@110763 ready and listening                                   [12/22/22 14:03:29]
DEBUG  search_image_processor/rep-0@110763 ready and listening                                  [12/22/22 14:03:29]
DEBUG  gateway/rep-0@110763 ready and listening                                                 [12/22/22 14:03:30]
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
DEBUG  index_encoder/rep-0@111643 <executor.CLIPEncoder object at 0x7fc6e1cc7f70> is            [12/22/22 14:03:54]
       successfully loaded!                                                                                        
DEBUG  index_encoder/rep-0@111643 start listening on 0.0.0.0:63551                                                 
DEBUG  index_encoder/rep-0@111643 run grpc server forever                                                          
DEBUG  index_encoder/rep-0@110763 ready and listening                                           [12/22/22 14:03:54]
DEBUG  search_encoder/rep-0@111648 <executor.CLIPEncoder object at 0x7fc6e1cb4fa0> is           [12/22/22 14:04:16]
       successfully loaded!                                                                                        
DEBUG  search_encoder/rep-0@111648 start listening on 0.0.0.0:55492                                                
DEBUG  search_encoder/rep-0@111648 run grpc server forever                                                         
DEBUG  search_encoder/rep-0@110763 ready and listening                                          [12/22/22 14:04:16]
WARNI… all_indexer/rep-0@110763 <jina.orchestrate.pods.Pod object at 0x7fc6fc399550> timeout    [12/22/22 14:13:29]
       after waiting for 600000ms, if your executor takes time to load, you may increase                           
       --timeout-ready                                                                                             
DEBUG  all_indexer/rep-0@110763 waiting for ready or shutdown signal from runtime                                  
DEBUG  all_indexer/rep-0@110763 Runtime was never started. Runtime will end gracefully on its                      
       own                                                                                                         
DEBUG  all_indexer/rep-0@110763 terminating the runtime process                                                    
DEBUG  all_indexer/rep-0@110763 runtime process properly terminated                                                
DEBUG  all_indexer/rep-0@110763 terminated                                                                         
DEBUG  all_indexer/rep-0@110763 waiting for ready or shutdown signal from runtime                                  
DEBUG  all_indexer/rep-0@110763 shutdown is is already set. Runtime will end gracefully on its                     
       own                                                                                                         
DEBUG  all_indexer/rep-0@110763 terminating the runtime process                                 [12/22/22 14:13:30]
DEBUG  all_indexer/rep-0@110763 runtime process properly terminated                                                
DEBUG  all_indexer/rep-0@110763 terminated                                                                         
DEBUG  all_indexer/rep-0@110763 joining the process                                                                
JoanFM commented 1 year ago

I think you may need more memory to run this example. Can you check the memory consumption while this is happening?

themantalope commented 1 year ago

On colab it's currently using 5GB out of 12 available.

The issue with runtime only occurred after turning on DEBUG logging.

JoanFM commented 1 year ago

Okay, it seems that the indexer took too long to start. Have you tried this more than once?

alexcg1 commented 1 year ago

It seems like SpacySentencizer may be causing issues too. I created the Executor aaaages ago, and haven't seen these kind of errors before. Here's a minimum (not) working example notebook - @JoanFM any ideas why it's failing?

themantalope commented 1 year ago

> Okay, it seems that the indexer took too long to start. Have you tried this more than once?

Yes. Also, to get all the dependencies to install on Colab properly, I have had to restart the runtime.

JoanFM commented 1 year ago

I believe the spacy version on which it depends is no longer compatible with Jina because of protobuf versions.

themantalope commented 1 year ago

Ok, that would make a lot of sense. I was getting protobuf errors; at one point an error stated that some component of the flow was looking for protobuf 3.18.something metadata but that protobuf > 4 was installed.
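
To see which versions are actually installed in the runtime, a short check with the standard library's `importlib.metadata` can help. This is only a sketch: the helper name `installed_versions` is hypothetical, and the package list is just the set of names mentioned in this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["protobuf", "jina", "docarray", "spacy", "tensorflow"]))
```

Running this before and after the pip install cells (and again after a runtime restart) shows whether protobuf changed underneath an already imported package.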

themantalope commented 1 year ago

@JoanFM any sentencizer available through Jina that could be swapped?

JoanFM commented 1 year ago

You can look in the Executor Hub in Jina AI Cloud; you may find some there.

themantalope commented 1 year ago

Ok, I switched the sentencizer for the Torch sentencizer.

I'm still having issues with the search code, though. Any help with this?

with flow:
  client = Client(port=flow.port)

  results = client.post(
      "/search",
      query_doc, 
      request_size=1,
      parameters={
          "filter": filter
      },
      show_progress=True, 
      target_executor="(search_*|all_*)"
      )

WARNI… JINA@6849 Error getting the directory name from jinahub://PDFSegmenter.                  [12/22/22 19:45:54]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@6849 Error getting the directory name from jinaai://jina-ai/Sentencizer:latest.                        
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@6849 Error getting the directory name from jinahub://TransformerTorchEncoder.                          
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@6849 Error getting the directory name from jinaai://jina-ai/AnnLiteIndexer:latest.                     
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
DeprecationWarning: 'index_traversal_paths' will be deprecated in the future, please use 'index_access_paths'. (raised from /usr/local/lib/python3.8/dist-packages/jina/serve/helper.py:73)
DeprecationWarning: 'search_traversal_paths' will be deprecated in the future, please use 'search_access_paths'. (raised from /usr/local/lib/python3.8/dist-packages/jina/serve/helper.py:73)
2022-12-22 19:45:55.767 | INFO     | annlite.index:restore:664 - restore Annlite from local
2022-12-22 19:45:55.771 | INFO     | annlite.index:_rebuild_index_from_local:771 - Rebuild the indexer from scratch
2022-12-22 19:45:59.060 | INFO     | annlite.index:_rebuild_index_from_local:788 - Load the model from /root/.cache/jina/AnnLiteIndexer/0/parameters-d67b9abb496ca1fd466b6d5378c78128
─────────────────────────────────────────── 🎉 Flow is ready to serve! ────────────────────────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│  ⛓     Protocol                    GRPC  │
│  🏠       Local           0.0.0.0:61118  │
│  🔒     Private       172.28.0.12:61118  │
│  🌍      Public    35.236.250.206:61118  │
╰──────────────────────────────────────────╯
⠋ Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00   0% ETA: -:--:--  
ERROR  all_indexer/rep-0@26547 ValueError('Empty ndarray. Did you forget to set                 [12/22/22 19:46:02]
       .embedding/.tensor value and now you are operating on it?')                                                 
        add "--quiet-error" to suppress the exception details                                                      
       Traceback (most recent call last):                                                                          
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/runtimes/worker/__init__.py",                     
       line 264, in process_data                                                                                   
           result = await self._request_handler.handle(                                                            
         File                                                                                                      
       "/usr/local/lib/python3.8/dist-packages/jina/serve/runtimes/worker/request_handling.py",                    
       line 425, in handle                                                                                         
           return_data = await self._executor.__acall__(                                                           
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py", line                      
       366, in __acall__                                                                                           
           return await self.__acall_endpoint__(req_endpoint, **kwargs)                                            
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py", line                      
       425, in __acall_endpoint__                                                                                  
           return await exec_func(                                                                                 
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py", line                      
       383, in exec_func                                                                                           
           return await get_or_reuse_loop().run_in_executor(                                                       
         File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run                                   
           result = self.fn(*self.args, **self.kwargs)                                                             
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/executors/decorators.py", line                    
       187, in arg_wrapper                                                                                         
           return fn(executor_instance, *args, **kwargs)                                                           
         File "/root/.cache/jina/hub-package/7yypg8qk/executor.py", line 113, in search                            
           docs.match(self._index, filter=parameters.get('filter', None), limit=limit)                             
         File "/usr/local/lib/python3.8/dist-packages/docarray/array/mixins/match.py", line 77,                    
       in match                                                                                                    
           match_docs = darray.find(                                                                               
         File "/usr/local/lib/python3.8/dist-packages/docarray/array/mixins/find.py", line 200,                    
       in find                                                                                                     
           n_rows, n_dim = ndarray.get_array_rows(_query)                                                          
         File "/usr/local/lib/python3.8/dist-packages/docarray/math/ndarray.py", line 197, in                      
       get_array_rows                                                                                              
           array_type, _ = get_array_type(array)                                                                   
         File "/usr/local/lib/python3.8/dist-packages/docarray/math/ndarray.py", line 138, in                      
       get_array_type                                                                                              
           raise ValueError(                                                                                       
       ValueError: Empty ndarray. Did you forget to set .embedding/.tensor value and now you                       
       are operating on it?                                                                                        
Exception in thread Thread-134:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/dist-packages/jina/helper.py", line 1315, in run
    self.result = asyncio.run(func(*args, **kwargs))
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.8/dist-packages/jina/clients/mixin.py", line 266, in _get_results
    async for resp in c._get_results(*args, **kwargs):
  File "/usr/local/lib/python3.8/dist-packages/jina/clients/base/grpc.py", line 220, in _get_results
    async for resp in self._stream_rpc(
  File "/usr/local/lib/python3.8/dist-packages/jina/clients/base/grpc.py", line 85, in _stream_rpc
    callback_exec(
  File "/usr/local/lib/python3.8/dist-packages/jina/clients/helper.py", line 81, in callback_exec
    raise BadServer(response.header)
jina.excepts.BadServer: request_id: "19b8d3339121444eb214e5af03d51f5f"
status {
  code: ERROR
  description: "ValueError(\'Empty ndarray. Did you forget to set .embedding/.tensor value and now you are operating on it?\')"
  exception {
    name: "ValueError"
    args: "Empty ndarray. Did you forget to set .embedding/.tensor value and now you are operating on it?"
    stacks: "Traceback (most recent call last):\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/runtimes/worker/__init__.py\", line 264, in process_data\n    result = await self._request_handler.handle(\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/runtimes/worker/request_handling.py\", line 425, in handle\n    return_data = await self._executor.__acall__(\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py\", line 366, in __acall__\n    return await self.__acall_endpoint__(req_endpoint, **kwargs)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py\", line 425, in __acall_endpoint__\n    return await exec_func(\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py\", line 383, in exec_func\n    return await get_or_reuse_loop().run_in_executor(\n"
    stacks: "  File \"/usr/lib/python3.8/concurrent/futures/thread.py\", line 57, in run\n    result = self.fn(*self.args, **self.kwargs)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/executors/decorators.py\", line 187, in arg_wrapper\n    return fn(executor_instance, *args, **kwargs)\n"
    stacks: "  File \"/root/.cache/jina/hub-package/7yypg8qk/executor.py\", line 113, in search\n    docs.match(self._index, filter=parameters.get(\'filter\', None), limit=limit)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/docarray/array/mixins/match.py\", line 77, in match\n    match_docs = darray.find(\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/docarray/array/mixins/find.py\", line 200, in find\n    n_rows, n_dim = ndarray.get_array_rows(_query)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/docarray/math/ndarray.py\", line 197, in get_array_rows\n    array_type, _ = get_array_type(array)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/docarray/math/ndarray.py\", line 138, in get_array_type\n    raise ValueError(\n"
    stacks: "ValueError: Empty ndarray. Did you forget to set .embedding/.tensor value and now you are operating on it?\n"
    executor: "AnnLiteIndexer"
  }
}
exec_endpoint: "/search"
target_executor: ""

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/jina/helper.py in run_async(func, *args, **kwargs)
   1329             try:
-> 1330                 return thread.result
   1331             except AttributeError:

AttributeError: '_RunThread' object has no attribute 'result'

During handling of the above exception, another exception occurred:

BadClient                                 Traceback (most recent call last)
2 frames
/usr/local/lib/python3.8/dist-packages/jina/helper.py in run_async(func, *args, **kwargs)
   1332                 from jina.excepts import BadClient
   1333 
-> 1334                 raise BadClient(
   1335                     'something wrong when running the eventloop, result can not be retrieved'
   1336                 )

BadClient: something wrong when running the eventloop, result can not be retrieved

EDIT: Query parameters:

search_term = "trilobite diagram"
query_doc = Document(text=search_term)
element_type = [
    "text", 
    "image" 
    "table"
    ]
filter = {
    "element_type": {
        "$in": element_type,
    }
}
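
One detail in the query parameters above: there is no comma after "image", and Python joins adjacent string literals, so the list ends up with two elements instead of three. Whether this contributes to the error is not established here; it only changes which element types the filter can match. A small self-contained illustration:

```python
# Adjacent string literals are concatenated by Python,
# so a missing comma silently merges two list items.
element_type = [
    "text",
    "image"
    "table"
]
print(element_type)  # ['text', 'imagetable']

# With the comma restored, all three values are present.
fixed_element_type = ["text", "image", "table"]
print(fixed_element_type)  # ['text', 'image', 'table']
```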
themantalope commented 1 year ago

I want to try to find a basic working example of text extraction from PDF (text within PDF, not text which needs to be OCR'd), index it with a neural encoder (something like the TransformerTorchEncoder) and search for it. I cannot get that working modifying the colab notebook or otherwise. Does anyone have something like that which works?

JoanFM commented 1 year ago

What exactly does the Flow you are using right now look like?

AnneYang720 commented 1 year ago

@themantalope I think the problem is that Colab has tensorflow pre-installed and its version is kind of old (2.9.2). This tensorflow version doesn't support the newer versions of protobuf. You can check this as a reference. I uninstalled tensorflow before running the Flow.

themantalope commented 1 year ago

@JoanFM @AnneYang720

Thanks for your help, I really appreciate it.

Please take a look at this colab notebook which is a derivative of the link that @AnneYang720 sent.

Here is the flow:

flow = (
    Flow()
    .add(
        uses="jinahub://PDFSegmenter",  # Extract images/text
        install_requirements=True,
        name="index_segmenter",
    )
    .add(
        uses="jinahub://SpacySentencizer",  # Sentencize long text into sentences
        uses_with={"traversal_paths": "@c"},
        install_requirements=True,
        name="index_sentencizer",
    )
    .add(
        uses="jinahub://TransformerTorchEncoder",
        uses_with={"traversal_paths": "@cc"},
        install_requirements=True,
        name="encoder",
    )
    .add(
        uses="jinaai://jina-ai/AnnLiteIndexer",  # Store vectors and metadata on disk
        uses_with={
            "index_traversal_paths": "@cc",
            "search_traversal_paths": "@cc",
            "columns": [("element_type", "str")],
            "n_dim": 512,
        },
        install_requirements=True,
        name="all_indexer",
    )
)

The indexing works fine. However, I get an error about empty embeddings during the query:


search_term = "trilobite diagram"
query_doc = Document(text=search_term)
element_type = [
    "text", 
    "image" 
    "table"
    ]
filter = {
    "element_type": {
        "$in": element_type,
    }
}

with flow:
  client = Client(port=flow.port)

  results = client.post(
      "/search",
      query_doc, 
      request_size=1,
      parameters={
          "filter": filter
      },
      show_progress=True, 
      target_executor="(search_*|all_*)"
      )
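
A possible cause, offered as an assumption rather than a confirmed diagnosis: if `target_executor` is treated as a regular expression matched against executor names, the pattern `(search_*|all_*)` does not match the executor named `encoder` in the Flow above. The query Document would then reach `all_indexer` without an embedding, which would be consistent with the "Empty ndarray" error. The sketch below uses only Python's `re` module to show which names the pattern matches; `matching_executors` is a hypothetical helper, not part of Jina.

```python
import re


def matching_executors(pattern, names):
    """Return the names that the pattern matches, anchored at the start of each name."""
    return [name for name in names if re.match(pattern, name)]


# Executor names taken from the Flow definition above.
names = ["index_segmenter", "index_sentencizer", "encoder", "all_indexer"]

# Note that in a regex, "search_*" means "search" followed by zero or
# more underscores; it is not a shell-style wildcard.
print(matching_executors("(search_*|all_*)", names))  # ['all_indexer']
print(matching_executors("(encoder|all_*)", names))   # ['encoder', 'all_indexer']
```

Under that assumption, renaming the encoder to something the pattern matches (for example `search_encoder`) or adjusting the pattern so it includes the encoder may let the query be embedded before it reaches the indexer. This was not tested in this thread.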

WARNI… JINA@192 Error getting the directory name from jinahub://PDFSegmenter.                   [12/23/22 14:28:47]
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@192 Error getting the directory name from jinahub://SpacySentencizer.                                  
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@192 Error getting the directory name from jinahub://TransformerTorchEncoder.                           
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
WARNI… JINA@192 Error getting the directory name from jinaai://jina-ai/AnnLiteIndexer.                             
       `--install-requirements` option is only valid when `uses` is a configuration file.                          
DeprecationWarning: 'index_traversal_paths' will be deprecated in the future, please use 'index_access_paths'. (raised from /usr/local/lib/python3.8/dist-packages/jina/serve/helper.py:73)
RuntimeWarning: coroutine 'Flow._wait_until_all_ready.<locals>._f' was never awaited (raised from 
/usr/local/lib/python3.8/dist-packages/jina/orchestrate/flow/base.py:1890)
DeprecationWarning: 'search_traversal_paths' will be deprecated in the future, please use 'search_access_paths'. (raised from /usr/local/lib/python3.8/dist-packages/jina/serve/helper.py:73)
UserWarning: Using "columns" as a List of Tuples will be deprecated soon. Please provide a Dictionary. (raised from /usr/local/lib/python3.8/dist-packages/docarray/array/storage/base/backend.py:98)
2022-12-23 14:28:48.178 | INFO     | annlite.index:restore:664 - restore Annlite from local
2022-12-23 14:28:48.201 | INFO     | annlite.index:_rebuild_index_from_local:771 - Rebuild the indexer from scratch
2022-12-23 14:28:48.796 | INFO     | annlite.index:_rebuild_index_from_local:788 - Load the model from /root/.cache/jina/AnnLiteIndexer/0/parameters-d67b9abb496ca1fd466b6d5378c78128
─────────────────────────────────────────── 🎉 Flow is ready to serve! ────────────────────────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│  ⛓       Protocol                  GRPC  │
│  🏠         Local         0.0.0.0:50955  │
│  🔒       Private     172.28.0.12:50955  │
╰──────────────────────────────────────────╯
⠋ Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00   0% ETA: -:--:--  
ERROR  all_indexer/rep-0@2573 ValueError('Empty ndarray. Did you forget to set                  [12/23/22 14:28:53]
       .embedding/.tensor value and now you are operating on it?')                                                 
        add "--quiet-error" to suppress the exception details                                                      
       Traceback (most recent call last):                                                                          
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/runtimes/worker/__init__.py",                     
       line 264, in process_data                                                                                   
           result = await self._request_handler.handle(                                                            
         File                                                                                                      
       "/usr/local/lib/python3.8/dist-packages/jina/serve/runtimes/worker/request_handling.py",                    
       line 425, in handle                                                                                         
           return_data = await self._executor.__acall__(                                                           
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py", line                      
       366, in __acall__                                                                                           
           return await self.__acall_endpoint__(req_endpoint, **kwargs)                                            
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py", line                      
       425, in __acall_endpoint__                                                                                  
           return await exec_func(                                                                                 
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py", line                      
       383, in exec_func                                                                                           
           return await get_or_reuse_loop().run_in_executor(                                                       
         File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run                                   
           result = self.fn(*self.args, **self.kwargs)                                                             
         File "/usr/local/lib/python3.8/dist-packages/jina/serve/executors/decorators.py", line                    
       187, in arg_wrapper                                                                                         
           return fn(executor_instance, *args, **kwargs)                                                           
         File "/root/.cache/jina/hub-package/7yypg8qk/executor.py", line 113, in search                            
           docs.match(self._index, filter=parameters.get('filter', None), limit=limit)                             
         File "/usr/local/lib/python3.8/dist-packages/docarray/array/mixins/match.py", line 77,                    
       in match                                                                                                    
           match_docs = darray.find(                                                                               
         File "/usr/local/lib/python3.8/dist-packages/docarray/array/mixins/find.py", line 200,                    
       in find                                                                                                     
           n_rows, n_dim = ndarray.get_array_rows(_query)                                                          
         File "/usr/local/lib/python3.8/dist-packages/docarray/math/ndarray.py", line 197, in                      
       get_array_rows                                                                                              
           array_type, _ = get_array_type(array)                                                                   
         File "/usr/local/lib/python3.8/dist-packages/docarray/math/ndarray.py", line 138, in                      
       get_array_type                                                                                              
           raise ValueError(                                                                                       
       ValueError: Empty ndarray. Did you forget to set .embedding/.tensor value and now you                       
       are operating on it?                                                                                        
RuntimeWarning: coroutine 'Flow._wait_until_all_ready.<locals>._async_wait_ready' was never awaited (raised from /usr/local/lib/python3.8/dist-packages/jina/orchestrate/flow/base.py:1782)
Exception in thread Thread-56:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/dist-packages/jina/helper.py", line 1315, in run
    self.result = asyncio.run(func(*args, **kwargs))
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.8/dist-packages/jina/clients/mixin.py", line 266, in _get_results
    async for resp in c._get_results(*args, **kwargs):
  File "/usr/local/lib/python3.8/dist-packages/jina/clients/base/grpc.py", line 220, in _get_results
    async for resp in self._stream_rpc(
  File "/usr/local/lib/python3.8/dist-packages/jina/clients/base/grpc.py", line 85, in _stream_rpc
    callback_exec(
  File "/usr/local/lib/python3.8/dist-packages/jina/clients/helper.py", line 81, in callback_exec
    raise BadServer(response.header)
jina.excepts.BadServer: request_id: "5ed3dcd36a8c4ff094d2766e9cb44857"
status {
  code: ERROR
  description: "ValueError(\'Empty ndarray. Did you forget to set .embedding/.tensor value and now you are operating on it?\')"
  exception {
    name: "ValueError"
    args: "Empty ndarray. Did you forget to set .embedding/.tensor value and now you are operating on it?"
    stacks: "Traceback (most recent call last):\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/runtimes/worker/__init__.py\", line 264, in process_data\n    result = await self._request_handler.handle(\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/runtimes/worker/request_handling.py\", line 425, in handle\n    return_data = await self._executor.__acall__(\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py\", line 366, in __acall__\n    return await self.__acall_endpoint__(req_endpoint, **kwargs)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py\", line 425, in __acall_endpoint__\n    return await exec_func(\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/executors/__init__.py\", line 383, in exec_func\n    return await get_or_reuse_loop().run_in_executor(\n"
    stacks: "  File \"/usr/lib/python3.8/concurrent/futures/thread.py\", line 57, in run\n    result = self.fn(*self.args, **self.kwargs)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/jina/serve/executors/decorators.py\", line 187, in arg_wrapper\n    return fn(executor_instance, *args, **kwargs)\n"
    stacks: "  File \"/root/.cache/jina/hub-package/7yypg8qk/executor.py\", line 113, in search\n    docs.match(self._index, filter=parameters.get(\'filter\', None), limit=limit)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/docarray/array/mixins/match.py\", line 77, in match\n    match_docs = darray.find(\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/docarray/array/mixins/find.py\", line 200, in find\n    n_rows, n_dim = ndarray.get_array_rows(_query)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/docarray/math/ndarray.py\", line 197, in get_array_rows\n    array_type, _ = get_array_type(array)\n"
    stacks: "  File \"/usr/local/lib/python3.8/dist-packages/docarray/math/ndarray.py\", line 138, in get_array_type\n    raise ValueError(\n"
    stacks: "ValueError: Empty ndarray. Did you forget to set .embedding/.tensor value and now you are operating on it?\n"
    executor: "AnnLiteIndexer"
  }
}
exec_endpoint: "/search"
target_executor: ""

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/jina/helper.py in run_async(func, *args, **kwargs)
   1329             try:
-> 1330                 return thread.result
   1331             except AttributeError:

AttributeError: '_RunThread' object has no attribute 'result'

During handling of the above exception, another exception occurred:

BadClient                                 Traceback (most recent call last)
2 frames
/usr/local/lib/python3.8/dist-packages/jina/helper.py in run_async(func, *args, **kwargs)
   1332                 from jina.excepts import BadClient
   1333 
-> 1334                 raise BadClient(
   1335                     'something wrong when running the eventloop, result can not be retrieved'
   1336                 )

BadClient: something wrong when running the eventloop, result can not be retrieved

Based on other examples, it seems like this should work. Googling the error didn't make it clear to me how to fix it, or why the embedding tensor isn't being set at query time.
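The ValueError in the traceback comes from docarray's match step, which found no embedding on the query. One way to narrow this down is to check which documents lack an embedding before they reach the indexer. This is a minimal sketch, assuming documents expose an `.embedding` attribute that is `None` when unset. `find_unembedded` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for real documents:

```python
from types import SimpleNamespace


def find_unembedded(docs):
    """Return the ids of documents whose embedding was never set."""
    return [doc.id for doc in docs if getattr(doc, "embedding", None) is None]


# Stand-ins for real documents (hypothetical data, not taken from the notebook).
docs = [
    SimpleNamespace(id="query", text="trilobite diagram", embedding=None),
    SimpleNamespace(id="sentence-1", text="some sentence", embedding=[0.1, 0.2, 0.3]),
]

missing = find_unembedded(docs)
print(missing)
```

If the query shows up in that list, the encoder did not process it, which points at the Flow routing or the traversal paths rather than at the indexer.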

JoanFM commented 1 year ago

The problem is that your Encoder does not match the target_executor parameter that you pass. Also, you may need to adapt the access_paths parameter.

Maybe you would prefer to have a separate Flow for your search.
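To illustrate the target_executor point, here is a rough sketch of how a pattern selects executors by name. It assumes target_executor is treated as a regular expression and matched against each executor name with `re.match`. `executors_handling` is a hypothetical helper, and `search_encoder` is a made-up name used only for contrast:

```python
import re


def executors_handling(target_executor, executor_names):
    """Return the executor names that a target_executor pattern would select."""
    return [name for name in executor_names if re.match(target_executor, name)]


names = [
    "index_segmenter",
    "index_sentencizer",
    "index_encoder",
    "all_indexer",
    "search_encoder",  # hypothetical name, not from the demo
]

index_targets = executors_handling("(index_*|all_*)", names)
search_targets = executors_handling("(search_*|all_*)", names)
print(index_targets)
print(search_targets)
```

Under that assumption, a request whose pattern does not match the encoder's name skips the encoder entirely, so the query reaches the indexer without an embedding.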

alexcg1 commented 1 year ago

Yeah, I'd suggest separate Flows for index and search. This example was mostly an experimental approach by me, using an older version of Jina. There are so many ways this could go wrong.

Probably easier to tear it apart and use the bits to start from scratch tbh. Consider it a fun Christmas experience :) (and you thought untangling Christmas tree lights was frustrating)

alexcg1 commented 1 year ago

Plus, I'm no longer on this project, and maintaining it several months later (after so long away from it) is tough.

Once only God and myself knew how my code worked. Now only God knows.

AnneYang720 commented 1 year ago

For anyone interested, here is a demo of how to use the sentencizer.

If you are using Colab, you need to uninstall the pre-installed tensorflow and install jina:

!pip uninstall -y tensorflow
!pip install jina

Create the flow

from jina import Client, Flow

flow = (
    Flow()
    .add(
        uses="jinahub://PDFSegmenter",  # Extract images/text
        install_requirements=True,
        name="index_segmenter",
    )
    .add(
        uses="jinahub://SpacySentencizer",  # Sentencize long text into sentences
        uses_with={"traversal_paths": "@c"},
        install_requirements=True,
        name="index_sentencizer",
    )
    .add(
        uses='jinahub://TransformerTorchEncoder',
        uses_with={"traversal_paths": "@cc"},
        install_requirements=True,
        name="index_encoder",
    )
    .add(
        uses="jinaai://jina-ai/AnnLiteIndexer",
        uses_with={
            "index_traversal_paths": "@cc",
            "search_traversal_paths": "@cc",
            "columns": {"element_type": "str"},
            "n_dim": 768,
        },
        install_requirements=True,
        name="all_indexer",
    )
)
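The traversal paths in this Flow pick a nesting level: "@r" is the root document, "@c" its chunks, and "@cc" the chunks of those chunks. The sketch below mimics that selection on plain dicts so the levels are easy to see. `select_by_path` is a hypothetical stand-in, not the docarray implementation, and it only understands "r" and "c" steps:

```python
def select_by_path(docs, path):
    """Select nested documents for a simplified traversal path such as '@cc'."""
    if not path.startswith("@"):
        raise ValueError(f"expected a path starting with '@', got {path!r}")
    current = list(docs)
    for step in path[1:]:
        if step == "r":
            continue  # root level: keep the current documents
        if step == "c":
            # Descend one level into the chunks of every current document.
            current = [chunk for doc in current for chunk in doc.get("chunks", [])]
        else:
            raise ValueError(f"unsupported step {step!r} in this sketch")
    return current


# Hypothetical structure: a PDF, one text segment, two sentences.
pdf = {
    "id": "pdf",
    "chunks": [
        {"id": "segment-text", "chunks": [{"id": "sentence-1"}, {"id": "sentence-2"}]}
    ],
}
plain_query = {"id": "query", "text": "trilobite diagram"}

print([d["id"] for d in select_by_path([pdf], "@c")])
print([d["id"] for d in select_by_path([pdf], "@cc")])
print(select_by_path([plain_query], "@cc"))
```

In this sketch the segmenter output sits at "@c" and the sentences at "@cc", which is why the encoder and indexer above are pointed at "@cc". A document with no chunks yields nothing at that level.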

Prepare and index docs

import os

if not os.path.isdir("data"):
    !wget -q -N --output-document data.zip https://github.com/jina-ai/workshops/blob/main/notebooks/pdf_search/part_2_images_and_text/data.zip?raw=true
    !unzip -n data.zip
    !rm -f data.zip

from docarray import DocumentArray, Document

docs = DocumentArray.from_files("data/*.pdf")
for doc in docs:
    doc.load_uri_to_blob()

with flow:
    client = Client(port=flow.port)
    docs = client.post(
        "/index",
        docs,
        request_size=1,
        show_progress=True,
        target_executor="(index_*|all_*)",
    )

Search

search_term = "trilobite diagram"
query_doc = Document(text=search_term)
element_type = ["text", "image", "table"]
filter = {
    "element_type": {
        "$in": element_type,
    }
}

search_flow = (
    Flow()
    .add(
        uses='jinahub://TransformerTorchEncoder',
        install_requirements=True,
        name="index_encoder",
    )
    .add(
        uses="jinaai://jina-ai/AnnLiteIndexer",
        uses_with={
            "index_traversal_paths": "@cc",
            "search_traversal_paths": "@cc",
            "columns": {"element_type": "str"},
            "n_dim": 768,
        },
        install_requirements=True,
        name="all_indexer",
    )
)

with search_flow:
    client = Client(port=search_flow.port)

    results = client.post(
        "/search",
        query_doc,
        request_size=1,
        show_progress=True,
    )