I notice that the following warning appears quite often in your log:
splitter@8961[W]:doc 42 gives no chunk
As it suggests, this doc contains no chunks. What does that mean for the whole pipeline? Say your batch_size=2; then each of your requests carries two documents. If you happen to get two empty-chunk docs in a row, the encoder has nothing to encode: it receives empty chunks and returns None. The error ValueError('need at least one array to stack') is then no surprise, as no embedding is generated for this batch.
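For context, that ValueError is raised by NumPy itself whenever it is asked to stack an empty list of arrays; a standalone reproduction, independent of Jina:

```python
import numpy as np

# A batch in which every doc gave no chunk leaves the encoder with an
# empty list of embeddings, and stacking it fails:
chunk_embeddings = []

try:
    np.stack(chunk_embeddings)
except ValueError as e:
    print(e)  # -> need at least one array to stack
```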
Quick fix suggested for you, for now: increase max_sent_length and batch_size to make it less likely to encounter this problem. And yes, a real fix is required on our side to ensure the workflow is robust regardless of empty batches.
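For illustration, that "real fix" amounts to a guard of roughly this shape; this is a minimal plain-NumPy sketch, and encode_and_stack and the encode callable are hypothetical names, not Jina APIs:

```python
import numpy as np

def encode_and_stack(chunks, encode):
    """Hypothetical defensive wrapper: tolerate an all-empty batch
    instead of letting np.stack fail downstream."""
    if not chunks:
        return None  # downstream must treat None as "skip this batch"
    return np.stack([encode(c) for c in chunks])
```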
Thanks @hanxiao for the tips.
I tried with different parameters. My crafter has min_sent_len: 0 and max_sent_len: 256, and I increased the batch_size during indexing up to 64, but the process still seems to get stuck.
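For reference, a plain-Python sketch of the kind of length filter those parameters suggest, assuming the crafter keeps sentences whose length falls within the configured bounds (this is not Jina's actual implementation):

```python
def keep_sentence(sent: str, min_sent_len: int = 0, max_sent_len: int = 256) -> bool:
    """Assumed keep-filter: with these bounds, only sentences longer
    than 256 characters would be dropped."""
    return min_sent_len <= len(sent) <= max_sent_len

assert keep_sentence("a short sentence")  # within bounds, kept
assert not keep_sentence("x" * 300)       # longer than 256, dropped
```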
Here are some additional logs, just in case they're useful:
```
$ python app.py -t index -n 500
Flow@32263[S]:successfully built Flow from a yaml config
Sentencizer@32271[I]:post initiating, this may take some time...
Sentencizer@32271[I]:post initiating, this may take some time takes 0.001 secs
Sentencizer@32271[S]:successfully built Sentencizer from a yaml config
splitter@32271[I]:setting up sockets...
splitter@32271[I]:input tcp://0.0.0.0:40937 (SUB_CONNECT) output tcp://0.0.0.0:37997 (PUSH_CONNECT) control over tcp://0.0.0.0:41927 (PAIR_BIND)
splitter@32271[S]:ready and listening
TransformerTorc@32274[I]:post initiating, this may take some time...
TransformerTorc@32274[I]:post initiating, this may take some time takes 2.578 secs
TransformerTorc@32274[S]:successfully built TransformerTorchEncoder from a yaml config
encoder@32274[I]:setting up sockets...
encoder@32274[I]:input tcp://0.0.0.0:37997 (PULL_BIND) output tcp://0.0.0.0:53499 (PUSH_CONNECT) control over tcp://0.0.0.0:37609 (PAIR_BIND)
encoder@32274[S]:ready and listening
NumpyIndexer@32279[I]:post initiating, this may take some time...
NumpyIndexer@32279[I]:post initiating, this may take some time takes 0.001 secs
NumpyIndexer@32279[S]:successfully built NumpyIndexer from a yaml config
BasePbIndexer@32279[I]:post initiating, this may take some time...
BasePbIndexer@32279[I]:post initiating, this may take some time takes 0.000 secs
BasePbIndexer@32279[S]:successfully built BasePbIndexer from a yaml config
ChunkIndexer@32279[I]:post initiating, this may take some time...
ChunkIndexer@32279[I]:post initiating, this may take some time takes 0.000 secs
ChunkIndexer@32279[S]:successfully built ChunkIndexer from a yaml config
NumpyIndexer@32279[W]:you can not query from <jina.executors.indexers.vector.numpy.NumpyIndexer object at 0x7f8142dd7dd8> as its "query_handler" is not set. If you are indexing data then that is fine, just means you can not do querying-while-indexing.If you are querying data then the index file must be broken.
chunk_indexer@32279[I]:setting up sockets...
chunk_indexer@32279[I]:input tcp://0.0.0.0:53499 (PULL_BIND) output tcp://0.0.0.0:45679 (PUSH_CONNECT) control over tcp://0.0.0.0:58605 (PAIR_BIND)
chunk_indexer@32279[S]:ready and listening
DocPbIndexer@32282[I]:post initiating, this may take some time...
DocPbIndexer@32282[I]:post initiating, this may take some time takes 0.000 secs
DocPbIndexer@32282[S]:successfully built DocPbIndexer from a yaml config
doc_indexer@32282[I]:setting up sockets...
doc_indexer@32282[I]:input tcp://0.0.0.0:40937 (SUB_CONNECT) output tcp://0.0.0.0:45679 (PUSH_CONNECT) control over tcp://0.0.0.0:56505 (PAIR_BIND)
doc_indexer@32282[S]:ready and listening
BaseExecutor@32285[I]:post initiating, this may take some time...
BaseExecutor@32285[I]:post initiating, this may take some time takes 0.000 secs
BaseExecutor@32285[S]:successfully built BaseExecutor from a yaml config
join_all@32285[I]:setting up sockets...
join_all@32285[I]:input tcp://0.0.0.0:45679 (PULL_BIND) output tcp://0.0.0.0:54945 (PUSH_BIND) control over tcp://0.0.0.0:46895 (PAIR_BIND)
join_all@32285[S]:ready and listening
BaseExecutor@32263[I]:post initiating, this may take some time...
BaseExecutor@32263[I]:post initiating, this may take some time takes 0.001 secs
GatewayPea@32263[S]:gateway is listening at: 0.0.0.0:47853
Flow@32263[I]:6 Pods (i.e. 6 Peas) are running in this Flow
Flow@32263[S]:flow is now ready for use, current build_level is GRAPH
PyClient@32263[S]:connected to the gateway at 0.0.0.0:47853!
index [= ] 📃 0 ⏱️ 0.0s 🐎 0.0/s 0 batch
gateway@32263[I]:setting up sockets...
gateway@32263[I]:input tcp://0.0.0.0:54945 (PULL_CONNECT) output tcp://0.0.0.0:40937 (PUB_BIND) control over ipc:///tmp/tmp16sxf8kk (PAIR_BIND)
gateway@32263[I]:prefetching 50 requests...
gateway@32263[W]:if this takes too long, you may want to take smaller "--prefetch" or ask client to reduce "--batch-size"
gateway@32263[I]:prefetching 50 requests takes 0.007 secs
gateway@32263[I]:send: 0 recv: 0 pending: 0
```
And after cancelling the script:
```
^C [183.194 secs]
✅ done in ⏱ 183.2s 🐎 0.0/s
chunk_indexer@32279[W]:user cancel the process
doc_indexer@32282[W]:user cancel the process
splitter@32271[W]:user cancel the process
join_all@32285[W]:user cancel the process
NumpyIndexer@32279[I]:no update since 2020-05-07 06:47:01, will not save. If you really want to save it, call "touch()" before "save()" to force saving
DocPbIndexer@32282[I]:no update since 2020-05-07 06:47:01, will not save. If you really want to save it, call "touch()" before "save()" to force saving
PyClient@32263[W]:user cancel the process
BasePbIndexer@32279[I]:no update since 2020-05-07 06:47:01, will not save. If you really want to save it, call "touch()" before "save()" to force saving
splitter@32271[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
doc_indexer@32282[I]:executor says there is nothing to save
ChunkIndexer@32279[I]:no update since 2020-05-07 06:47:01, will not save. If you really want to save it, call "touch()" before "save()" to force saving
chunk_indexer@32279[I]:dumped changes to the executor, 183s since last the save
doc_indexer@32282[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
chunk_indexer@32279[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
encoder@32274[W]:user cancel the process
encoder@32274[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
NumpyIndexer@32279[W]:you can not query from <jina.executors.indexers.vector.numpy.NumpyIndexer object at 0x7f8142dd7dd8> as its "query_handler" is not set. If you are indexing data then that is fine, just means you can not do querying-while-indexing.If you are querying data then the index file must be broken.
join_all@32285[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
PyClient@32263[S]:terminated
splitter@32271[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
splitter@32271[S]:terminated
GatewayPea@32263[S]:terminated
chunk_indexer@32279[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
join_all@32285[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
join_all@32285[S]:terminated
chunk_indexer@32279[S]:terminated
encoder@32274[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
doc_indexer@32282[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
encoder@32274[S]:terminated
doc_indexer@32282[S]:terminated
Flow@32263[S]:flow is closed and all resources should be released already, current build level is EMPTY
done
```
Describe your problem
I'm trying to reproduce the BERT-based Semantic Search Engine with a different collection. Unlike the SouthPark example, my corpus is made of short text documents, each a couple of paragraphs long. I preprocessed my collection to segment the sentences using spaCy, and generated a single two-column CSV file with the following structure:
doc_id, text
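For completeness, a minimal sketch of how such a file could be iterated when feeding documents to indexing; corpus.csv and input_docs are hypothetical names, not code from the example:

```python
import csv

def input_docs(path: str = 'corpus.csv'):
    """Yield the text column of the two-column CSV described above."""
    with open(path, newline='') as f:
        for doc_id, text in csv.reader(f):
            yield text
```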
What is your guess?
Whenever I try to index this collection, the process gets stuck after a ValueError: need at least one array to stack error:
```
$ python app.py -t index -n 100
Flow@8953[S]:successfully built Flow from a yaml config
Sentencizer@8961[I]:post initiating, this may take some time...
Sentencizer@8961[I]:post initiating, this may take some time takes 0.001 secs
Sentencizer@8961[S]:successfully built Sentencizer from a yaml config
splitter@8961[I]:setting up sockets...
splitter@8961[I]:input tcp://0.0.0.0:56579 (SUB_CONNECT) output tcp://0.0.0.0:39789 (PUSH_CONNECT) control over tcp://0.0.0.0:42053 (PAIR_BIND)
splitter@8961[S]:ready and listening
TransformerTorc@8964[I]:post initiating, this may take some time...
TransformerTorc@8964[I]:post initiating, this may take some time takes 2.714 secs
TransformerTorc@8964[S]:successfully built TransformerTorchEncoder from a yaml config
encoder@8964[I]:setting up sockets...
encoder@8964[I]:input tcp://0.0.0.0:39789 (PULL_BIND) output tcp://0.0.0.0:51541 (PUSH_CONNECT) control over tcp://0.0.0.0:35377 (PAIR_BIND)
encoder@8964[S]:ready and listening
NumpyIndexer@8969[I]:post initiating, this may take some time...
NumpyIndexer@8969[I]:post initiating, this may take some time takes 0.001 secs
NumpyIndexer@8969[S]:successfully built NumpyIndexer from a yaml config
BasePbIndexer@8969[I]:post initiating, this may take some time...
BasePbIndexer@8969[I]:post initiating, this may take some time takes 0.000 secs
BasePbIndexer@8969[S]:successfully built BasePbIndexer from a yaml config
ChunkIndexer@8969[I]:post initiating, this may take some time...
ChunkIndexer@8969[I]:post initiating, this may take some time takes 0.000 secs
ChunkIndexer@8969[S]:successfully built ChunkIndexer from a yaml config
NumpyIndexer@8969[W]:you can not query from
```
When I press ctrl-c, the logs continue:
```
^C [1486.595 secs]
✅ done in ⏱ 1486.6s 🐎 0.0/s
chunk_indexer@8969[W]:user cancel the process
doc_indexer@8972[W]:user cancel the process
join_all@8975[W]:user cancel the process
DocPbIndexer@8972[I]:no update since 2020-05-06 13:41:36, will not save. If you really want to save it, call "touch()" before "save()" to force saving
doc_indexer@8972[I]:executor says there is nothing to save
PyClient@8953[W]:user cancel the process
doc_indexer@8972[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
join_all@8975[I]:#sent: 0 #recv: 3 sent_size: 0 Bytes recv_size: 1.7 KB
splitter@8961[W]:user cancel the process
splitter@8961[I]:#sent: 50 #recv: 50 sent_size: 26.9 KB recv_size: 20.9 KB
NumpyIndexer@8969[S]:artifacts of this executor (vecidx) is persisted to /home/aiteam/projects/jina-examples/test/sbnpsago/chunk_indexer-0/vecidx.bin
BasePbIndexer@8969[S]:artifacts of this executor (chunkidx) is persisted to /home/aiteam/projects/jina-examples/test/sbnpsago/chunk_indexer-0/chunkidx.bin
PyClient@8953[S]:terminated
join_all@8975[I]:#sent: 0 #recv: 3 sent_size: 0 Bytes recv_size: 1.7 KB
ChunkIndexer@8969[I]:no update since 2020-05-06 13:41:36, will not save. If you really want to save it, call "touch()" before "save()" to force saving
chunk_indexer@8969[I]:dumped changes to the executor, 1487s since last the save
join_all@8975[S]:terminated
chunk_indexer@8969[I]:#sent: 3 #recv: 3 sent_size: 1.7 KB recv_size: 10.9 KB
splitter@8961[I]:#sent: 50 #recv: 50 sent_size: 26.9 KB recv_size: 20.9 KB
splitter@8961[S]:terminated
GatewayPea@8953[S]:terminated
doc_indexer@8972[I]:#sent: 0 #recv: 0 sent_size: 0 Bytes recv_size: 0 Bytes
doc_indexer@8972[S]:terminated
```
Environment
The example with the SouthPark documents works. Any idea what's going on? Thanks in advance.