Closed jmferrete closed 4 years ago
Hello Jose Manuel,
Thank you for opening the issue and happy to see your interest in jina.
Taking a look to your configuration I have some comments although I am not sure they are the root cause of your problem (Some of them are just recommendations). It would be good that you can share some more detailed logs so that we can take a closer look at them.
In your CustomModel, you are doing a hub.load()
call inside the encode
method, which should be done at _post_init() step. Otherwise u are loading the model for every batch (I guess it is somehow bypassed and may not have huge impact but I would recommend not to call the load so often).
A batch size of 1000 seems like a really big batch size that could affect the memory required by the query encoder
to work.
Another suggestion is that you can avoid having 2 indexers (index-chunk) and (index-doc) and keep only one (the CompoundIndexer). In your case, you are not breaking the documents (text) in different chunks. Therefore u can keep a single indexer
(the index-chunk used now (Although you can name it index-doc
since in your case no chunking
is involved)).
(For future work)When your dataset will grow, I would recommend trying different indexers different than NumpyIndexer
since it performs an exhaustive search that does not scale well for large amounts of data (see FaissIndexer, AnnoyIndexer ...)
These are just comments that may help you with your problem but I would also ask you to provide some more details of the logs.
I hope you find it helpful!
Thanks a lot Joan for your fast response.
We will apply with your recommendations and will back with more details and logs.
I've applied your recommendations to the flows. Here you can find the new gzip package with all the updated code.
But now it was unable to use it with the full dataset, due to we are experiencing an issue at indexing point. Basically all the index files (vec.gz files) are inconsistent after finishing the index process. And we are receiving these errors when loading them at query time:
vector_indexer@8184[I]:loading index from ./data/doc_indexer-1/vec.gz...
vector_indexer@8184[E]:./data/doc_indexer-1/vec.gz is broken/incomplete, perhaps forgot to ".close()" in the last usage?
vector_indexer@8184[C]:unknown exception: EOFError('Compressed file ended before the end-of-stream marker was reached')
Traceback (most recent call last):
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pea.py", line 332, in run
self.loop_body()
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pea.py", line 309, in loop_body
self.load_executor()
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pea.py", line 192, in load_executor
self.executor = BaseExecutor.load_config(self.args.uses if valid_local_config_source(self.args.uses) else self.args.uses_internal,
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/__init__.py", line 416, in load_config
return yaml.load(tmp_s)
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/main.py", line 343, in load
return constructor.get_single_data()
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 113, in get_single_data
return self.construct_document(node)
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 118, in construct_document
data = self.construct_object(node)
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 146, in construct_object
data = self.construct_non_recursive_object(node)
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 181, in construct_non_recursive_object
data = constructor(self, node)
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/compound.py", line 351, in from_yaml
obj.components = lambda: data['components']
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/compound.py", line 258, in components
self._set_routes()
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/compound.py", line 308, in _set_routes
for m, _ in inspect.getmembers(c, predicate=inspect.ismethod):
File "/home/ec2-user/.pyenv/versions/3.8.5/lib/python3.8/inspect.py", line 350, in getmembers
value = getattr(object, key)
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/indexers/__init__.py", line 109, in query_handler
self._query_handler = self.get_query_handler()
File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/indexers/keyvalue/proto.py", line 99, in get_query_handler
tmp = fp.read()
File "/home/ec2-user/.pyenv/versions/3.8.5/lib/python3.8/gzip.py", line 292, in read
return self._buffer.read(size)
File "/home/ec2-user/.pyenv/versions/3.8.5/lib/python3.8/gzip.py", line 498, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
This is part of the output when the index process finishes:
vector_indexer@4841[I]:#sent: 597797 #recv: 597797 sent_size: 5.3 GB recv_size: 5.3 GB
vector_indexer@4841[I]:no update since 2020-08-25 07:09:45, will not save. If you really want to save it, call "touch()" before "save()" to force saving
vector_indexer@4841[I]:executor says there is nothing to save
vector_indexer@4841[I]:#sent: 597797 #recv: 597797 sent_size: 5.3 GB recv_size: 5.3 GB
vector_indexer@4841[S]:terminated
vector_indexer@4838[I]:received "control" from ctl▸⚐
vector_indexer@4838[I]:RequestLoopEnd() causes the breaking from the event loop
vector_indexer@4838[I]:#sent: 597797 #recv: 1195603 sent_size: 251.6 GB recv_size: 251.9 GB
vector_indexer@4838[I]:no update since 2020-08-25 07:09:45, will not save. If you really want to save it, call "touch()" before "save()" to force saving
vector_indexer@4838[I]:executor says there is nothing to save
vector_indexer@4838[I]:#sent: 597797 #recv: 1195603 sent_size: 251.6 GB recv_size: 251.9 GB
vector_indexer@4838[S]:terminated
encoder@4835[I]:received "control" from ctl▸⚐
encoder@4835[I]:RequestLoopEnd() causes the breaking from the event loop
encoder@4835[I]:#sent: 119562 #recv: 59781 sent_size: 25.2 GB recv_size: 19.4 GB
encoder@4835[I]:#sent: 119562 #recv: 59781 sent_size: 25.2 GB recv_size: 19.4 GB
encoder@4835[S]:terminated
encoder@4832[I]:received "control" from ctl▸⚐
encoder@4832[I]:RequestLoopEnd() causes the breaking from the event loop
encoder@4832[I]:#sent: 119560 #recv: 59780 sent_size: 25.2 GB recv_size: 19.4 GB
encoder@4832[I]:#sent: 119560 #recv: 59780 sent_size: 25.2 GB recv_size: 19.4 GB
encoder@4832[S]:terminated
encoder@4829[I]:received "control" from ctl▸⚐
encoder@4829[I]:RequestLoopEnd() causes the breaking from the event loop
encoder@4829[I]:#sent: 119566 #recv: 59783 sent_size: 25.2 GB recv_size: 19.4 GB
encoder@4829[I]:#sent: 119566 #recv: 59783 sent_size: 25.2 GB recv_size: 19.4 GB
encoder@4829[S]:terminated
vector_indexer@4871[S]:artifacts of this executor (vecidx) is persisted to ./data/doc_indexer-10/vecidx.bin
vector_indexer@4871[S]:artifacts of this executor (docidx) is persisted to ./data/doc_indexer-10/docidx.bin
vector_indexer@4871[I]:no update since 2020-08-25 07:09:45, will not save. If you really want to save it, call "touch()" before "save()" to force saving
vector_indexer@4871[I]:dumped changes to the executor, 41s since last the save
vector_indexer@4871[I]:#sent: 119562 #recv: 59781 sent_size: 580.4 MB recv_size: 25.2 GB
vector_indexer@4871[I]:indexer size: 2989000
vector_indexer@4871[I]:indexer size: 2989000
vector_indexer@4871[S]:terminated
The index process finished without printing any error message, but I don't know if the messages like no update since 2020-08-25 07:09:45, will not save. If you really want to save it, call "touch()" before "save()" to force saving
are alerting us that the gzip files are not being closed? Following the documentation:
If you use
with
scope to start a flow, then all resources (including pods of all kinds) of the flow will be released after your move out from the scope.
And we are using with
, as you can see in the next lines extracted from app.py
:
def index(num_docs):
config(num_docs, mode='index')
f = Flow().load_config('flow-index.yml')
with f:
f.index_lines(lines=index_fn(), batch_size=os.environ.get('BATCH_SIZE'))
So, in theory, we shouldn't need to close any file manually. But as you can see in the next test, most of them are inconsistent:
for f in $(find . -name "vec.gz"); do echo "Checking: $f" ; gzip -t $f ; done
Checking: ./data/doc_indexer-10/vec.gz
Checking: ./data/doc_indexer-5/vec.gz
gzip: ./data/doc_indexer-5/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-3/vec.gz
gzip: ./data/doc_indexer-3/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-1/vec.gz
gzip: ./data/doc_indexer-1/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-4/vec.gz
gzip: ./data/doc_indexer-4/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-9/vec.gz
gzip: ./data/doc_indexer-9/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-8/vec.gz
gzip: ./data/doc_indexer-8/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-6/vec.gz
gzip: ./data/doc_indexer-6/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-7/vec.gz
gzip: ./data/doc_indexer-7/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-2/vec.gz
gzip: ./data/doc_indexer-2/vec.gz: unexpected end of file
For giving you more information, we are using now the next configuration values:
And also note that this issue is not happening when we index only few gigabytes, but is happening when we index our full test dataset of 30 million of text documents.
Maybe we are doing something wrong. Could you please provide some clarity? Thanks in advance!
Hello @jmferrete,
I do not see anything wrong in your files, is it any chance that u can share the logs of your indexing
step? As you say, using Flow as a context manager guarantees that all the resources are properly closed and released but some problems have clearly come up here. Is it possible that you had any kind of disk space problem?
As I said it would be great if you can share the indexing logs to see if some errors show up that we missed.
Thank you very much
Thanks Joan,
In this case, the machine was created for this purpose, and it was provisioned with much more disk than enough (I think). The full index files are around 54GB and we have more than 3 TB free.
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 20G 8,5G 12G 44% /
/dev/md0 3,5T 295G 3,0T 9% /data
We are using the /data
volume for storing all the files. It contains both the source files and the index files.
It was configured with RAID 0 by using 2 NVME SSD disks. Only to for discard RAID problems, we also have tested on a single NVME disk in other machine and it is happening the same.
Here you can find the output logs.
Thanks again.
Hey @jmferrete ,
It is weird, it seems that the files were not properly closed, could you please tell me what jina version
you run your index with?
Sure, this is the one with I ran the last indexing process. I think is the same that I've posted in the description:
$ jina --version-full
jina 0.4.1
jina-proto 0.0.55
jina-vcs-tag (unset)
libzmq 4.3.2
pyzmq 1.18.5
protobuf 3.13.0
proto-backend cpp
grpcio 1.31.0
ruamel.yaml 0.16.10
python 3.8.5
platform Linux
platform-release 4.14.146-93.123.amzn1.x86_64
platform-version #1 SMP Tue Sep 24 00:45:23 UTC 2019
architecture x86_64
processor x86_64
jina-resources /home/ec2-user/.pyenv/versions/3.8.5/envs/jina/lib/python3.8/site-packages/jina/resources
JINA_ARRAY_QUANT (unset)
JINA_CONTRIB_MODULE (unset)
JINA_CONTRIB_MODULE_IS_LOADING(unset)
JINA_CONTROL_PORT (unset)
JINA_DEFAULT_HOST (unset)
JINA_EXECUTOR_WORKDIR (unset)
JINA_FULL_CLI (unset)
JINA_IPC_SOCK_TMP (unset)
JINA_LOG_FILE (unset)
JINA_LOG_LONG (unset)
JINA_LOG_NO_COLOR (unset)
JINA_LOG_PROFILING (unset)
JINA_LOG_SSE (unset)
JINA_LOG_VERBOSITY (unset)
JINA_POD_NAME (unset)
JINA_PROFILING (unset)
JINA_SOCKET_HWM (unset)
JINA_STACK_CONFIG (unset)
JINA_TEST_CONTAINER (unset)
JINA_TEST_GPU (unset)
JINA_TEST_PRETRAINED (unset)
JINA_VCS_VERSION (unset)
JINA_VERSION (unset)
JINA_WARN_UNNAMED (unset)
JINA_BINARY_DELIMITER (unset)
JINA_DISABLE_UVLOOP (unset)
Regards!
Sure, this is the one with I ran the last indexing process. I think is the same that I've posted in the description:
$ jina --version-full jina 0.4.1 jina-proto 0.0.55 jina-vcs-tag (unset) libzmq 4.3.2 pyzmq 1.18.5 protobuf 3.13.0 proto-backend cpp grpcio 1.31.0 ruamel.yaml 0.16.10 python 3.8.5 platform Linux platform-release 4.14.146-93.123.amzn1.x86_64 platform-version #1 SMP Tue Sep 24 00:45:23 UTC 2019 architecture x86_64 processor x86_64 jina-resources /home/ec2-user/.pyenv/versions/3.8.5/envs/jina/lib/python3.8/site-packages/jina/resources JINA_ARRAY_QUANT (unset) JINA_CONTRIB_MODULE (unset) JINA_CONTRIB_MODULE_IS_LOADING(unset) JINA_CONTROL_PORT (unset) JINA_DEFAULT_HOST (unset) JINA_EXECUTOR_WORKDIR (unset) JINA_FULL_CLI (unset) JINA_IPC_SOCK_TMP (unset) JINA_LOG_FILE (unset) JINA_LOG_LONG (unset) JINA_LOG_NO_COLOR (unset) JINA_LOG_PROFILING (unset) JINA_LOG_SSE (unset) JINA_LOG_VERBOSITY (unset) JINA_POD_NAME (unset) JINA_PROFILING (unset) JINA_SOCKET_HWM (unset) JINA_STACK_CONFIG (unset) JINA_TEST_CONTAINER (unset) JINA_TEST_GPU (unset) JINA_TEST_PRETRAINED (unset) JINA_VCS_VERSION (unset) JINA_VERSION (unset) JINA_WARN_UNNAMED (unset) JINA_BINARY_DELIMITER (unset) JINA_DISABLE_UVLOOP (unset)
Regards!
Oh yes, sorry for that
Hello @jmferrete,
For what I can see int he logs, just 2 of the shards
were properly closed (I assume those are the corrresponding to your
./data/doc_indexer-10
and ./data/doc_indexer-7
.
I think that the issue is that before calling close
on the executor we do some executor saving (all inside the same finally block), and this saving
must be raising some exception
that are not handled and avoids the actual closing of the index
files.
We will try to do some work on that and try to provide a fix or a better way to identify the issue.
Regards,
Joan
Edit: This is wrong, I was confused, the 2 peas properly terminated were the Head and Tail Peas
Ok!, thanks.
Hello @jmferrete ,
I have been able to reproduce the issue, and I was wrong, no exception was raised when saving the data files.
There is a bug resulting from the presence of dumped
executors from previous runs. If you make sure to clean your data before running the index flow it should be good to go.
Meanwhile we will work on solving this issue.
Thank you very much,
Regards
Hi @jmferrete
Besides what I said in the last message, I have two recommendations for your flow configuration and code.
Since there are no chunks involved in the process you should eliminate the ranker
from the query
flow, and in the app.py
you should change the line:
text = match.meta_info.decode()
for the line:
text = match.text
I hope it helps
Thanks a lot @JoanFM , we will test again following your recommendations.
We will double check that we are cleaning the data folder before executing the index process.
Hi @JoanFM ,
there must be another thing or we are not cleaning the right folders.
I've tried again with a full indexation making sure that we are cleaning the data/
folder, and we have exactly the same issue:
$> rm -rf data/*
$> rm -rf pods/__pycache__
$> ls -ltra data/
total 8
drwxrwxr-x 6 jferrete jferrete 4096 ago 25 17:59 ..
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 20:55 .
$> PREFETCH=5 PREFETCH_ON_RECV=1 BATCH_SIZE=50 PARALLEL=10 SHARDS=10 python app.py -t index 2>&1 |pigz -p $(nproc) > jina-ai-output-2.log.gz
$> ls -ltra data/
total 48
drwxrwxr-x 6 jferrete jferrete 4096 ago 25 17:59 ..
drwxrwxr-x 12 jferrete jferrete 4096 ago 25 20:57 .
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-1
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-3
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-2
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-4
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-5
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-6
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-7
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-8
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-9
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-10
$> for f in $(find . -name "vec.gz"); do echo "Checking: $f" ; gzip -t $f ; done
Checking: ./data/doc_indexer-9/vec.gz
gzip: ./data/doc_indexer-9/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-1/vec.gz
gzip: ./data/doc_indexer-1/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-6/vec.gz
gzip: ./data/doc_indexer-6/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-10/vec.gz
gzip: ./data/doc_indexer-10/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-8/vec.gz
gzip: ./data/doc_indexer-8/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-7/vec.gz
gzip: ./data/doc_indexer-7/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-4/vec.gz
gzip: ./data/doc_indexer-4/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-2/vec.gz
gzip: ./data/doc_indexer-2/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-5/vec.gz
gzip: ./data/doc_indexer-5/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-3/vec.gz
gzip: ./data/doc_indexer-3/vec.gz: unexpected end of file
As you can see, the problem persists.
Are we missing something?, is there any other folder apart of data/
that we should clean?
Here you have the logs for this execution.
Thanks in advance.
Hi @jmferrete I just catch up on this thread, just FYI
Prefetch: 200 Batch size: 1000 Shards: 10 Parallel: 10
This means once fully running, you will have 200 * 1000 documents in memory at any given time.
I will read through the thread later
Thanks for the information @hanxiao.
We already decreased our batch size and prefetch settings since @JoanFM, commented here.
Our current index settings are these:
I'm marking this as high-prio now
➤ Lucia Loher commented:
So far; tested with 1 Mil. Documents, need to find good benchmark datasets
➤ Rutuja Surve commented:
We could have to define a new persistence strategy for saving the executors, by implementing get_state, especially applicable for huge datasets.
Also, a functionality that checks the consistency and validates the persistence by verifying check-sum of the persisted state with the expected state.
Hi,
We have some questions, have you tested this out in a large scale implementation? We will need to use Jina in a very large horizontal scaling implementation, and our idea is to be able to perform subsecond queries. Is this possible with Jina? Because we didn't found how to do this in your documentation.
Is there anything we can do for helping you with this issue?
Can we provide you some dataset so you can test with it in your side?
Thanks in advance.
Hi @jmferrete
Regarding the broken gz
/pickle bug, I believe #907 fixed it and has been rolled out in 0.5.2
. Let me recap what was the problem and how did we fix it.
First, this is a bug in our BasePea
lifecycle implementation, previously our BasePea
is set with daemon=True
and the executor persistence logic is written in the finally
block. Unfortunately, this finally
block is not guaranteed to be finished due to the following event chain.
__exit__
on every Pods. [main process]__exit__
on every Pea. [main process]__exit__
of the Pea, the [main process] sends a TERMINATE signal via zmq and asks the eventloop [the child process] to shutdown.TERMINATE
signal and sends back to tell the main process that it already closes all zmq context.finally
in the child process is called and the executor is now saved.After step 4 is done, the main process considers the child process is done and immediately return __exit__
to the upper level regardless the running status of finally
. In the end, the Flow terminates the main process, and since all child processes are daemon
processes, they are all killed as well.
Your problem is due to the child process in BasePea
is killed while writing executor in its finally
, yielding a broken file.
Thanks for finding it out and as you can imagine this is a pretty serious problem, and it did take us some time to find the reason.
Having the problem diagnosed, #907 is mainly about fixing the lifecycle and optimizing the shutdown logics:
BasePea
child process is not daemon
anymore__exit__
, we use self.join()
to wait the lifecycle of a BasePea
is finally ends. This guarantees that __exit__
is only returned after finally
block is finished.finally
block as finally
is not a 100% secure place to put teardown logic. Hope that clarifies.
You can now use pip install -U jina
for the upgrade. @JoanFM may also add an integrated test for this to ensure this never comes up again.
Thanks for the explaining and for the fast resolution.
We will now be able to continue doing our tests with Jina.
I will comment if I find something else.
Regards.
Describe your problem We are building a semantic text search system, using NumpyIndexer and cosine distance metric. We have been testing Jina with smaller sets for the moment, running in single instances, but our final goal is to run Jira in a distributed system over datasets with billions of documents.
As a first step, and to validate our configuration when querying smaller datasets, we have indexed around 30 million text documents (simple sentences), using the following settings:
(29889733 docs :white_check_mark: done in :stopwatch: 6595.7s :racehorse: 4531.8/s) So the indexing part is fine.
But we are having problems while performing queries over this data. Any query causes an out of memory error at OS level.
We tried running queries on differents environments:
But we are still having the same memory problems.
We would like to get some feedback about the problems we are having.
Dataset details: Regarding the input data (around 150GB of plain text files), it is spread into multiple files in the
./source/
folder. Each input file has the following structure: file1file2
Basically, multiline json files, where each line is a json document. And each document is formed by two fields:
text
: The text string (to be added as meta-information during indexing).vector
: A pre-generated [512,]float32
array generated from thetext
field by using the universal sentence encoder ("https://tfhub.dev/google/universal-sentence-encoder/4")Once the data is indexed, the resulting compressed dataset is around 56GB.
Here you can find the setup files tar.gz
What is your guess? We guess there must be something wrong in our configuration, as we didn’t expect to run into memory problems, given the dataset size and the instances we have used for testing.
Environment We have tried with differents environments, but the common one is this:
Screenshots
Extracted from logs:
┆Issue is synchronized with this Jira Task by Unito