jmferrete commented 4 years ago

Describe your problem We are building a semantic text search system, using NumpyIndexer and cosine distance metric. We have been testing Jina with smaller sets for the moment, running in single instances, but our final goal is to run Jira in a distributed system over datasets with billions of documents.

As a first step, and to validate our configuration when querying smaller datasets, we have indexed around 30 million text documents (simple sentences), using the following settings:

Prefetch: 200
Batch size: 1000
Shards: 10
Parallel: 10

(29889733 docs :white_check_mark: done in :stopwatch: 6595.7s :racehorse: 4531.8/s) So the indexing part is fine.

But we are having problems while performing queries over this data. Any query causes an out of memory error at OS level.

We tried running queries on differents environments:

Machines with up to 240 GB of RAM.
Decreasing settings down to:
Prefetch: 10
Batch size: 50
Shards: 1
Parallel: 1

But we are still having the same memory problems.

We would like to get some feedback about the problems we are having.

Dataset details: Regarding the input data (around 150GB of plain text files), it is spread into multiple files in the ./source/ folder. Each input file has the following structure: file1

{"text": "my sample text", "vector": [ 0.1234567, -0.2345678, 0.3456789...]}
{"text": "my other sample text", "vector": [ 0.4567890, 0.5678901, -0.6789012...]}
...

file2

{"text": "my sample text", "vector": [ 0.1234567, -0.2345678, 0.3456789...]}
{"text": "my other sample text", "vector": [ 0.4567890, 0.5678901, -0.6789012...]}
...

Basically, multiline json files, where each line is a json document. And each document is formed by two fields:

text: The text string (to be added as meta-information during indexing).
vector: A pre-generated [512,] float32 array generated from the text field by using the universal sentence encoder ("https://tfhub.dev/google/universal-sentence-encoder/4")

Once the data is indexed, the resulting compressed dataset is around 56GB.

Here you can find the setup files tar.gz

What is your guess? We guess there must be something wrong in our configuration, as we didn’t expect to run into memory problems, given the dataset size and the instances we have used for testing.

Environment We have tried with differents environments, but the common one is this:

(jina) [ec2-user@ip-10-200-7-102 ~]$ jina --version-full
jina                          0.4.1
jina-proto                    0.0.55
jina-vcs-tag                  (unset)
libzmq                        4.3.2
pyzmq                         1.18.5
protobuf                      3.13.0
proto-backend                 cpp
grpcio                        1.31.0
ruamel.yaml                   0.16.10
python                        3.8.5
platform                      Linux
platform-release              4.14.146-93.123.amzn1.x86_64
platform-version              #1 SMP Tue Sep 24 00:45:23 UTC 2019
architecture                  x86_64
processor                     x86_64
jina-resources                /home/ec2-user/.pyenv/versions/3.8.5/envs/jina/lib/python3.8/site-packages/jina/resources
JINA_ARRAY_QUANT              (unset)
JINA_CONTRIB_MODULE           (unset)
JINA_CONTRIB_MODULE_IS_LOADING(unset)
JINA_CONTROL_PORT             (unset)
JINA_DEFAULT_HOST             (unset)
JINA_EXECUTOR_WORKDIR         (unset)
JINA_FULL_CLI                 (unset)
JINA_IPC_SOCK_TMP             (unset)
JINA_LOG_FILE                 (unset)
JINA_LOG_LONG                 (unset)
JINA_LOG_NO_COLOR             (unset)
JINA_LOG_PROFILING            (unset)
JINA_LOG_SSE                  (unset)
JINA_LOG_VERBOSITY            (unset)
JINA_POD_NAME                 (unset)
JINA_PROFILING                (unset)
JINA_SOCKET_HWM               (unset)
JINA_STACK_CONFIG             (unset)
JINA_TEST_CONTAINER           (unset)
JINA_TEST_GPU                 (unset)
JINA_TEST_PRETRAINED          (unset)
JINA_VCS_VERSION              (unset)
JINA_VERSION                  (unset)
JINA_WARN_UNNAMED             (unset)
JINA_BINARY_DELIMITER         (unset)
JINA_DISABLE_UVLOOP           (unset)

(jina) [ec2-user@ip-10-200-7-102 ~]$ cat /etc/issue
Amazon Linux AMI release 2018.03
Kernel \r on an \m

Screenshots

Extracted from logs:

...
Unable to allocate 5.70 GiB for an array with shape (2988733, 512) and data type float32
...

┆Issue is synchronized with this Jira Task by Unito

JoanFM commented 4 years ago

Hello Jose Manuel,

Thank you for opening the issue and happy to see your interest in jina.

Taking a look to your configuration I have some comments although I am not sure they are the root cause of your problem (Some of them are just recommendations). It would be good that you can share some more detailed logs so that we can take a closer look at them.

In your CustomModel, you are doing a hub.load() call inside the encode method, which should be done at _post_init() step. Otherwise u are loading the model for every batch (I guess it is somehow bypassed and may not have huge impact but I would recommend not to call the load so often).
A batch size of 1000 seems like a really big batch size that could affect the memory required by the query encoder to work.
Another suggestion is that you can avoid having 2 indexers (index-chunk) and (index-doc) and keep only one (the CompoundIndexer). In your case, you are not breaking the documents (text) in different chunks. Therefore u can keep a single indexer (the index-chunk used now (Although you can name it index-doc since in your case no chunking is involved)).
(For future work)When your dataset will grow, I would recommend trying different indexers different than NumpyIndexer since it performs an exhaustive search that does not scale well for large amounts of data (see FaissIndexer, AnnoyIndexer ...)

These are just comments that may help you with your problem but I would also ask you to provide some more details of the logs.

I hope you find it helpful!

jmferrete commented 4 years ago

Thanks a lot Joan for your fast response.

We will apply with your recommendations and will back with more details and logs.

jmferrete commented 4 years ago

I've applied your recommendations to the flows. Here you can find the new gzip package with all the updated code.

But now it was unable to use it with the full dataset, due to we are experiencing an issue at indexing point. Basically all the index files (vec.gz files) are inconsistent after finishing the index process. And we are receiving these errors when loading them at query time:

 vector_indexer@8184[I]:loading index from ./data/doc_indexer-1/vec.gz...
 vector_indexer@8184[E]:./data/doc_indexer-1/vec.gz is broken/incomplete, perhaps forgot to ".close()" in the last usage?
 vector_indexer@8184[C]:unknown exception: EOFError('Compressed file ended before the end-of-stream marker was reached')
Traceback (most recent call last):
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pea.py", line 332, in run
    self.loop_body()
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pea.py", line 309, in loop_body
    self.load_executor()
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pea.py", line 192, in load_executor
    self.executor = BaseExecutor.load_config(self.args.uses if valid_local_config_source(self.args.uses) else self.args.uses_internal,
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/__init__.py", line 416, in load_config
    return yaml.load(tmp_s)
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/main.py", line 343, in load
    return constructor.get_single_data()
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 113, in get_single_data
    return self.construct_document(node)
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 118, in construct_document
    data = self.construct_object(node)
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 146, in construct_object
    data = self.construct_non_recursive_object(node)
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 181, in construct_non_recursive_object
    data = constructor(self, node)
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/compound.py", line 351, in from_yaml
    obj.components = lambda: data['components']
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/compound.py", line 258, in components
    self._set_routes()
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/compound.py", line 308, in _set_routes
    for m, _ in inspect.getmembers(c, predicate=inspect.ismethod):
  File "/home/ec2-user/.pyenv/versions/3.8.5/lib/python3.8/inspect.py", line 350, in getmembers
    value = getattr(object, key)
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/indexers/__init__.py", line 109, in query_handler
    self._query_handler = self.get_query_handler()
  File "/home/ec2-user/.pyenv/versions/jina/lib/python3.8/site-packages/jina/executors/indexers/keyvalue/proto.py", line 99, in get_query_handler
    tmp = fp.read()
  File "/home/ec2-user/.pyenv/versions/3.8.5/lib/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/home/ec2-user/.pyenv/versions/3.8.5/lib/python3.8/gzip.py", line 498, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

This is part of the output when the index process finishes:

 vector_indexer@4841[I]:#sent: 597797 #recv: 597797 sent_size: 5.3 GB recv_size: 5.3 GB
 vector_indexer@4841[I]:no update since 2020-08-25 07:09:45, will not save. If you really want to save it, call "touch()" before "save()" to force saving
 vector_indexer@4841[I]:executor says there is nothing to save
 vector_indexer@4841[I]:#sent: 597797 #recv: 597797 sent_size: 5.3 GB recv_size: 5.3 GB
 vector_indexer@4841[S]:terminated
 vector_indexer@4838[I]:received "control" from ctl▸⚐
 vector_indexer@4838[I]:RequestLoopEnd() causes the breaking from the event loop
 vector_indexer@4838[I]:#sent: 597797 #recv: 1195603 sent_size: 251.6 GB recv_size: 251.9 GB
 vector_indexer@4838[I]:no update since 2020-08-25 07:09:45, will not save. If you really want to save it, call "touch()" before "save()" to force saving
 vector_indexer@4838[I]:executor says there is nothing to save
 vector_indexer@4838[I]:#sent: 597797 #recv: 1195603 sent_size: 251.6 GB recv_size: 251.9 GB
 vector_indexer@4838[S]:terminated
        encoder@4835[I]:received "control" from ctl▸⚐
        encoder@4835[I]:RequestLoopEnd() causes the breaking from the event loop
        encoder@4835[I]:#sent: 119562 #recv: 59781 sent_size: 25.2 GB recv_size: 19.4 GB
        encoder@4835[I]:#sent: 119562 #recv: 59781 sent_size: 25.2 GB recv_size: 19.4 GB
        encoder@4835[S]:terminated
        encoder@4832[I]:received "control" from ctl▸⚐
        encoder@4832[I]:RequestLoopEnd() causes the breaking from the event loop
        encoder@4832[I]:#sent: 119560 #recv: 59780 sent_size: 25.2 GB recv_size: 19.4 GB
        encoder@4832[I]:#sent: 119560 #recv: 59780 sent_size: 25.2 GB recv_size: 19.4 GB
        encoder@4832[S]:terminated
        encoder@4829[I]:received "control" from ctl▸⚐
        encoder@4829[I]:RequestLoopEnd() causes the breaking from the event loop
        encoder@4829[I]:#sent: 119566 #recv: 59783 sent_size: 25.2 GB recv_size: 19.4 GB
        encoder@4829[I]:#sent: 119566 #recv: 59783 sent_size: 25.2 GB recv_size: 19.4 GB
        encoder@4829[S]:terminated
 vector_indexer@4871[S]:artifacts of this executor (vecidx) is persisted to ./data/doc_indexer-10/vecidx.bin
 vector_indexer@4871[S]:artifacts of this executor (docidx) is persisted to ./data/doc_indexer-10/docidx.bin
 vector_indexer@4871[I]:no update since 2020-08-25 07:09:45, will not save. If you really want to save it, call "touch()" before "save()" to force saving
 vector_indexer@4871[I]:dumped changes to the executor,  41s since last the save
 vector_indexer@4871[I]:#sent: 119562 #recv: 59781 sent_size: 580.4 MB recv_size: 25.2 GB
 vector_indexer@4871[I]:indexer size: 2989000
 vector_indexer@4871[I]:indexer size: 2989000
 vector_indexer@4871[S]:terminated

The index process finished without printing any error message, but I don't know if the messages like no update since 2020-08-25 07:09:45, will not save. If you really want to save it, call "touch()" before "save()" to force saving are alerting us that the gzip files are not being closed? Following the documentation:

If you use with scope to start a flow, then all resources (including pods of all kinds) of the flow will be released after your move out from the scope.

And we are using with, as you can see in the next lines extracted from app.py:

def index(num_docs):
    config(num_docs, mode='index')
    f = Flow().load_config('flow-index.yml')

    with f:
        f.index_lines(lines=index_fn(), batch_size=os.environ.get('BATCH_SIZE'))

So, in theory, we shouldn't need to close any file manually. But as you can see in the next test, most of them are inconsistent:

for f in $(find . -name "vec.gz"); do echo "Checking: $f" ; gzip -t $f ; done
Checking: ./data/doc_indexer-10/vec.gz
Checking: ./data/doc_indexer-5/vec.gz

gzip: ./data/doc_indexer-5/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-3/vec.gz

gzip: ./data/doc_indexer-3/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-1/vec.gz

gzip: ./data/doc_indexer-1/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-4/vec.gz

gzip: ./data/doc_indexer-4/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-9/vec.gz

gzip: ./data/doc_indexer-9/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-8/vec.gz

gzip: ./data/doc_indexer-8/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-6/vec.gz

gzip: ./data/doc_indexer-6/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-7/vec.gz

gzip: ./data/doc_indexer-7/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-2/vec.gz

gzip: ./data/doc_indexer-2/vec.gz: unexpected end of file

For giving you more information, we are using now the next configuration values:

Prefetch: 5
Prefetch_on_recv: 1
Batch size: 50
Shards: 10
Parallel: 10

And also note that this issue is not happening when we index only few gigabytes, but is happening when we index our full test dataset of 30 million of text documents.

Maybe we are doing something wrong. Could you please provide some clarity? Thanks in advance!

JoanFM commented 4 years ago

Hello @jmferrete,

I do not see anything wrong in your files, is it any chance that u can share the logs of your indexing step? As you say, using Flow as a context manager guarantees that all the resources are properly closed and released but some problems have clearly come up here. Is it possible that you had any kind of disk space problem?

As I said it would be great if you can share the indexing logs to see if some errors show up that we missed.

Thank you very much

jmferrete commented 4 years ago

Thanks Joan,

In this case, the machine was created for this purpose, and it was provisioned with much more disk than enough (I think). The full index files are around 54GB and we have more than 3 TB free.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G  8,5G   12G  44% /
/dev/md0        3,5T  295G  3,0T   9% /data

We are using the /data volume for storing all the files. It contains both the source files and the index files. It was configured with RAID 0 by using 2 NVME SSD disks. Only to for discard RAID problems, we also have tested on a single NVME disk in other machine and it is happening the same.

Here you can find the output logs.

Thanks again.

JoanFM commented 4 years ago

Hey @jmferrete ,

It is weird, it seems that the files were not properly closed, could you please tell me what jina version you run your index with?

jmferrete commented 4 years ago

Sure, this is the one with I ran the last indexing process. I think is the same that I've posted in the description:

$ jina --version-full
jina                          0.4.1   
jina-proto                    0.0.55  
jina-vcs-tag                  (unset)
libzmq                        4.3.2
pyzmq                         1.18.5
protobuf                      3.13.0
proto-backend                 cpp
grpcio                        1.31.0
ruamel.yaml                   0.16.10
python                        3.8.5
platform                      Linux
platform-release              4.14.146-93.123.amzn1.x86_64
platform-version              #1 SMP Tue Sep 24 00:45:23 UTC 2019
architecture                  x86_64
processor                     x86_64
jina-resources                /home/ec2-user/.pyenv/versions/3.8.5/envs/jina/lib/python3.8/site-packages/jina/resources
JINA_ARRAY_QUANT              (unset)
JINA_CONTRIB_MODULE           (unset)
JINA_CONTRIB_MODULE_IS_LOADING(unset)
JINA_CONTROL_PORT             (unset)
JINA_DEFAULT_HOST             (unset)
JINA_EXECUTOR_WORKDIR         (unset)
JINA_FULL_CLI                 (unset)
JINA_IPC_SOCK_TMP             (unset)
JINA_LOG_FILE                 (unset)
JINA_LOG_LONG                 (unset)
JINA_LOG_NO_COLOR             (unset)
JINA_LOG_PROFILING            (unset)
JINA_LOG_SSE                  (unset)
JINA_LOG_VERBOSITY            (unset)
JINA_POD_NAME                 (unset)
JINA_PROFILING                (unset)
JINA_SOCKET_HWM               (unset)
JINA_STACK_CONFIG             (unset)
JINA_TEST_CONTAINER           (unset)
JINA_TEST_GPU                 (unset)
JINA_TEST_PRETRAINED          (unset)
JINA_VCS_VERSION              (unset)
JINA_VERSION                  (unset)
JINA_WARN_UNNAMED             (unset)
JINA_BINARY_DELIMITER         (unset)
JINA_DISABLE_UVLOOP           (unset)

Regards!

JoanFM commented 4 years ago

Sure, this is the one with I ran the last indexing process. I think is the same that I've posted in the description:

$ jina --version-full
jina                          0.4.1   
jina-proto                    0.0.55  
jina-vcs-tag                  (unset)
libzmq                        4.3.2
pyzmq                         1.18.5
protobuf                      3.13.0
proto-backend                 cpp
grpcio                        1.31.0
ruamel.yaml                   0.16.10
python                        3.8.5
platform                      Linux
platform-release              4.14.146-93.123.amzn1.x86_64
platform-version              #1 SMP Tue Sep 24 00:45:23 UTC 2019
architecture                  x86_64
processor                     x86_64
jina-resources                /home/ec2-user/.pyenv/versions/3.8.5/envs/jina/lib/python3.8/site-packages/jina/resources
JINA_ARRAY_QUANT              (unset)
JINA_CONTRIB_MODULE           (unset)
JINA_CONTRIB_MODULE_IS_LOADING(unset)
JINA_CONTROL_PORT             (unset)
JINA_DEFAULT_HOST             (unset)
JINA_EXECUTOR_WORKDIR         (unset)
JINA_FULL_CLI                 (unset)
JINA_IPC_SOCK_TMP             (unset)
JINA_LOG_FILE                 (unset)
JINA_LOG_LONG                 (unset)
JINA_LOG_NO_COLOR             (unset)
JINA_LOG_PROFILING            (unset)
JINA_LOG_SSE                  (unset)
JINA_LOG_VERBOSITY            (unset)
JINA_POD_NAME                 (unset)
JINA_PROFILING                (unset)
JINA_SOCKET_HWM               (unset)
JINA_STACK_CONFIG             (unset)
JINA_TEST_CONTAINER           (unset)
JINA_TEST_GPU                 (unset)
JINA_TEST_PRETRAINED          (unset)
JINA_VCS_VERSION              (unset)
JINA_VERSION                  (unset)
JINA_WARN_UNNAMED             (unset)
JINA_BINARY_DELIMITER         (unset)
JINA_DISABLE_UVLOOP           (unset)

Regards!

Oh yes, sorry for that

JoanFM commented 4 years ago

Hello @jmferrete,

For what I can see int he logs, just 2 of the shards were properly closed (I assume those are the corrresponding to your ./data/doc_indexer-10 and ./data/doc_indexer-7.

I think that the issue is that before calling close on the executor we do some executor saving (all inside the same finally block), and this saving must be raising some exception that are not handled and avoids the actual closing of the index files.

We will try to do some work on that and try to provide a fix or a better way to identify the issue.

Regards,

Joan

Edit: This is wrong, I was confused, the 2 peas properly terminated were the Head and Tail Peas

jmferrete commented 4 years ago

Ok!, thanks.

JoanFM commented 4 years ago

Hello @jmferrete ,

I have been able to reproduce the issue, and I was wrong, no exception was raised when saving the data files.

There is a bug resulting from the presence of dumped executors from previous runs. If you make sure to clean your data before running the index flow it should be good to go.

Meanwhile we will work on solving this issue.

Thank you very much,

Regards

JoanFM commented 4 years ago

Hi @jmferrete

Besides what I said in the last message, I have two recommendations for your flow configuration and code.

Since there are no chunks involved in the process you should eliminate the ranker from the query flow, and in the app.py

you should change the line: text = match.meta_info.decode() for the line: text = match.text

I hope it helps

jmferrete commented 4 years ago

Thanks a lot @JoanFM , we will test again following your recommendations.

We will double check that we are cleaning the data folder before executing the index process.

jmferrete commented 4 years ago

Hi @JoanFM ,

there must be another thing or we are not cleaning the right folders.

I've tried again with a full indexation making sure that we are cleaning the data/ folder, and we have exactly the same issue:

$> rm -rf data/*
$> rm -rf pods/__pycache__
$> ls -ltra data/
total 8
drwxrwxr-x 6 jferrete jferrete 4096 ago 25 17:59 ..
drwxrwxr-x 2 jferrete jferrete 4096 ago 25 20:55 .
$> PREFETCH=5 PREFETCH_ON_RECV=1 BATCH_SIZE=50 PARALLEL=10 SHARDS=10 python app.py -t index 2>&1 |pigz -p $(nproc) > jina-ai-output-2.log.gz
$> ls -ltra data/
total 48
drwxrwxr-x  6 jferrete jferrete 4096 ago 25 17:59 ..
drwxrwxr-x 12 jferrete jferrete 4096 ago 25 20:57 .
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-1
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-3
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-2
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-4
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-5
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-6
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-7
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-8
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-9
drwxrwxr-x  2 jferrete jferrete 4096 ago 25 21:01 doc_indexer-10
$> for f in $(find . -name "vec.gz"); do echo "Checking: $f" ; gzip -t $f ; done
Checking: ./data/doc_indexer-9/vec.gz
gzip: ./data/doc_indexer-9/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-1/vec.gz
gzip: ./data/doc_indexer-1/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-6/vec.gz
gzip: ./data/doc_indexer-6/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-10/vec.gz
gzip: ./data/doc_indexer-10/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-8/vec.gz
gzip: ./data/doc_indexer-8/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-7/vec.gz
gzip: ./data/doc_indexer-7/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-4/vec.gz
gzip: ./data/doc_indexer-4/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-2/vec.gz
gzip: ./data/doc_indexer-2/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-5/vec.gz
gzip: ./data/doc_indexer-5/vec.gz: unexpected end of file
Checking: ./data/doc_indexer-3/vec.gz
gzip: ./data/doc_indexer-3/vec.gz: unexpected end of file

As you can see, the problem persists. Are we missing something?, is there any other folder apart of data/ that we should clean?

Here you have the logs for this execution.

Thanks in advance.

hanxiao commented 4 years ago

Hi @jmferrete I just catch up on this thread, just FYI

Prefetch: 200 Batch size: 1000 Shards: 10 Parallel: 10

This means once fully running, you will have 200 * 1000 documents in memory at any given time.

I will read through the thread later

jmferrete commented 4 years ago

Thanks for the information @hanxiao.

We already decreased our batch size and prefetch settings since @JoanFM, commented here.

Our current index settings are these:

Prefetch: 5
Prefetch_on_recv: 1
Batch size: 50
Shards: 10
Parallel: 10

hanxiao commented 4 years ago

I'm marking this as high-prio now

sync-by-unito[bot] commented 4 years ago

➤ Lucia Loher commented:

So far; tested with 1 Mil. Documents, need to find good benchmark datasets

sync-by-unito[bot] commented 4 years ago

➤ Rutuja Surve commented:

We could have to define a new persistence strategy for saving the executors, by implementing get_state, especially applicable for huge datasets.

Also, a functionality that checks the consistency and validates the persistence by verifying check-sum of the persisted state with the expected state.

jmferrete commented 4 years ago

Hi,

We have some questions, have you tested this out in a large scale implementation? We will need to use Jina in a very large horizontal scaling implementation, and our idea is to be able to perform subsecond queries. Is this possible with Jina? Because we didn't found how to do this in your documentation.

Is there anything we can do for helping you with this issue?

Can we provide you some dataset so you can test with it in your side?

Thanks in advance.

hanxiao commented 4 years ago

Hi @jmferrete

Regarding the broken gz/pickle bug, I believe #907 fixed it and has been rolled out in 0.5.2. Let me recap what was the problem and how did we fix it.

Problem

First, this is a bug in our BasePea lifecycle implementation, previously our BasePea is set with daemon=True and the executor persistence logic is written in the finally block. Unfortunately, this finally block is not guaranteed to be finished due to the following event chain.

Flow terminates all contexts by implicitly calling __exit__ on every Pods. [main process]
Each Pod terminates all context by implicitly calling __exit__ on every Pea. [main process]
In __exit__ of the Pea, the [main process] sends a TERMINATE signal via zmq and asks the eventloop [the child process] to shutdown.
The child process receives the TERMINATE signal and sends back to tell the main process that it already closes all zmq context.
finally in the child process is called and the executor is now saved.

After step 4 is done, the main process considers the child process is done and immediately return __exit__ to the upper level regardless the running status of finally. In the end, the Flow terminates the main process, and since all child processes are daemon processes, they are all killed as well.

Your problem is due to the child process in BasePea is killed while writing executor in its finally, yielding a broken file.

Thanks for finding it out and as you can imagine this is a pretty serious problem, and it did take us some time to find the reason.

Solution

Having the problem diagnosed, #907 is mainly about fixing the lifecycle and optimizing the shutdown logics:

BasePea child process is not daemon anymore
In __exit__, we use self.join() to wait the lifecycle of a BasePea is finally ends. This guarantees that __exit__ is only returned after finally block is finished.
We moved the executor saving logic out of finally block as finally is not a 100% secure place to put teardown logic.

Hope that clarifies.

Aftermath

You can now use pip install -U jina for the upgrade. @JoanFM may also add an integrated test for this to ensure this never comes up again.

jmferrete commented 4 years ago

Thanks for the explaining and for the fast resolution.

We will now be able to continue doing our tests with Jina.

I will comment if I find something else.

Regards.

jina-ai / serve

Stress Test in CICD with big datasets #867

Problem

Solution

Aftermath