
Deleting bucket eventually fails and makes delete queues stuck #725

Open vstax opened 7 years ago

vstax commented 7 years ago

I got a test cluster (1.3.4, 3 storage nodes, N=2, W=1). There are two buckets, "body" and "bodytest", each containing the same objects, about 1M in each (there are some other buckets as well, but they hardly contain anything). In other words, there are slightly over 2M objects in the cluster in total. At the start of this test the data is fully consistent. There is some minor load on the cluster involving the "body" bucket - some PUT & GET operations, but very few of them. No one tries to access the "bodytest" bucket.

I want to remove "bodytest" with all its objects. I execute s3cmd rb s3://bodytest. I see load on the gateway and storage nodes; after some time, s3cmd fails because of a timeout (I expect this to happen - there is no way storage can find all 1M objects and mark them as deleted fast enough). I see the leo_async_deletion_queue queues growing on the storage nodes:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 97845          | 0              | 3000           | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 102780         | 0              | 3000           | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 104911         | 0              | 3000           | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 108396         | 0              | 3000           | async deletion of objs

The same goes for storage_0 and storage_1. There is hardly any disk load; each storage node consumes 120-130% CPU as per top.
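
For reference, the numbers above come from repeatedly running mq-stats by hand; a small loop over all three nodes (a minimal sketch, node names per this setup) would look like:

# poll leo_async_deletion_queue on every storage node
for node in storage_0@192.168.3.53 storage_1@192.168.3.54 storage_2@192.168.3.55; do
    /usr/local/bin/leofs-adm mq-stats "$node" | grep leo_async_deletion_queue
done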

Then some errors appear in the error log on gateway_0:

[W] gateway_0@192.168.3.52  2017-05-04 17:15:53.998704 +0300    1493907353  leo_gateway_s3_api:delete_bucket_2/3    1774    [{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-04 17:15:58.999759 +0300    1493907358  leo_gateway_s3_api:delete_bucket_2/3    1774    [{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-04 17:16:07.11702 +0300 1493907367  leo_gateway_s3_api:delete_bucket_2/3    1774    [{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-04 17:16:12.12715 +0300 1493907372  leo_gateway_s3_api:delete_bucket_2/3    1774    [{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-04 17:16:23.48706 +0300 1493907383  leo_gateway_s3_api:delete_bucket_2/3    1774    [{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-04 17:16:28.49750 +0300 1493907388  leo_gateway_s3_api:delete_bucket_2/3    1774    [{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-04 17:17:01.702719 +0300    1493907421  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_0@192.168.3.53'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-04 17:17:06.703840 +0300    1493907426  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_1@192.168.3.54'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]

If these errors simply mean that the gateway returned "timeout" to the client that requested the "delete bucket" operation, plus some other timeouts due to load on the system - then it's within expectations; as long as all data from that bucket eventually gets marked as "deleted" asynchronously, all is fine.

That's not what happens, however. At some point - a few minutes after the "delete bucket" operation - the delete queues stop growing or shrinking. It's as if they are stuck. Here is their current state - 1.5 hours after the experiment; they got to that state within 5-10 minutes after the start of the experiment and never changed since (I show only one queue here, the others are empty):

[root@leo-m0 app]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs
[root@leo-m0 app]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 53559          | 0              | 3000           | async deletion of objs
[root@leo-m0 app]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
 leo_async_deletion_queue       |   idling    | 136972         | 0              | 3000           | async deletion of objs

There is nothing in the logs of the manager nodes. There is nothing in the erlang.log files on the storage nodes (no mention of restarts or anything). Logs on the storage nodes - info log for storage_0:

[I] storage_0@192.168.3.53  2017-05-04 17:16:06.79494 +0300 1493907366  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,17086}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:06.80095 +0300 1493907366  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,14217}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:06.80515 +0300 1493907366  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{processing_time,12232}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:24.135151 +0300    1493907384  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,30141}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:34.277504 +0300    1493907394  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,28198}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:34.277892 +0300    1493907394  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/0a/59/ab/0a59aba721c409c8f9bf0bba176d10242380842653e6994f782fbe31cb2296b46cba031a085b4af057ab43314631e3691c5c000000000000.xz">>},{processing_time,24674}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:34.280088 +0300    1493907394  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/25/03/16/250316ef50b26272f99b757409d75c173135d2ef09d972821072348ad071e49897dd7245c1f250db6489a401aea567d9886e000000000000.xz">>},{processing_time,5303}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:43.179282 +0300    1493907403  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,41173}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:43.179708 +0300    1493907403  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,18328}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:43.180082 +0300    1493907403  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,37101}]
[I] storage_0@192.168.3.53  2017-05-04 17:16:43.180461 +0300    1493907403  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{processing_time,37100}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:03.11597 +0300 1493907423  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,28734}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:03.12445 +0300 1493907423  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/04/12/d6/0412d6227769cfef42764802004d465e83b309d884ec841d330ede99a2a551eda52ddda2b9774531e88b0b060dbb3a17c092000000000000.xz">>},{processing_time,27558}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:03.12986 +0300 1493907423  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/03/68/ed/0368ed293eded8de34b8728325c08f1603edcede7e8de7778c81d558b5b0c9cd9f85307d1c571b9e549ef92e9ec69498005a7b0000000000.xz\n1">>},{processing_time,8691}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:03.809203 +0300    1493907423  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,56801}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:03.809742 +0300    1493907423  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{processing_time,9943}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:32.116059 +0300    1493907452  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,74073}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:32.116367 +0300    1493907452  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{processing_time,36264}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:32.116743 +0300    1493907452  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/4e/cc/ef/4eccef1b917e48d1df702faab63181162c7a8f67998d7b5ef11ac33940ffe6362a8d1671c5e5f2c39945669b1d04f1ef0027720000000000.xz\n1">>},{processing_time,14748}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:32.117065 +0300    1493907452  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/66/06/b6/6606b6f9fab7e81872e8a628b696634b0c25294b0a79fbaf03c10ec49aff1d44a0921efcba5a53127e84d83edf7206d758ac850000000000.xz\n1">>},{processing_time,13881}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:33.13033 +0300 1493907453  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,30002}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:36.111764 +0300    1493907456  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/06/04/2d/06042debc9eb14bd654582d79019c02698f9514ae73709dda7f6a614868d294819908ad755cf814fb43059560fe2f0c984c3010000000000.xz">>},{processing_time,30001}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:40.473183 +0300    1493907460  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/dd/f3/60/ddf360276b8ccf938a0cfdb8f260bdf2acdcd85b9cf1e2c8f8d3b1b2d0ad17554e8a1d7d2490d396e7ad08532c9e90ac7cf2040100000000.xz\n1">>},{processing_time,16375}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:40.478082 +0300    1493907460  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/2c/5a/d2/2c5ad26023e73f37b47ee824eea753e550b99dc2945281102253e12b88c122dfbc7fcdad9706e0ee6d0dc19e86d10b76a8277f0000000000.xz\n2">>},{processing_time,7066}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.502257 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,84458}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.503120 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/33/8b/4c/338b4c25ca8fbb66a86f24bfc302f2fa4a9c657074c14e41692e5864a121849c4ad9a0f7342a35a16fc906d159980560782b010000000000.xz">>},{processing_time,20645}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.503488 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/0e/6e/8b/0e6e8bcdbc732024193f7114b3f7d607333a9d3212a71e7104aea2b2b3bc137514eadd9c4d7de516e345feb9764186d9389d000000000000.xz">>},{processing_time,22633}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.503863 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/e3/32/e0/e332e00f0f77cf322cbc1d7a30369f8681073076c49868ab9cd5cee6043dfe3ebc8c355ae5899b74602ba763dcba872450bc560000000000.xz\n1">>},{processing_time,5896}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.521029 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,19185}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.521524 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{processing_time,83386}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.521894 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,19168}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.522149 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,64342}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.522401 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,18958}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.522652 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,18803}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.522912 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,14251}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.524355 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,50816}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.525083 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,45786}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.526651 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{processing_time,43717}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.527288 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,22024}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.527732 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/4e/cc/ef/4eccef1b917e48d1df702faab63181162c7a8f67998d7b5ef11ac33940ffe6362a8d1671c5e5f2c39945669b1d04f1ef0027720000000000.xz\n1">>},{processing_time,15411}]
[I] storage_0@192.168.3.53  2017-05-04 17:17:47.528128 +0300    1493907467  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/66/06/b6/6606b6f9fab7e81872e8a628b696634b0c25294b0a79fbaf03c10ec49aff1d44a0921efcba5a53127e84d83edf7206d758ac850000000000.xz\n1">>},{processing_time,15411}]

Error log on storage_0:

[W] storage_0@192.168.3.53  2017-05-04 17:16:23.854142 +0300    1493907383  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-04 17:16:24.850092 +0300    1493907384  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-04 17:16:54.851871 +0300    1493907414  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-04 17:16:55.851836 +0300    1493907415  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-04 17:17:25.853134 +0300    1493907445  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-04 17:17:26.855813 +0300    1493907446  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-04 17:17:36.117939 +0300    1493907456  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/06/04/2d/06042debc9eb14bd654582d79019c02698f9514ae73709dda7f6a614868d294819908ad755cf814fb43059560fe2f0c984c3010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-04 17:17:37.113194 +0300    1493907457  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/06/04/2d/06042debc9eb14bd654582d79019c02698f9514ae73709dda7f6a614868d294819908ad755cf814fb43059560fe2f0c984c3010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-04 17:17:47.534739 +0300    1493907467  leo_storage_read_repairer:compare/4 165 [{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907416708152},{cause,primary_inconsistency}]
[W] storage_0@192.168.3.53  2017-05-04 17:17:47.538512 +0300    1493907467  leo_storage_read_repairer:compare/4 165 [{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]
[W] storage_0@192.168.3.53  2017-05-04 17:17:47.542151 +0300    1493907467  leo_storage_read_repairer:compare/4 165 [{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]
[W] storage_0@192.168.3.53  2017-05-04 17:17:47.549344 +0300    1493907467  leo_storage_read_repairer:compare/4 165 [{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]

Info log on storage_1:

[I] storage_1@192.168.3.54  2017-05-04 17:16:10.725946 +0300    1493907370  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,21667}]
[I] storage_1@192.168.3.54  2017-05-04 17:16:20.764386 +0300    1493907380  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,26764}]
[I] storage_1@192.168.3.54  2017-05-04 17:16:37.95550 +0300 1493907397  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,35064}]
[I] storage_1@192.168.3.54  2017-05-04 17:16:47.109806 +0300    1493907407  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,40093}]
[I] storage_1@192.168.3.54  2017-05-04 17:17:03.480048 +0300    1493907423  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,45433}]
[I] storage_1@192.168.3.54  2017-05-04 17:17:03.480713 +0300    1493907423  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/f0/98/63/f0986371cf98c032e6c870ae6f4a26fac08ae91958c14c86478b39b758ea58953095a32412961708a0a9090e0d2da4edf615cc0100000000.xz\n1">>},{processing_time,5704}]
[I] storage_1@192.168.3.54  2017-05-04 17:17:13.497836 +0300    1493907433  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,50442}]
[I] storage_1@192.168.3.54  2017-05-04 17:17:13.503749 +0300    1493907433  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/4e/18/4f/4e184f78326e5665991244965ecf1a3bca129ed4353adb0d8cc63a5c7d8a7a49a7ade04120ba3a5c75e18c5be2da79ffa829a40000000000.xz\n2">>},{processing_time,8485}]
[I] storage_1@192.168.3.54  2017-05-04 17:17:13.637206 +0300    1493907433  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,16918}]
[I] storage_1@192.168.3.54  2017-05-04 17:17:13.641295 +0300    1493907433  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,11937}]
[I] storage_1@192.168.3.54  2017-05-04 17:17:13.660910 +0300    1493907433  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/f0/98/63/f0986371cf98c032e6c870ae6f4a26fac08ae91958c14c86478b39b758ea58953095a32412961708a0a9090e0d2da4edf615cc0100000000.xz\n1">>},{processing_time,10180}]

Error log on storage_1:

[E] storage_1@192.168.3.54  2017-05-04 17:16:10.720827 +0300    1493907370  leo_backend_db_eleveldb:prefix_search/3222  {timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,38,115,152,115,44,32,50,32,91,246,196,247,235,102,48,217,109,0,0,0,133,98,111,100,121,116,101,115,116,47,57,54,47,54,50,47,97,57,47,57,54,54,50,97,57,57,57,51,51,50,49,51,54,52,53,100,48,50,54,102,51,57,56,53,56,97,48,50,51,99,50,100,48,54,100,101,50,51,98,55,101,56,48,53,52,52,56,48,51,102,48,50,98,100,50,51,52,49,98,102,53,53,102,55,48,54,56,50,100,100,99,54,51,102,52,53,52,52,55,48,99,49,51,102,99,100,48,101,51,51,100,50,52,55,102,48,52,48,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,217,31,144,105,179,78,5,110,16,0,38,115,152,115,44,32,50,32,91,246,196,247,235,102,48,217,109,0,0,0,133,98,111,100,121,116,101,115,116,47,57,54,47,54,50,47,97,57,47,57,54,54,50,97,57,57,57,51,51,50,49,51,54,52,53,100,48,50,54,102,51,57,56,53,56,97,48,50,51,99,50,100,48,54,100,101,50,51,98,55,101,56,48,53,52,52,56,48,51,102,48,50,98,100,50,51,52,49,98,102,53,53,102,55,48,54,56,50,100,100,99,54,51,102,52,53,52,52,55,48,99,49,51,102,99,100,48,101,51,51,100,50,52,55,102,48,52,48,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,160,179,127,210,14,97,0>>},10000]}}
[E] storage_1@192.168.3.54  2017-05-04 17:16:20.760699 +0300    1493907380  leo_backend_db_eleveldb:prefix_search/3222  {timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,32,17,41,106,179,78,5,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,170,179,127,210,14,97,0>>},10000]}}
[E] storage_1@192.168.3.54  2017-05-04 17:16:37.94918 +0300 1493907397  leo_backend_db_eleveldb:prefix_search/3 222 {timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,205,38,194,2,251,99,149,185,246,131,149,156,96,116,90,188,109,0,0,0,133,98,111,100,121,116,101,115,116,47,55,56,47,56,56,47,102,97,47,55,56,56,56,102,97,97,48,54,54,101,50,98,54,56,52,54,99,57,98,56,52,51,55,102,52,99,100,55,98,97,53,55,100,99,52,51,55,98,100,100,98,99,51,98,56,53,51,101,101,100,48,53,98,101,57,56,101,48,97,97,99,49,98,97,97,51,51,57,52,101,55,48,55,48,98,48,101,57,98,49,99,101,99,57,99,98,99,57,101,49,50,55,54,99,54,97,56,97,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,184,132,34,107,179,78,5,110,16,0,205,38,194,2,251,99,149,185,246,131,149,156,96,116,90,188,109,0,0,0,133,98,111,100,121,116,101,115,116,47,55,56,47,56,56,47,102,97,47,55,56,56,56,102,97,97,48,54,54,101,50,98,54,56,52,54,99,57,98,56,52,51,55,102,52,99,100,55,98,97,53,55,100,99,52,51,55,98,100,100,98,99,51,98,56,53,51,101,101,100,48,53,98,101,57,56,101,48,97,97,99,49,98,97,97,51,51,57,52,101,55,48,55,48,98,48,101,57,98,49,99,101,99,57,99,98,99,57,101,49,50,55,54,99,54,97,56,97,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,187,179,127,210,14,97,0>>},10000]}}
[E] storage_1@192.168.3.54  2017-05-04 17:16:47.108568 +0300    1493907407  leo_backend_db_eleveldb:prefix_search/3222  {timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,23,93,187,107,179,78,5,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,197,179,127,210,14,97,0>>},10000]}}
[E] storage_1@192.168.3.54  2017-05-04 17:17:03.478769 +0300    1493907423  leo_backend_db_eleveldb:prefix_search/3222  {timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,37,221,237,84,123,140,135,76,39,216,128,38,178,216,253,43,109,0,0,0,133,98,111,100,121,116,101,115,116,47,52,56,47,54,100,47,53,102,47,52,56,54,100,53,102,98,51,98,55,55,52,99,56,97,102,51,97,102,54,102,100,48,51,53,102,51,54,55,98,100,48,52,52,100,98,100,56,97,97,55,49,102,51,52,98,54,51,53,51,53,57,99,57,57,102,56,48,55,101,54,51,102,98,99,54,52,48,97,53,99,56,97,98,99,50,49,51,99,52,50,49,52,51,100,52,55,101,98,57,101,55,49,101,48,99,48,52,48,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,76,233,180,108,179,78,5,110,16,0,37,221,237,84,123,140,135,76,39,216,128,38,178,216,253,43,109,0,0,0,133,98,111,100,121,116,101,115,116,47,52,56,47,54,100,47,53,102,47,52,56,54,100,53,102,98,51,98,55,55,52,99,56,97,102,51,97,102,54,102,100,48,51,53,102,51,54,55,98,100,48,52,52,100,98,100,56,97,97,55,49,102,51,52,98,54,51,53,51,53,57,99,57,57,102,56,48,55,101,54,51,102,98,99,54,52,48,97,53,99,56,97,98,99,50,49,51,99,52,50,49,52,51,100,52,55,101,98,57,101,55,49,101,48,99,48,52,48,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,213,179,127,210,14,97,0>>},10000]}}
[E] storage_1@192.168.3.54  2017-05-04 17:17:13.490725 +0300    1493907433  leo_backend_db_eleveldb:prefix_search/3222  {timeout,{gen_server,call,[leo_async_deletion_queue_1,{enqueue,<<131,104,2,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122>>,<<131,104,6,100,0,22,97,115,121,110,99,95,100,101,108,101,116,105,111,110,95,109,101,115,115,97,103,101,110,7,0,225,206,77,109,179,78,5,110,16,0,166,167,75,13,81,196,48,74,78,229,160,88,48,152,97,28,109,0,0,0,133,98,111,100,121,116,101,115,116,47,48,48,47,48,48,47,48,56,47,48,48,48,48,48,56,55,51,57,57,102,53,55,98,56,53,54,100,57,52,56,50,99,101,53,50,57,50,55,56,99,99,100,53,100,50,55,101,98,51,101,49,49,57,98,53,97,48,99,50,102,99,98,56,57,49,97,52,55,102,100,48,49,56,49,56,51,98,56,51,100,102,101,55,53,99,99,102,51,54,97,53,57,55,101,48,50,101,98,56,50,57,52,54,100,99,48,48,53,97,48,101,48,48,48,48,48,48,48,48,48,48,46,120,122,110,5,0,223,179,127,210,14,97,0>>},10000]}}

Info log on storage_2:

[I] storage_2@192.168.3.55  2017-05-04 17:16:12.956911 +0300    1493907372  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,23920}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:12.958225 +0300    1493907372  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,21096}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:12.958522 +0300    1493907372  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{processing_time,19109}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:35.444648 +0300    1493907395  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,41450}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:35.445099 +0300    1493907395  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{processing_time,12582}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:35.445427 +0300    1493907395  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,10552}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:42.958232 +0300    1493907402  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:53.6668 +0300  1493907413  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/25/03/16/250316ef50b26272f99b757409d75c173135d2ef09d972821072348ad071e49897dd7245c1f250db6489a401aea567d9886e000000000000.xz">>},{processing_time,24030}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:53.7199 +0300  1493907413  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/1a/82/8c/1a828c7a9d7a334714f91a8d1c56a4ec30a8dd4998c9db79f9dfed87be084a73aa090513d535e36186a986822b1d6ca9bc74010000000000.xz">>},{processing_time,18711}]
[I] storage_2@192.168.3.55  2017-05-04 17:16:55.327215 +0300    1493907415  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,53321}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:16.830422 +0300    1493907436  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,69821}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:16.830874 +0300    1493907436  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{processing_time,20975}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:16.831683 +0300    1493907436  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/f0/98/63/f0986371cf98c032e6c870ae6f4a26fac08ae91958c14c86478b39b758ea58953095a32412961708a0a9090e0d2da4edf615cc0100000000.xz\n1">>},{processing_time,19056}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:16.832010 +0300    1493907436  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/4e/18/4f/4e184f78326e5665991244965ecf1a3bca129ed4353adb0d8cc63a5c7d8a7a49a7ade04120ba3a5c75e18c5be2da79ffa829a40000000000.xz\n2">>},{processing_time,11818}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:16.832350 +0300    1493907436  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,63874}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:16.832687 +0300    1493907436  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/3e/b3/bd/3eb3bde5f5e58a67db86dbea8dd6850810fabdf41ee1e65ba1dd8395279175259b0fc7cf9b4b60f1cbc200d2d8bd541e00e6000000000000.xz">>},{processing_time,63874}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:34.514878 +0300    1493907454  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,76471}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:34.515241 +0300    1493907454  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/4e/cc/ef/4eccef1b917e48d1df702faab63181162c7a8f67998d7b5ef11ac33940ffe6362a8d1671c5e5f2c39945669b1d04f1ef0027720000000000.xz\n1">>},{processing_time,17146}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:34.515530 +0300    1493907454  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/33/8b/4c/338b4c25ca8fbb66a86f24bfc302f2fa4a9c657074c14e41692e5864a121849c4ad9a0f7342a35a16fc906d159980560782b010000000000.xz">>},{processing_time,7655}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.878082 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,fetch},{key,<<"bodytest">>},{processing_time,83833}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.878513 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/0e/6e/8b/0e6e8bcdbc732024193f7114b3f7d607333a9d3212a71e7104aea2b2b3bc137514eadd9c4d7de516e345feb9764186d9389d000000000000.xz">>},{processing_time,22009}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.878950 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/17/f5/9f/17f59f95c7bd20ea310cf7bd14d0c2cc9890444c621b859e03f879ccf2700936abeafbd3d62deee9ed2e58bfa86107e4cea8040100000000.xz\n3">>},{processing_time,8458}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.879426 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,head},{key,<<"bodytest/e3/32/e0/e332e00f0f77cf322cbc1d7a30369f8681073076c49868ab9cd5cee6043dfe3ebc8c355ae5899b74602ba763dcba872450bc560000000000.xz\n1">>},{processing_time,5269}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.879704 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{processing_time,71434}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.880035 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{processing_time,71433}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.880362 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{processing_time,51552}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.881471 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/34/86/01/348601a3d08bf38a4cb5bd8e22dae951d689def13b7bd1cc9c08cd0200a3cd52c6e015bf3be722d810a94f132752faf278c3000000000000.xz">>},{processing_time,30049}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.881907 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/f0/98/63/f0986371cf98c032e6c870ae6f4a26fac08ae91958c14c86478b39b758ea58953095a32412961708a0a9090e0d2da4edf615cc0100000000.xz\n1">>},{processing_time,30050}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.882233 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/4e/18/4f/4e184f78326e5665991244965ecf1a3bca129ed4353adb0d8cc63a5c7d8a7a49a7ade04120ba3a5c75e18c5be2da79ffa829a40000000000.xz\n2">>},{processing_time,30050}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.886477 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/4e/cc/ef/4eccef1b917e48d1df702faab63181162c7a8f67998d7b5ef11ac33940ffe6362a8d1671c5e5f2c39945669b1d04f1ef0027720000000000.xz\n1">>},{processing_time,12372}]
[I] storage_2@192.168.3.55  2017-05-04 17:17:46.887370 +0300    1493907466  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,delete},{key,<<"bodytest/33/8b/4c/338b4c25ca8fbb66a86f24bfc302f2fa4a9c657074c14e41692e5864a121849c4ad9a0f7342a35a16fc906d159980560782b010000000000.xz">>},{processing_time,12373}]

Error log on storage_2:

[W] storage_2@192.168.3.55  2017-05-04 17:16:21.862361 +0300    1493907381  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-05-04 17:16:22.862268 +0300    1493907382  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/1a/11/87/1a118728e175f40a10b6390f3f579bfd3a5754401763708c8ef8f0b3bd9e5d84fbdcbb167fa850291032fcbbcd4439ef28d4000000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-05-04 17:16:52.865051 +0300    1493907412  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-05-04 17:16:53.865848 +0300    1493907413  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/06/56/e3/0656e37a1ff09fb11abf969cb0b795905d3ed78087be15c01ca8e5b840395ca076c82eae02ba6e0f84f9d90dbf0f3a300600100000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-05-04 17:17:23.868648 +0300    1493907443  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-05-04 17:17:24.867539 +0300    1493907444  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/05/67/2b/05672be039f98d72ef426d413ae66d6aa33b472625a448879569a25ca29cdb8699580886a9759e357e97cc834bef15dd84d2000000000000.xz">>},{cause,timeout}]

To summarize the problems:

  1. Timeouts on the gateway - not really a problem, as long as the operation continues asynchronously.
  2. Typos "mehtod,delete", "mehtod,head", "mehtod,fetch" in the info log. Note that it's spelled correctly in the error log :)
  3. The fact that the delete operation did not complete (I picked ~4300 random object names and executed "whereis" for them; around 1750 of them were marked as "deleted" on all nodes and around 2500 weren't deleted on any of them). A sketch of how I checked the sample is at the end of this comment.
  4. The fact that the delete queues got stuck. How do I "unfreeze" them? Reboot the storage nodes? (not a problem, I'm just keeping them like that for now in case there is something else to try). There are no errors or anything right now (however, debug logs are not enabled); the state of all nodes is "running", but the delete queue is not being processed on storage_1 and storage_2.
  5. These lines in the log of storage_1:
    [I] storage_1@192.168.3.54  2017-05-04 17:17:13.637206 +0300    1493907433  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,16918}]
    [I] storage_1@192.168.3.54  2017-05-04 17:17:13.641295 +0300    1493907433  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{mehtod,put},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{processing_time,11937}]

    and on storage_0:

    [W] storage_0@192.168.3.53  2017-05-04 17:17:47.534739 +0300    1493907467  leo_storage_read_repairer:compare/4 165 [{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907416708152},{cause,primary_inconsistency}]
    [W] storage_0@192.168.3.53  2017-05-04 17:17:47.538512 +0300    1493907467  leo_storage_read_repairer:compare/4 165 [{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]
    [W] storage_0@192.168.3.53  2017-05-04 17:17:47.542151 +0300    1493907467  leo_storage_read_repairer:compare/4 165 [{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]
    [W] storage_0@192.168.3.53  2017-05-04 17:17:47.549344 +0300    1493907467  leo_storage_read_repairer:compare/4 165 [{node,'storage_1@192.168.3.54'},{addr_id,192490066507992604465461441302734706270},{key,<<"body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz">>},{clock,1493907421704031},{cause,primary_inconsistency}]

What happened here is that "minor load" I mentioned. Basically, at 17:17:13 the application tried to do a PUT of the object body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz. That's a very small object, 27 KB in size. A few moments after the successful PUT, a few (5, I believe) other applications did a GET for that object. However, they were all using the same gateway with caching enabled, so they should've gotten the object from the memory cache (at worst the gateway would've checked the ETag against a storage node). 17:17:13 was in the middle of the "delete bucket" operation, so I suppose the large "processing time" for the PUT was expected. But why the "read_repairer" errors and "primary_inconsistency"?? storage_0 is the "primary" node for this object:

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | storage_0@192.168.3.53      | 90d03d01c8e65bffaba62e8fac56165e     |        25K |   dd009e23e7 | false          |              0 | 54eb36e9dcecf  | 2017-05-04 17:17:25 +0300
       | storage_1@192.168.3.54      | 90d03d01c8e65bffaba62e8fac56165e     |        25K |   dd009e23e7 | false          |              0 | 54eb36e9dcecf  | 2017-05-04 17:17:25 +0300
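
Regarding item 3 in the summary above, the sampling check was done roughly like this (a sketch; keys.txt is a hypothetical file holding the ~4300 sampled object names, and the resulting whereis output was tallied by hand):

# dump whereis output for every sampled key, so the "deleted on all nodes" vs
# "not deleted anywhere" cases can be counted (keys.txt is assumed to exist)
while read -r key; do
    /usr/local/bin/leofs-adm whereis "$key"
done < keys.txt > whereis.out
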
mocchira commented 7 years ago

WIP

Problems

Related Issues

mocchira commented 7 years ago

@vstax Thanks for reporting in detail.

Timeouts on gateway - not really a problem, as long as operation goes on asynchronously

As I commented above, there are some problems.

Typos "mehtod,delete", "mehtod,head", "mehtod,fetch" in info log. Note that it's correct in error log :)

These are not typos (the head/fetch methods are used internally during a delete bucket operation).

The fact that delete operation did not complete (I have picked a ~4300 random object names and executed "whereis" for them; around 1750 of them was marked as "deleted" on all nodes and around 2500 weren't deleted on any of them).

As I answered above, a restart can cause delete operations to stop in the middle.

The fact that delete queues got stuck. How do I "unfreeze" them? Reboot storage nodes? (not a problem, I'm just keeping them like that for now in case there is something else to try). There no errors or anything right now (however, debug logs are no enabled); state of all nodes is "running", but delete queue is not being processed on storage_1 and storage_2.

It seems the delete queues are actually freed up, even though the number mq-stats displays is non-zero. This kind of inconsistency between the actual items in a queue and the number mq-stats displays can happen when a restart happens on leo_storage. We will sort out this inconsistency problem somehow.

EDIT: filed this inconsistency problem on https://github.com/leo-project/leofs/issues/731

These lines in log of storage_1

WIP.

yosukehara commented 7 years ago

I've made a diagram of the bucket deletion processing to clarify how to fix this issue; the diagram also covers #150.

[diagram: leofs-del-bucket-processing]

vstax commented 7 years ago

@mocchira @yosukehara Thank you for analyzing.

This seems like a complicated issue; when looking at #150 and #701 I thought this was supposed to work as long as I don't create a bucket with the same name again, but apparently I had too high hopes.

Too much retries going on in parallel behind the scene

I can't do anything about retries from leo_gateway to leo_storage, but I can try it with a different S3 client that performs the "delete bucket" operation only once, without retries, and share whether it works any better. However, I've stumbled into something else regarding the queues, so I'll leave everything be for now.
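
One candidate for that single-attempt client, when I get to it, might be awscli with client-side retries disabled - a rough sketch, assuming a recent awscli that honors AWS_MAX_ATTEMPTS and assuming the gateway endpoint (the URL/port below is a placeholder for gateway_0) and credentials are already configured:

# hypothetical single-attempt delete-bucket via awscli: no client-side retries,
# no read timeout; the endpoint URL/port is an assumption for gateway_0
AWS_MAX_ATTEMPTS=1 aws s3api delete-bucket \
    --bucket bodytest \
    --endpoint-url http://192.168.3.52:8080 \
    --cli-read-timeout 0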

This is not typos (method head/fetch are used during a delete bucket operation internally).

No, not that one - the "mehtod" part is a typo. From here: https://github.com/leo-project/leo_object_storage/blob/develop/src/leo_object_storage_event.erl#L55
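
It's easy to spot in a checkout of leo_object_storage, e.g.:

# the misspelled atom used in the slow-operation log tuples
grep -n "mehtod" src/leo_object_storage_event.erl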

It seems actually delete queues are freed up however the number mq-stats displays is non-zero. This kind of inconsistency between the actual items in a queue and the number mq-stats display can happen in case the restart happen on leo_storage. We will get over this inconsistency problem somehow.

I've restarted the managers - the queue size didn't change. I've restarted storage_1 - and the queue started to shrink. I got ~100% CPU usage on that node.

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 53556          | 1440           | 550            | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 53556          | 1280           | 600            | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 53556          | 800            | 750            | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 52980          | 480            | 850            | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 52800          | 480            | 850            | async deletion of objs                      

After a minute or two I got two errors in the error.log of storage_1:

[W] storage_1@192.168.3.54  2017-05-10 13:17:43.733429 +0300    1494411463  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-10 13:17:44.732449 +0300    1494411464  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{cause,timeout}]

The CPU usage went to 0 and the queue "froze" again. But half a minute later I see the same 100% CPU usage again. Then it goes to 0 again. Then high again. All this time, the "number of msgs" in the queue doesn't change, but the "interval" number changes:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 1450           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 1700           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 2200           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 2900           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 52440          | 0              | 2950           | async deletion of objs                      
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   idling    | 52440          | 0              | 3000           | async deletion of objs                      

After this, at some point the mq-stats command for this node started to work really slowly, taking 7-10 seconds to respond if executed during a period of 100% CPU usage. Nothing else in the error logs all this time. I see the same values (52440 / 0 / 3000 for the async deletion queue, 0 / 0 / 0 for all the others) but it takes 10 seconds to respond. It's still fast during the 0% CPU usage periods, but since the node switches between these all the time now, it's pretty random.
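
For reference, the response time can be checked simply by timing the call, e.g.:

# rough check of mq-stats responsiveness during a 100% CPU period
time /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54 | grep leo_async_deletion_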

I had debug logs enabled, and I saw lots of lines in the storage_1 debug log during this time. At first it was like this:

[D] storage_1@192.168.3.54  2017-05-10 13:17:13.74131 +0300 1494411433  leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/72/80/4f/72804f11dd276935ff759f28e4363761b6b2311ab33ffb969a41d33610c17a78e56971eeaa283bc5724ebff74c9797a27822010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54  2017-05-10 13:17:13.74432 +0300 1494411433  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/5b/e0/39/5be039360a4f0050e39c44eafde1ba847bd54593885605f22e06f4ee351e081cf75e5820483bbb11e6350d7cd2853542c495000000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-10 13:17:13.74707 +0300 1494411433  leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/5b/e0/39/5be039360a4f0050e39c44eafde1ba847bd54593885605f22e06f4ee351e081cf75e5820483bbb11e6350d7cd2853542c495000000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54  2017-05-10 13:17:13.74915 +0300 1494411433  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-10 13:17:13.75166 +0300 1494411433  leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54  2017-05-10 13:17:13.75400 +0300 1494411433  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{req_id,0}]

Then (note the gap in time! This - 13:25 - is a few minutes after the queue got "stuck" at the 52404 mark. Could it be that something restarted and the queue got "unstuck" for a moment here?):

[D] storage_1@192.168.3.54  2017-05-10 13:18:02.921132 +0300    1494411482  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/11/f3/aa/11f3aafb5d279afbcbb0ad9ff76a24f806c5fa1bd64eb54691629363dd0771394f81e4eb216e489d5169395736e80d992078020000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-10 13:18:02.922308 +0300    1494411482  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/7a/e0/82/7ae0820cb42d3224fc9ac54b86e6f4c21ea567c81c91d65f524cd27e4777cb5fd3ff4d415ec8b2529c4da616f58b830ec844010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-10 13:27:18.952873 +0300    1494412038  null:null   0   Supervisor inet_gethost_native_sup started undefined at pid <0.10159.0>
[D] storage_1@192.168.3.54  2017-05-10 13:27:18.953587 +0300    1494412038  null:null   0   Supervisor kernel_safe_sup started inet_gethost_native:start_link() at pid <0.10158.0>
[D] storage_1@192.168.3.54  2017-05-10 13:27:52.990768 +0300    1494412072  leo_storage_handler_object:put/4    404 [{from,storage},{method,delete},{key,<<"bodytest/b1/28/81/b12881f64bd8bb9e7382dc33bad442cdc91b0372bcdbbf1dcbd9bacda421e9a2ee24d479dba47d346c0b89bc06e74dc62540010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-10 13:27:52.995161 +0300    1494412072  leo_storage_handler_object:put/4    404 [{from,storage},{method,delete},{key,<<"bodytest/8a/7a/71/8a7a715855dabae364d61c1c05a5872079a5ca82588e894fdc83c647530c50cb0c910981b2b4cf62ac9625983fee7661d840010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-10 13:27:52.998699 +0300    1494412072  leo_storage_handler_object:put/4    404 [{from,storage},{method,delete},{key,<<"bodytest/96/35/56/963556c85b8a97d1d6d6b3a5f33f649dcdd6c9d89729c7c517d364f8c498eb5e214c1af2d694299d50f504f42f31fd60a816010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-10 13:27:53.294 +0300   1494412073  leo_storage_handler_object:put/4    404 [{from,storage},{method,delete},{key,<<"bodytest/5a/3a/e0/5a3ae0c07352fdf97d3720e4afdec76ba4c3e2f60ede654f675ce68e9b5f749fd40e6bc1b3f5855c1c085402c0b3ece9a0ef000000000000.xz">>},{req_id,0}]

At some point (13:28:40 to be precise), the messages stopped appearing.

I've repeated the experiment with storage_2 and at first the situation was exactly the same, just with different numbers. However, unlike storage_1, there are other messages in the error log:

[E] storage_2@192.168.3.55  2017-05-10 13:30:04.679350 +0300    1494412204  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_2@192.168.3.55  2017-05-10 13:30:06.182672 +0300    1494412206  null:null   0   Error in process <0.23852.0> on node 'storage_2@192.168.3.55' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_2@192.168.3.55  2017-05-10 13:30:06.232671 +0300    1494412206  null:null   0   Error in process <0.23853.0> on node 'storage_2@192.168.3.55' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_2@192.168.3.55  2017-05-10 13:30:09.680281 +0300    1494412209  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_2@192.168.3.55  2017-05-10 13:30:14.681474 +0300    1494412214  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

The last line repeats lots of times, endlessly. I can't execute "mq-stats" for this node anymore: it returns instantly without any results (as happens when a node isn't running). However, its status is indeed "running":

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55
[root@leo-m0 ~]# /usr/local/bin/leofs-adm status |grep storage_2
  S    | storage_2@192.168.3.55      | running      | c1d863d0       | c1d863d0       | 2017-05-10 13:27:51 +0300
[root@leo-m0 ~]# /usr/local/bin/leofs-adm status storage_2@192.168.3.55
--------------------------------------+--------------------------------------
                Item                  |                 Value                
--------------------------------------+--------------------------------------
 Config-1: basic
--------------------------------------+--------------------------------------
                              version | 1.3.4
                     number of vnodes | 168
                    object containers | - path:[/mnt/avs], # of containers:8
                        log directory | /var/log/leofs/leo_storage/erlang
                            log level | debug
--------------------------------------+--------------------------------------
 Config-2: watchdog
--------------------------------------+--------------------------------------
 [rex(rpc-proc)]                      |
                    check interval(s) | 10
               threshold mem capacity | 33554432
--------------------------------------+--------------------------------------
 [cpu]                                |
                     enabled/disabled | disabled
                    check interval(s) | 10
               threshold cpu load avg | 5.0
                threshold cpu util(%) | 90
--------------------------------------+--------------------------------------
 [disk]                               |
                     enabled/disalbed | enabled
                    check interval(s) | 10
                threshold disk use(%) | 85
               threshold disk util(%) | 90
                    threshold rkb(kb) | 98304
                    threshold wkb(kb) | 98304
--------------------------------------+--------------------------------------
 Config-3: message-queue
--------------------------------------+--------------------------------------
                   number of procs/mq | 8
        number of batch-procs of msgs | max:3000, regular:1600
   interval between batch-procs (ms)  | max:3000, regular:500
--------------------------------------+--------------------------------------
 Config-4: autonomic operation
--------------------------------------+--------------------------------------
 [auto-compaction]                    |
                     enabled/disabled | disabled
        warning active size ratio (%) | 70
      threshold active size ratio (%) | 60
             number of parallel procs | 1
                        exec interval | 3600
--------------------------------------+--------------------------------------
 Config-5: data-compaction
--------------------------------------+--------------------------------------
  limit of number of compaction procs | 4
        number of batch-procs of objs | max:1500, regular:1000
   interval between batch-procs (ms)  | max:3000, regular:500
--------------------------------------+--------------------------------------
 Status-1: RING hash
--------------------------------------+--------------------------------------
                    current ring hash | c1d863d0
                   previous ring hash | c1d863d0
--------------------------------------+--------------------------------------
 Status-2: Erlang VM
--------------------------------------+--------------------------------------
                           vm version | 7.3
                      total mem usage | 158420648
                     system mem usage | 107431240
                      procs mem usage | 50978800
                        ets mem usage | 5926016
                                procs | 428/1048576
                          kernel_poll | true
                     thread_pool_size | 32
--------------------------------------+--------------------------------------
 Status-3: Number of messages in MQ
--------------------------------------+--------------------------------------
                 replication messages | 0
                  vnode-sync messages | 0
                   rebalance messages | 0
--------------------------------------+--------------------------------------

To conclude: it seems to me that it's not that the queue numbers are fake and the queues are actually empty - there really is stuff in the queues. Restarting makes them process again for a while, but pretty soon they get stuck once more. Plus, the situation seems different for storage_1 and storage_2.

mocchira commented 7 years ago

@vstax thanks for the detailed info.

No, not that one. "mehtod" part is a typo. From here: https://github.com/leo-project/leo_object_storage/blob/develop/src/leo_object_storage_event.erl#L55

Oops. Got it :)

I've restarted managers, queue size didn't change. I've restarted storage_1 - and the queue started to reduce. I got ~100% CPU usage on that node.

The queue size didn't change because it displays an invalid number that differs from the actual one, caused by https://github.com/leo-project/leofs/issues/731.

The CPU usage went to 0 and the queue "froze" again. But about half a minute later I see the same 100% CPU usage again. Then it goes to 0 again. Then high again. All this time, the "number of msgs" in the queue doesn't change, but the "interval" number does:

It seems the half-a-minute cycle is caused by https://github.com/leo-project/leofs/issues/728. The CPU usage fluctuating from 100 to 0 and back repeatedly might imply there are some items that can't be consumed and keep existing for some reason. I will vet this in detail.

EDIT: found the fault here: https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_storage/src/leo_storage_mq.erl#L342-L363 In short, there are some items in QUEUE_ID_ASYNC_DELETION that can't be consumed: they keep existing if the target object was already deleted.
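
To make the failure mode concrete, here is a minimal sketch (not the actual leo_storage_mq code; the module and function names are hypothetical stand-ins): if the consumer callback treats {error, not_found} from the storage layer as a failure, the message is never acknowledged and gets retried forever, whereas an already-deleted target should be treated as successfully consumed.

%% Sketch of the intended consumer behaviour for the async-deletion queue.
-module(async_deletion_sketch).
-export([handle_message/1]).

handle_message({delete, Key}) ->
    case delete_object(Key) of
        ok ->
            %% object deleted now: the queue message is consumed
            ok;
        {error, not_found} ->
            %% object was already deleted: consume the message as well,
            %% instead of leaving it in the queue to be retried forever
            ok;
        {error, Reason} ->
            %% transient failure: keep the message for a later retry
            {error, Reason}
    end.

%% Hypothetical stand-in for the real storage-layer delete call.
delete_object(_Key) ->
    ok.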

After this, at some point the mq-stats command for this node started to work really slow, taking 7-10 seconds to respond if executed during a period of 100% CPU usage.

A slow response from any command issued through leofs-adm is one of the symptoms that the Erlang runtime is overloaded. If you can reproduce this, would you execute https://github.com/leo-project/leofs_doctor against the overloaded node? The output would make it easier for us to debug in detail.

To conclude: it seems to me that it's not that the queue numbers are fake and the queues are actually empty - there really is stuff in the queues. Restarting makes them process again for a while, but pretty soon they get stuck once more. Plus, the situation seems different for storage_1 and storage_2.

It seems both are true: the displayed number is fake (see https://github.com/leo-project/leofs/issues/731), and there is also stuff stuck in the queues.

mocchira commented 7 years ago

TODO

@vstax we will ask you to do the same test after we finish all TODOs described above.

yosukehara commented 7 years ago

@mocchira

Fix QUEUE_ID_ASYNC_DELETION to consume items properly even if the item was already deleted.

I've recognized that leo_storage_mq has a bug in its handling of {error, not_found}.

I'll send a PR and its fix will be included in v1.3.5.

vstax commented 7 years ago

@mocchira

It seems the half-a-minute cycle is caused by #728.

Well, it started to happen before I stopped storage_2, so all the nodes were running; also, 30 seconds was just a very rough estimate from looking at top; it could have been something in the 10-20 second range as well, as I wasn't paying strict attention (it wasn't anything less than 10 seconds for sure). I might be misunderstanding #728, though.

After all, there are some items that can't be consumed in QUEUE_ID_ASYNC_DELETION. They keep existing if the target object was already deleted.

Interesting find! I did tests with double-deletes and deletes of non-existent objects before, but that was on 1.3.2.1, before the changes to the queue mechanism.

If you can reproduce, would you like to execute https://github.com/leo-project/leofs_doctor against the overloaded node?

I will. That load pattern (100-0-100-...) on storage_1 ended around 13:28, when the errors and messages in the debug log stopped appearing. The queue isn't being consumed, but the node itself is fine.

storage_2 is still in bad shape: it doesn't respond to the mq-stats command and spits out

[E] storage_2@192.168.3.55  2017-05-11 10:43:35.310528 +0300    1494488615  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

every 5 seconds. Also, the errors that I saw on storage_1 never appeared in the storage_2 log.

However, when I restart nodes I'll probably see something else.

A question: you've found a case where messages in the queue can stop being processed. But after I restarted the nodes, I clearly had > 1000 messages disappear from the queue on each node. Besides lots of cases of double messages like

info.20170510.13.1:[D]  storage_2@192.168.3.55  2017-05-10 13:27:50.400569 +0300    1494412070  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/6e/15/f6/6e15f6d4febdf823f6f8af7e1f0947ee05a5a905875c3748d12f472831421ce00eefc659d884cc998dadd2bc3d4fc1fd30cc000000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D]  storage_2@192.168.3.55  2017-05-10 13:27:50.400833 +0300    1494412070  leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/6e/15/f6/6e15f6d4febdf823f6f8af7e1f0947ee05a5a905875c3748d12f472831421ce00eefc659d884cc998dadd2bc3d4fc1fd30cc000000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]

there were also quite a few successful deletes like

info.20170510.13.1:[D]  storage_2@192.168.3.55  2017-05-10 13:28:40.717438 +0300    1494412120  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/fc/e3/a3/fce3a3f19655893ef1113627be71afe416987e6770337940e7d533662d7821fa8e74463d4c41ca1fdcd526c6ffb3a14e00ea090000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D]  storage_2@192.168.3.55  2017-05-10 13:28:40.719168 +0300    1494412120  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/f5/6b/01/f56b019f9b473ccb07efbf5091d3ce257b1dcfce862669b2684be231c4f028ce92e8b4fc2dd1ac58248210ac99744ea60018000000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D]  storage_2@192.168.3.55  2017-05-10 13:28:40.723881 +0300    1494412120  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/c4/3c/46/c43c46dd688723e79858c0af76107cc370ad7aebbac60c604de7a8bee450b9b78f3c8222272aefd3bc66579cf3fb12ca10c4000000000000.xz">>},{req_id,0}]

on both storage_1 and storage_2. So somehow a node restart makes part of the queue get processed, even though it wasn't being processed while the node was running.

I could upload work/queue/4 (around 60 MB for both nodes) somewhere, then restart the nodes and see if this (successful processing of part of the queue) happens again. Would the queue contents help you in debugging?

mocchira commented 7 years ago

@yosukehara https://github.com/leo-project/leofs/issues/732

mocchira commented 7 years ago

@vstax

Well, it started to happen before I stopped storage_2, so all the nodes were running; also, 30 seconds was just a very rough estimate from looking at top; it could have been something in the 10-20 second range as well, as I wasn't paying strict attention (it wasn't anything less than 10 seconds for sure). I might be misunderstanding #728, though.

Since there are multiple consumer processes/files (IIRC, the default is 4 or 8) per queue (in this case ASYNC_DELETION), the period can vary below 30 seconds.

Interesting find! I did tests with double-deletes and deletes of non-existent objects before, but that was on 1.3.2.1, before the changes to the queue mechanism.

Makes sense!

I will. That load pattern (100-0-100-...) on storage_1 ended around 13:28, when the errors and messages in the debug log stopped appearing. The queue isn't being consumed, but the node itself is fine.

Thanks.

A question: you've found a case where messages in the queue can stop being processed. But after I restarted the nodes, I clearly had > 1000 messages disappear from the queue on each node. Besides lots of cases of double messages like ... on both storage_1 and storage_2. So somehow a node restart makes part of the queue get processed, even though it wasn't being processed while the node was running.

Seems like something I haven't noticed is still there. Your queue files might help me debug further.

I could upload work/queue/4 (around 60 MB for both nodes) somewhere, then restart the nodes and see if this (successful processing of part of the queue) happens again. Would the queue contents help you in debugging?

Yes! Please share via anything you like. (Off topic: previously you shared some stuff via https://cloud.mail.ru/ and that was amazingly fast, the fastest one I've ever used :)

vstax commented 7 years ago

@mocchira I've packed the queues (the nodes were running but there was no processing) and uploaded them to https://www.dropbox.com/s/78uitcmohhuq3mq/storage-queues.tar.gz?dl=0 (it's not such a big file so dropbox should work, I think?).

Now, after I restarted storage_1... a miracle! The queue was fully consumed, without any errors in the logs or anything. The debug log was as usual:

[D] storage_1@192.168.3.54  2017-05-11 22:44:09.681558 +0300    1494531849  leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/76/74/02/767402b5880aa54206793cb197e3fccf4bacf4e516444cd6c88eeea8c9d25af461bb30bcb513041ac033c8db12e7e67e4c09010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54  2017-05-11 22:44:09.681905 +0300    1494531849  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/2a/2e/88/2a2e88feb2ed55c266961a2fcfd80b9f5f02d48fd757e79e3ac9268d1c45139334492579bc98db2e8d53338097239f4e28fe010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-11 22:44:09.682166 +0300    1494531849  leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/2a/2e/88/2a2e88feb2ed55c266961a2fcfd80b9f5f02d48fd757e79e3ac9268d1c45139334492579bc98db2e8d53338097239f4e28fe010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54  2017-05-11 22:44:09.682426 +0300    1494531849  leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/3d/e8/88/3de888009faa04a6860550b94d4bb2f19fe01958ad28229a38bf4eeafd399d5a569d4130b008b48ab6d51889add0aa2e2570010000000000.xz">>},{req_id,0}]
[..skipped..]
[D] storage_1@192.168.3.54  2017-05-11 22:48:41.454128 +0300    1494532121  leo_storage_handler_object:put/4    404 [{from,storage},{method,delete},{key,<<"bodytest/58/15/b6/5815b6600a1d5aa3c46b00dffa3e0a9da7c50f7c75dc4058bbc503f6aca8c74396ce93889a7864ad14207c98445b914da443000000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54  2017-05-11 22:48:41.455928 +0300    1494532121  leo_storage_handler_object:put/4    404 [{from,storage},{method,delete},{key,<<"bodytest/f0/d8/4d/f0d84d4f4b6cb071fb88f3107a00d87be6a849dc304ec7a738c9d7ac4f7e97f7e5ff30a6beff3536fe6267f8af26e57b3ce9000000000000.xz">>},{req_id,0}]

Error log - nothing to show. Queue state during these 4 minutes (removed some extra lines):

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 51355          | 1600           | 500            | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 46380          | 320            | 900            | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 46200          | 0              | 1000           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 37377          | 0              | 1400           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 23740          | 0              | 1550           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 13480          | 0              | 1750           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 1814           | 0              | 2000           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 0              | 0              | 2050           | async deletion of objs                      

I've restarted storage_2 as well. I got no '-decrease/3-lc$^0/1-0-' errors this time! Like, at all. At first the queue was processing, then it eventually got stuck:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_
 leo_async_deletion_queue       |   running   | 135142         | 1440           | 550            | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 131602         | 800            | 750            | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 129353         | 0              | 1000           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 129353         | 0              | 1700           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 129353         | 0              | 1700           | async deletion of objs                      
 leo_async_deletion_queue       |   running   | 129353         | 0              | 2200           | async deletion of objs                      
 leo_async_deletion_queue       |   idling    | 129353         | 0              | 3000           | async deletion of objs                      

I got just this in error log around the time it froze:

[W] storage_2@192.168.3.55  2017-05-11 22:48:21.736404 +0300    1494532101  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/b3/64/07/b36407b89a66f64c10e6298af4ba894cd6c2dc501dfd1b65f4567b182777f58c6485e8dc435e19af5b08960aceb946ed289e7d0000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-05-11 22:48:22.733389 +0300    1494532102  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/b3/64/07/b36407b89a66f64c10e6298af4ba894cd6c2dc501dfd1b65f4567b182777f58c6485e8dc435e19af5b08960aceb946ed289e7d0000000000.xz">>},{cause,timeout}]

Now it spends 10-20 seconds in the 100% CPU state, then switches back to 0, then 100% again, and so on, just like storage_1 did during the last experiment. And like last time, "mq-stats" makes me wait if executed during a 100% CPU usage period. The whole situation seems quite random...

EDIT: 1 hour after the experiment, everything is still the same: 100% CPU usage alternating with 0% CPU usage. Nothing in the error logs (besides some disk watchdog messages, as I'm over 80% disk usage on the volume with the AVS files). This is unlike the last experiment (https://github.com/leo-project/leofs/issues/725#issuecomment-300446740), when storage_1 wrote something about "Supervisor started" in its logs at some point and stopped consuming CPU soon after.

leofs_doctor logs: https://pastebin.com/y9RgXtEK https://pastebin.com/rsxLCwDN and https://pastebin.com/PMFeRxFH The first one, I think, was executed entirely during a 100% CPU usage period. The second one started during 100% CPU usage and its last 3 seconds or so fell in a near-0% CPU usage period. The third one was run without that "expected_svt" option (I don't know the difference, so I'm not sure which one you need); it started during 100% CPU usage and its last 4-5 seconds fell in a near-0% usage period.

EDIT: 21 hours after the experiment, the 100% CPU usage alternating with 0% CPU on storage_2 has stopped. Nothing related in the logs, really; neither in error.log nor in erlang.log. No mention of restarts or anything: according to sar, at 19:40 on May 12 the load was there, and at 19:50 and from that point on it wasn't. The leo_async_deletion_queue queue is unchanged, 129353 / 0 / 3000 messages, just like at the moment it stopped processing. Just in case, leofs_doctor logs from the current moment (note that there might be a very light load on this node, plus the disk space watchdog has triggered): https://pastebin.com/iUMn6uLX

mocchira commented 7 years ago

@vstax Still WIP although I'd like to share what I got at the moment.

Since the second one can be mitigated by reducing the number of leo_mq consumers, I will add this workaround to https://github.com/leo-project/leofs/issues/725#issuecomment-300706776.
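
For reference, the workaround amounts to a leo_storage.conf change along these lines; this is only a sketch (mq.num_of_mq_procs is the setting referenced later in this thread, and the status output above shows the current value of 8 procs per queue):

# leo_storage.conf: reduce the number of consumer processes per queue
# ("number of procs/mq" in the leofs-adm status output); as far as I
# understand, the change takes effect after a node restart
mq.num_of_mq_procs = 4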

mocchira commented 7 years ago

Design considerations for https://github.com/leo-project/leofs/issues/725#issuecomment-300412054.

vstax commented 7 years ago

I've repeated - or rather, tried to complete - this experiment by re-adding the "bodytest" bucket and removing it again on the latest dev version. I don't expect it to work perfectly, but I wanted to check how much the already-fixed issues helped. Debug logs are disabled to make sure the leo_logger problems won't affect anything, and mq.num_of_mq_procs = 4 is set.

This time, I made sure to abort the s3cmd rb s3://bodytest command after it had sent the "remove bucket" request once, so that it didn't try to repeat the request or anything. It's exactly the same system, but I estimate that over 60% of the original number of objects (1M) are still present in the "bodytest" bucket, so there was a lot of stuff to remove.

gateway logs:

[W] gateway_0@192.168.3.52  2017-05-26 22:00:26.733769 +0300    1495825226  leo_gateway_s3_api:delete_bucket_2/3    1798    [{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-26 22:00:31.734812 +0300    1495825231  leo_gateway_s3_api:delete_bucket_2/3    1798    [{cause,timeout}]

storage_0 info log:

[I] storage_0@192.168.3.53  2017-05-26 22:00:27.162670 +0300    1495825227  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,5354}]
[I] storage_0@192.168.3.53  2017-05-26 22:00:34.240077 +0300    1495825234  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7504}]
[I] storage_0@192.168.3.53  2017-05-26 22:00:34.324375 +0300    1495825234  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7162}]
[I] storage_0@192.168.3.53  2017-05-26 22:00:42.957679 +0300    1495825242  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8633}]
[I] storage_0@192.168.3.53  2017-05-26 22:00:43.469667 +0300    1495825243  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,9229}]
[I] storage_0@192.168.3.53  2017-05-26 22:00:50.241744 +0300    1495825250  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7284}]
[I] storage_0@192.168.3.53  2017-05-26 22:00:51.136573 +0300    1495825251  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7667}]
[I] storage_0@192.168.3.53  2017-05-26 22:00:59.20997 +0300 1495825259  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7884}]
[I] storage_0@192.168.3.53  2017-05-26 22:00:59.21352 +0300 1495825259  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz\n1">>},{processing_time,5700}]
[I] storage_0@192.168.3.53  2017-05-26 22:01:20.242104 +0300    1495825280  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_0@192.168.3.53  2017-05-26 22:01:21.264304 +0300    1495825281  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{processing_time,26450}]
[I] storage_0@192.168.3.53  2017-05-26 22:01:38.156285 +0300    1495825298  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,39136}]
[I] storage_0@192.168.3.53  2017-05-26 22:01:38.156745 +0300    1495825298  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/00/e4/ce/00e4ce45fd1c5de4122221d44289f4ace93dc0d046fead4f4d3549b7b756af04621a4ab684a1c7db7b8d5f017555484d90d7000000000000.xz">>},{processing_time,12339}]
[I] storage_0@192.168.3.53  2017-05-26 22:01:38.157114 +0300    1495825298  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/59/49/0f/59490fdb17b17ce75b31909675e7262db9b01a84f04792cbe2f7858d114c48efc5d2f1cf98190dcf9af96a12679cbdccf8e89a0000000000.xz\n2">>},{processing_time,10976}]
[I] storage_0@192.168.3.53  2017-05-26 22:01:38.157429 +0300    1495825298  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/63/29/b9/6329b983cb8e8ea323181e34d2d3b64403ff79671f6850a268406daab8fcf772009d994e804b5fc9611f0773a96d6cde94e3020100000000.xz\n3">>},{processing_time,10018}]
[I] storage_0@192.168.3.53  2017-05-26 22:01:38.158711 +0300    1495825298  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{processing_time,16894}]

Error log:

[E] storage_0@192.168.3.53  2017-05-26 21:58:54.581809 +0300    1495825134  leo_backend_db_eleveldb:first_n/2   282 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23172>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53  2017-05-26 21:58:54.582525 +0300    1495825134  leo_mq_server:handle_call/3 287 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23172>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53  2017-05-26 21:58:54.924670 +0300    1495825134  leo_backend_db_eleveldb:first_n/2   282 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23850>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53  2017-05-26 21:58:54.927313 +0300    1495825134  leo_mq_server:handle_call/3 287 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23850>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53  2017-05-26 21:58:55.42756 +0300 1495825135  leo_backend_db_eleveldb:first_n/2   282 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.24459>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53  2017-05-26 21:58:55.43297 +0300 1495825135  leo_mq_server:handle_call/3 287 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.24459>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[W] storage_0@192.168.3.53  2017-05-26 22:01:24.864263 +0300    1495825284  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-26 22:01:25.816594 +0300    1495825285  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-26 22:02:08.375387 +0300    1495825328  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-26 22:02:09.382327 +0300    1495825329  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]

storage_1 info log:

[I] storage_1@192.168.3.54  2017-05-26 22:00:26.993464 +0300    1495825226  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,5221}]
[I] storage_1@192.168.3.54  2017-05-26 22:00:34.450344 +0300    1495825234  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7456}]
[I] storage_1@192.168.3.54  2017-05-26 22:00:34.899198 +0300    1495825234  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8167}]
[I] storage_1@192.168.3.54  2017-05-26 22:00:34.900451 +0300    1495825234  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/9f/c1/62/9fc1621aacf06357ccd85ce7e43f4dc17eafe60bce3cb8cf864487f61ea4667ac7eded91411ee9e1fc0b7180119f29670400000000000000.xz">>},{processing_time,5424}]
[I] storage_1@192.168.3.54  2017-05-26 22:00:46.351992 +0300    1495825246  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,11453}]
[I] storage_1@192.168.3.54  2017-05-26 22:00:46.352702 +0300    1495825246  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/92/e4/fc/92e4fcc551dc03361f41f59e37eca7161d4dfb23fff803200bf1b990a2e81e0ed909d74c2613e90259c48c4a385702b1dc51010000000000.xz">>},{processing_time,8778}]
[I] storage_1@192.168.3.54  2017-05-26 22:01:00.258646 +0300    1495825260  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,25808}]
[I] storage_1@192.168.3.54  2017-05-26 22:01:00.259186 +0300    1495825260  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,15039}]
[I] storage_1@192.168.3.54  2017-05-26 22:01:21.291575 +0300    1495825281  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,34940}]
[I] storage_1@192.168.3.54  2017-05-26 22:01:21.292084 +0300    1495825281  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,29112}]
[I] storage_1@192.168.3.54  2017-05-26 22:01:21.292789 +0300    1495825281  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/2b/5c/f3/2b5cf31eaffd8e884937240a026abec1c6a48f66b042c08cca9b80250e9a58dd2216871bdb0dddbbaae4d6e7eb0896538498000000000000.xz">>},{processing_time,5069}]
[I] storage_1@192.168.3.54  2017-05-26 22:01:21.294835 +0300    1495825281  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,21036}]
[I] storage_1@192.168.3.54  2017-05-26 22:01:30.189080 +0300    1495825290  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,29930}]
[I] storage_1@192.168.3.54  2017-05-26 22:01:30.189895 +0300    1495825290  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/01/9c/aa/019caaf22c84f6e77c5f5597810faa55ef57c71a38a133cbe9d38c631e40d11434ff449f989d77d408571af4c06e11aeb475000000000000.xz">>},{processing_time,28628}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:00.189674 +0300    1495825320  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:08.370447 +0300    1495825328  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{processing_time,30001}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.574818 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/0f/df/87/0fdf870a9a237805c0282ba71c737966f2630124921b5c8709b6f470754b3e187eebdd30e80d404ccb700be646bc3c03bfa6020100000000.xz\n1">>},{processing_time,29259}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.575370 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/12/ed/21/12ed21380e8cb085deb10aa161feb131b581553ab1ead52e24ed88619b2ec7709d59b9e69b3d7bb0febc5930048bb1a0d8a2020100000000.xz\n2">>},{processing_time,17599}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.575744 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/29/95/03/2995035be6f7fbe86d6f4f76eba845bfc50338bd40535d9947e473779538a5ba6de5534672c3b5146fb5768b9e905a4318fa7b0000000000.xz\n1">>},{processing_time,40674}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.576122 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/4d/93/47/4d934795bb3006b7e35d99ca7485bfaa1b9cc1b8878fe11f260e0ffedb8e1d97f66221bfbb048ac5ce8298ae93e922be46e8020100000000.xz\n1">>},{processing_time,24915}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.576518 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/59/89/e7/5989e7825beeb82933706f559ab737cfe0eb88156471a29e0c6f6ae04c00576f0b0c5462f6714d2387a1856f99cdf3fc89ab040100000000.xz\n1">>},{processing_time,9456}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.576883 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/2f/99/d1/2f99d1ffa377ceda4341d1c0a85647f17fade7e8e375eafb1b8e1a17bd794fa9683a0546ed594ce2a18944c3e817498f00821a0100000000.xz\n1">>},{processing_time,38070}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.578804 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/63/7f/75/637f7568ee27aa13f0ccabc34d68faac4500535cb4c3f34b4b5d4349d80a6a96de46bcc04522f76debd1060647083a4850955c0000000000.xz\n1">>},{processing_time,8954}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.579637 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/07/7c/11/077c11796ee67c7a15027cf21b749ffbfd244c06980bf98a945acdd92b3404feb56609b8a0b177cd205d309e0d8310a6b0df5b0000000000.xz\n1">>},{processing_time,37267}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.580231 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/49/c8/3f/49c83ff8341d50259f4138707688613860802327ebb2e75d9019bda193c8ab82a3b66b4f7e92d4d9dc0f3d39c082010e5694370100000000.xz\n2">>},{processing_time,35610}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.581187 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/1a/9b/07/1a9b073aafa182620e4bb145507a097320bb4097ebec0dfddee3936a96e0cb83fc10ed7a7bfcd3f20456a3cdf0a373be3026700000000000.xz\n1">>},{processing_time,35500}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.581593 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/22/8c/64/228c648d769e51472f79670cbb804f9bee23d8d9ea6612ee4a21ea11b901ef60732e3657a2e4fb68ce26b745525ada7ab0b5790000000000.xz\n1">>},{processing_time,34513}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.582178 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/42/29/fb/4229fb92dac335eb214e0eef2d2bd59d25685ae9ace816f44eb4d37147921ad66b5be7ccc97938aacfdfc64c1e721f1ed2a1020100000000.xz\n2">>},{processing_time,20930}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.582963 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/14/d7/9b/14d79b2fd3b666cf511d1c4e55dddf2b44f998312dc0103cd26dd7227dba14ce0ddfe0e8e87a64d30e49f788081cd75a39bc000000000000.xz">>},{processing_time,14189}]
[I] storage_1@192.168.3.54  2017-05-26 22:02:23.583762 +0300    1495825343  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/39/b5/e3/39b5e371ed1f857e881725e5b491810862a291268efb395e948da6d83934bc19d3ef8fc7c5a9584bcd18bd174c3e080dfba2020100000000.xz\n1">>},{processing_time,13991}]

Error log:

[W] storage_1@192.168.3.54  2017-05-26 22:01:15.220528 +0300    1495825275  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-26 22:01:16.221833 +0300    1495825276  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{cause,timeout}]

storage_2 info log:

[I] storage_2@192.168.3.55  2017-05-26 22:00:34.873903 +0300    1495825234  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8110}]
[I] storage_2@192.168.3.55  2017-05-26 22:00:35.352063 +0300    1495825235  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8615}]
[I] storage_2@192.168.3.55  2017-05-26 22:00:35.359634 +0300    1495825235  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/9f/c1/62/9fc1621aacf06357ccd85ce7e43f4dc17eafe60bce3cb8cf864487f61ea4667ac7eded91411ee9e1fc0b7180119f29670400000000000000.xz">>},{processing_time,5868}]
[I] storage_2@192.168.3.55  2017-05-26 22:00:46.957075 +0300    1495825246  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,11605}]
[I] storage_2@192.168.3.55  2017-05-26 22:00:46.958526 +0300    1495825246  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/92/e4/fc/92e4fcc551dc03361f41f59e37eca7161d4dfb23fff803200bf1b990a2e81e0ed909d74c2613e90259c48c4a385702b1dc51010000000000.xz">>},{processing_time,9393}]
[I] storage_2@192.168.3.55  2017-05-26 22:00:46.958917 +0300    1495825246  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz\n2">>},{processing_time,7222}]
[I] storage_2@192.168.3.55  2017-05-26 22:01:04.874732 +0300    1495825264  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_2@192.168.3.55  2017-05-26 22:01:05.757004 +0300    1495825265  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,20530}]
[I] storage_2@192.168.3.55  2017-05-26 22:01:18.153498 +0300    1495825278  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,31196}]
[I] storage_2@192.168.3.55  2017-05-26 22:01:18.154052 +0300    1495825278  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,25974}]
[I] storage_2@192.168.3.55  2017-05-26 22:01:18.159729 +0300    1495825278  leo_object_storage_event:handle_event/2 54  [{cause,"slow operation"},{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,12403}]

Error log - empty.

Queue states: for storage_0, the queue reached this number within 30 seconds of the delete bucket operation:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53|grep leo_async_deletion
 leo_async_deletion_queue       |   idling    | 80439          | 1600           | 500            | async deletion of objs                      

which was dropping pretty fast

 leo_async_deletion_queue       |   running   | 25950          | 3000           | 0              | async deletion of objs                      

and reached 0 about 2-3 minutes after the start of the operation:

 leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs                      

For storage_1, the queue likewise got to this number within 30 seconds, but its status was "suspending":

 leo_async_deletion_queue       | suspending  | 171957         | 0              | 1700           | async deletion of objs                      

it was "suspending" all the time during experiment. It barely dropped and stays at this number even now:

 leo_async_deletion_queue       | suspending  | 170963         | 0              | 1500           | async deletion of objs                      

For storage_2, the number was this within 30 seconds after the start:

 leo_async_deletion_queue       |   idling    | 34734          | 0              | 2400           | async deletion of objs                      

It was dropping slowly (quite unlike storage_0) and had reached this number by the time it stopped reducing:

 leo_async_deletion_queue       |   idling    | 29448          | 0              | 3000           | async deletion of objs                      

At this point the system is stable: nothing is going on, there is no load, most of the objects from "bodytest" still aren't removed, and the queues for storage_1 and storage_2 are stalled at the numbers above. There is nothing else in the log files.

I stop storage_2 and make a backup of its queues (just in case). I start it; the number in the queue is the same at first. 20-30 seconds after the node starts, it begins to reduce. There are new messages in the error logs:

[W] storage_2@192.168.3.55  2017-05-26 22:36:53.397898 +0300    1495827413  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-05-26 22:36:54.377776 +0300    1495827414  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]

The number in the queue eventually drops to 0.

I stop storage_1, make a backup of its queues, and start it again. Right after startup the queue starts processing:

 leo_async_deletion_queue       |   running   | 168948         | 1600           | 500            | async deletion of objs                      

Then the CPU load on the node goes very high, the "mq-stats" command starts to hang, and I see this in the error log:

[W] storage_1@192.168.3.54  2017-05-26 22:49:59.603367 +0300    1495828199  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/3d/8c/78/3d8c78839ebc79cba43a1b57f138e1e3d4c422269f8aa522a55242b49cdc2ffca756d4d799f58dc0b6009f0f2e7a4638482a680000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-26 22:50:00.600085 +0300    1495828200  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/3d/8c/78/3d8c78839ebc79cba43a1b57f138e1e3d4c422269f8aa522a55242b49cdc2ffca756d4d799f58dc0b6009f0f2e7a4638482a680000000000.xz">>},{cause,timeout}]
[E] storage_1@192.168.3.54  2017-05-26 22:50:52.705543 +0300    1495828252  leo_mq_server:handle_call/3 287 {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:50:53.757233 +0300    1495828253  null:null   0   gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:50:54.262824 +0300    1495828254  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:50:54.264280 +0300    1495828254  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.316.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:51:03.461439 +0300    1495828263  null:null   0   gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:51:03.461919 +0300    1495828263  null:null   0   gen_fsm leo_async_deletion_queue_consumer_4_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:51:03.462275 +0300    1495828263  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:51:03.462926 +0300    1495828263  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.318.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:51:03.481700 +0300    1495828263  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:51:03.482332 +0300    1495828263  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.312.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:51:24.823088 +0300    1495828284  null:null   0   gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:51:24.823534 +0300    1495828284  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:51:24.825905 +0300    1495828284  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.20880.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.85988 +0300 1495828294  null:null   0   gen_fsm leo_async_deletion_queue_consumer_4_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.87305 +0300 1495828294  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.95578 +0300 1495828294  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.27909.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.522235 +0300    1495828294  null:null   0   gen_fsm leo_async_deletion_queue_consumer_1_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.525223 +0300    1495828294  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.539198 +0300    1495828294  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.27892.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.539694 +0300    1495828294  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.27892.1> exit with reason reached_max_restart_intensity in context shutdown
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.541076 +0300    1495828294  null:null   0   Supervisor leo_redundant_manager_sup had child undefined started with leo_mq_sup:start_link() at <0.220.0> exit with reason shutdown in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:51:34.976730 +0300    1495828294  null:null   0   Error in process <0.20995.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:35.122748 +0300    1495828295  null:null   0   Error in process <0.20996.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:35.140676 +0300    1495828295  null:null   0   Error in process <0.20997.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:35.211716 +0300    1495828295  null:null   0   Error in process <0.20998.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:35.367975 +0300    1495828295  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:51:36.17706 +0300 1495828296  null:null   0   Error in process <0.21002.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:36.68751 +0300 1495828296  null:null   0   Error in process <0.21005.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:36.273259 +0300    1495828296  null:null   0   Error in process <0.21011.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:37.246142 +0300    1495828297  null:null   0   Error in process <0.21018.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:37.625651 +0300    1495828297  null:null   0   Error in process <0.21022.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:38.192580 +0300    1495828298  null:null   0   Error in process <0.21024.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:38.461708 +0300    1495828298  null:null   0   Error in process <0.21025.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:38.462431 +0300    1495828298  null:null   0   Error in process <0.21026.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:39.324727 +0300    1495828299  null:null   0   Error in process <0.21033.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:39.851241 +0300    1495828299  null:null   0   Error in process <0.21043.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:40.5627 +0300  1495828300  null:null   0   Error in process <0.21049.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:40.369284 +0300    1495828300  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:51:40.523795 +0300    1495828300  null:null   0   Error in process <0.21050.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:41.56663 +0300 1495828301  null:null   0   Error in process <0.21052.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:41.317741 +0300    1495828301  null:null   0   Error in process <0.21057.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:42.785978 +0300    1495828302  null:null   0   Error in process <0.21069.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:42.812650 +0300    1495828302  null:null   0   Error in process <0.21070.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:42.984686 +0300    1495828302  null:null   0   Error in process <0.21071.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:43.815766 +0300    1495828303  null:null   0   Error in process <0.21078.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:44.817129 +0300    1495828304  null:null   0   Error in process <0.21085.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:45.370117 +0300    1495828305  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:51:46.199487 +0300    1495828306  null:null   0   Error in process <0.21097.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:46.502452 +0300    1495828306  null:null   0   Error in process <0.21099.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:47.770769 +0300    1495828307  null:null   0   Error in process <0.21103.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:47.987768 +0300    1495828307  null:null   0   Error in process <0.21108.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:48.516769 +0300    1495828308  null:null   0   Error in process <0.21112.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:48.524799 +0300    1495828308  null:null   0   Error in process <0.21113.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:48.813618 +0300    1495828308  null:null   0   Error in process <0.21114.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:50.370898 +0300    1495828310  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:51:50.872671 +0300    1495828310  null:null   0   Error in process <0.21136.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:51:55.372095 +0300    1495828315  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:00.373178 +0300    1495828320  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:05.373913 +0300    1495828325  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:10.375174 +0300    1495828330  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:15.375872 +0300    1495828335  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:20.376915 +0300    1495828340  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:25.377929 +0300    1495828345  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:30.378945 +0300    1495828350  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:35.379846 +0300    1495828355  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:40.381247 +0300    1495828360  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:45.381901 +0300    1495828365  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:52:50.383154 +0300    1495828370  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

The node isn't working at this point. I restart it, and the queue starts to process:

 leo_async_deletion_queue       |   running   | 122351         | 160            | 950            | async deletion of objs                      

The error log looks typical at first:

[W] storage_1@192.168.3.54  2017-05-26 22:54:44.83565 +0300 1495828484  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/d3/71/30/d37130689a5bb04e1270e85a0442d9944112eb84949360e0732c7313b91eaaf1ccbce0e74a0b9f88917377fe9d08127c38935f0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-26 22:54:44.690582 +0300    1495828484  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/dc/ad/01/dcad01a27ba985514931ae379940fcd8021ecaab7d47e948eb41b3bdeac6808c305a2fd15fbe015dc4c2a542be000846107f830000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-26 22:54:45.79657 +0300 1495828485  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/d3/71/30/d37130689a5bb04e1270e85a0442d9944112eb84949360e0732c7313b91eaaf1ccbce0e74a0b9f88917377fe9d08127c38935f0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-26 22:54:45.689791 +0300    1495828485  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/dc/ad/01/dcad01a27ba985514931ae379940fcd8021ecaab7d47e948eb41b3bdeac6808c305a2fd15fbe015dc4c2a542be000846107f830000000000.xz">>},{cause,timeout}]

but then the mq-stats command starts to freeze, and I get this:

[E] storage_1@192.168.3.54  2017-05-26 22:55:35.421877 +0300    1495828535  null:null   0   gen_fsm leo_async_deletion_queue_consumer_4_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:55:35.670420 +0300    1495828535  leo_mq_server:handle_call/3 287 {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:55:35.851418 +0300    1495828535  null:null   0   gen_fsm leo_async_deletion_queue_consumer_3_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:55:35.964534 +0300    1495828535  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:55:35.966858 +0300    1495828535  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:55:35.967659 +0300    1495828535  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.331.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:55:35.968591 +0300    1495828535  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.329.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:55:40.252705 +0300    1495828540  null:null   0   gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:55:40.273471 +0300    1495828540  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:55:40.274015 +0300    1495828540  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.333.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:55:45.382167 +0300    1495828545  null:null   0   gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:55:45.383698 +0300    1495828545  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:55:45.384491 +0300    1495828545  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.335.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:56:06.248006 +0300    1495828566  null:null   0   gen_fsm leo_async_deletion_queue_consumer_4_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:56:06.248610 +0300    1495828566  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:56:06.249397 +0300    1495828566  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.14436.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:56:06.618153 +0300    1495828566  null:null   0   gen_fsm leo_async_deletion_queue_consumer_3_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:56:06.618743 +0300    1495828566  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-26 22:56:06.619501 +0300    1495828566  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.14435.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:56:06.619996 +0300    1495828566  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.14435.0> exit with reason reached_max_restart_intensity in context shutdown
[E] storage_1@192.168.3.54  2017-05-26 22:56:06.620377 +0300    1495828566  null:null   0   Supervisor leo_redundant_manager_sup had child undefined started with leo_mq_sup:start_link() at <0.222.0> exit with reason shutdown in context child_terminated
[E] storage_1@192.168.3.54  2017-05-26 22:56:07.236507 +0300    1495828567  null:null   0   Error in process <0.14718.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:07.395666 +0300    1495828567  null:null   0   Error in process <0.14719.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:07.589406 +0300    1495828567  null:null   0   Error in process <0.14721.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:08.34491 +0300 1495828568  null:null   0   Error in process <0.14722.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:08.553459 +0300    1495828568  null:null   0   Error in process <0.14724.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:08.699552 +0300    1495828568  null:null   0   Error in process <0.14726.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:08.750870 +0300    1495828568  null:null   0   Error in process <0.14727.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:09.395709 +0300    1495828569  null:null   0   Error in process <0.14741.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:09.429783 +0300    1495828569  null:null   0   Error in process <0.14742.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:09.536674 +0300    1495828569  null:null   0   Error in process <0.14743.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:09.670552 +0300    1495828569  null:null   0   Error in process <0.14748.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:10.239008 +0300    1495828570  null:null   0   Error in process <0.14754.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:10.395451 +0300    1495828570  null:null   0   Error in process <0.14755.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:10.872669 +0300    1495828570  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54  2017-05-26 22:56:11.79527 +0300 1495828571  null:null   0   Error in process <0.14758.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:11.89153 +0300 1495828571  null:null   0   Error in process <0.14760.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:11.93206 +0300 1495828571  null:null   0   Error in process <0.14761.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:11.291948 +0300    1495828571  null:null   0   Error in process <0.14762.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:11.336069 +0300    1495828571  null:null   0   Error in process <0.14763.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:11.608531 +0300    1495828571  null:null   0   Error in process <0.14769.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:12.78531 +0300 1495828572  null:null   0   Error in process <0.14770.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:12.461563 +0300    1495828572  null:null   0   Error in process <0.14772.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:12.689473 +0300    1495828572  null:null   0   Error in process <0.14773.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:12.812491 +0300    1495828572  null:null   0   Error in process <0.14774.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:14.902513 +0300    1495828574  null:null   0   Error in process <0.14793.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:15.250434 +0300    1495828575  null:null   0   Error in process <0.14800.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:15.266418 +0300    1495828575  null:null   0   Error in process <0.14801.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}

[E] storage_1@192.168.3.54  2017-05-26 22:56:15.873589 +0300    1495828575  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

(the last line starts to repeat at this point)

Note that the empty lines in the log file are really there. In other words, the current problems are:

1) Queue processing still freezes without any directly related errors (as shown by storage_2).
2) Something scary is going on on storage_1 (EDIT: fixed names of nodes).
3) The delete process isn't finished and there are still objects in the bucket (however, I assume this is expected, given the badargs in eleveldb,async_iterator early on).

Problem 3) is probably fine for now, given that https://github.com/leo-project/leofs/issues/725#issuecomment-302606104 isn't implemented yet, but 1) and 2) worry me, as I don't see any currently open issues related to these problems...
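
To summarize my reading of the storage_1 logs above: the leo_async_deletion_queue consumers keep dying on {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}, the supervisor eventually hits reached_max_restart_intensity and shuts the whole leo_mq_sup tree down, and from then on every leo_mq_api:decrease/3 notification lands on a consumer name that no longer exists, which would explain the flood of badarg in gen_fsm:send_event/2. A minimal sketch of that last failure mode (the module and registered name below are made up for illustration, not the real LeoFS ones):

%% dead_fsm_demo.erl - minimal sketch with illustrative names only:
%% sending an event to a registered gen_fsm name that is no longer alive
%% fails with badarg, the same {badarg,[{gen_fsm,send_event,2,...}]}
%% pattern seen in the error log above.
-module(dead_fsm_demo).
-export([run/0]).

run() ->
    %% Nothing is registered under this name, so the underlying
    %% `Name ! {'$gen_event', Event}` send raises badarg.
    Result = (catch gen_fsm:send_event(missing_consumer_fsm, {decrease, 1})),
    %% Result is {'EXIT', {badarg, _Stack}} instead of the usual 'ok'.
    Result.

If that reading is right, the badarg flood is a symptom of the consumers being gone rather than a separate bug.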

yosukehara commented 7 years ago

I've updated the diagram of the deletion bucket processing, which covers https://github.com/leo-project/leofs/issues/725#issuecomment-302606104

(diagram: leofs-deletion-bucket-proc)

mocchira commented 7 years ago

@yosukehara Thanks for updating and taking my comments into account.

Some comments.

It seems the other concerns have been covered by the above diagram. Thanks for your hard work.

mocchira commented 7 years ago

@vstax thanks for testing.

As you suspected, problems 1 and 2 give me the impression there is something we have not covered yet. I will dig in further later (I now have one hypothesis that could explain problems 1 and 2).

Note: if you find reached_max_restart_intensity in an error.log, it means something has gone pretty badly wrong (some Erlang processes that are supposed to exist have gone down permanently because the number of restarts exceeded a certain threshold within a specific time window). Please restart the server if you face such a case in production. We'd like to tackle this problem (e.g. restarting automatically without human intervention) as another issue.
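
For reference, this is standard OTP supervisor behaviour; a rough sketch (the values and the child module are placeholders, not the actual leo_mq_sup settings) of how a supervisor gives up once its restart intensity is exceeded:

%% restart_intensity_demo.erl - rough sketch, not the real leo_mq_sup:
%% if the child crashes more than `intensity` times within `period`
%% seconds, the supervisor terminates its children and exits with reason
%% shutdown, which is what the log above reports as
%% "... exit with reason reached_max_restart_intensity in context shutdown".
-module(restart_intensity_demo).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy  => one_for_one,
                 intensity => 5,   %% at most 5 restarts ...
                 period    => 10}, %% ... within any 10-second window
    %% `flaky_worker` is a placeholder child module for illustration.
    Child = #{id      => flaky_worker,
              start   => {flaky_worker, start_link, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.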

mocchira commented 7 years ago

@vstax I guess #744 could be the root cause of problems 1 and 2 here, so could you try the same test with the latest leo_mq if you can spare the time?

vstax commented 7 years ago

@mocchira I did the tests with the leo_mq "devel" version. Unfortunately, the results are still not good (though maybe better than before? It feels somewhat better, at least the logs seem less scary, but that might be due to something else).

At first I was restarting the cluster. I upgraded and restarted storage_2 after storage_1, which was already running the new version. So as storage_1 tried to process its queue, it ran into problems because of #728:

[W] storage_1@192.168.3.54  2017-05-30 18:59:24.508983 +0300    1496159964  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',{'EXIT',{badarg,[{ets,lookup,[leo_env_values,{env,leo_redundant_manager,server_type}],[]},{leo_misc,get_env,3,[{file,"src/leo_misc.erl"},{line,127}]},{leo_redundant_manager_api,table_info,1,[{file,"src/leo_redundant_manager_api.erl"},{line,1218}]},{leo_redundant_manager_api,checksum,1,[{file,"src/leo_redundant_manager_api.erl"},{line,368}]},{rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,206}]}]}}}
[W] storage_1@192.168.3.54  2017-05-30 18:59:29.546031 +0300    1496159969  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 18:59:34.549199 +0300    1496159974  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 18:59:39.566360 +0300    1496159979  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 18:59:44.571165 +0300    1496159984  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 18:59:49.576512 +0300    1496159989  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 18:59:51.140253 +0300    1496159991  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/34/33/8b/34338b4623dbdc08681b9d4c2697835cc8d5dba2046342060b1acdbf3c90308c98f3adb2de4d65a467a9d75591cd3e6db0c27b0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-30 18:59:52.137261 +0300    1496159992  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/34/33/8b/34338b4623dbdc08681b9d4c2697835cc8d5dba2046342060b1acdbf3c90308c98f3adb2de4d65a467a9d75591cd3e6db0c27b0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-30 18:59:53.287177 +0300    1496159993  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/10/0f/24/100f2430737d73d12a8c26fdb3cf2cae1fd5c2d3a72e11889842ccaa62ee59be0ddb3a9634110c65672b230966f898d47ca7000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-30 18:59:54.272404 +0300    1496159994  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/10/0f/24/100f2430737d73d12a8c26fdb3cf2cae1fd5c2d3a72e11889842ccaa62ee59be0ddb3a9634110c65672b230966f898d47ca7000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-30 19:00:04.629192 +0300    1496160004  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 19:00:09.635931 +0300    1496160009  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[E] storage_1@192.168.3.54  2017-05-30 19:00:42.642626 +0300    1496160042  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.323.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-30 19:00:42.649184 +0300    1496160042  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.319.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-30 19:00:45.46415 +0300 1496160045  null:null   0   gen_fsm leo_async_deletion_queue_consumer_3_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:00:45.46836 +0300 1496160045  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-30 19:00:45.47366 +0300 1496160045  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.321.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[W] storage_1@192.168.3.54  2017-05-30 19:00:47.401982 +0300    1496160047  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[E] storage_1@192.168.3.54  2017-05-30 19:00:52.385800 +0300    1496160052  null:null   0   gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:00:52.386242 +0300    1496160052  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-30 19:00:52.386837 +0300    1496160052  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.325.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[W] storage_1@192.168.3.54  2017-05-30 19:01:07.572285 +0300    1496160067  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 19:01:12.584661 +0300    1496160072  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 19:01:17.630653 +0300    1496160077  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 19:01:22.636858 +0300    1496160082  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54  2017-05-30 19:01:37.729682 +0300    1496160097  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}

[skipped]

[W] storage_1@192.168.3.54  2017-05-30 19:02:12.840406 +0300    1496160132  leo_membership_cluster_local:compare_with_remote_chksum/3   405 {'storage_2@192.168.3.55',nodedown}

during which the queue was barely being consumed; this change took over a minute:

 leo_async_deletion_queue       |   running   | 121462         | 1600           | 500            | async deletion of objs
 leo_async_deletion_queue       |   running   | 121462         | 1600           | 500            | async deletion of objs
 leo_async_deletion_queue       |   running   | 121462         | 1600           | 500            | async deletion of objs
 leo_async_deletion_queue       |   running   | 121283         | 640            | 800            | async deletion of objs
 leo_async_deletion_queue       |   running   | 121283         | 320            | 900            | async deletion of objs
 leo_async_deletion_queue       |   running   | 121283         | 160            | 950            | async deletion of objs

This is, of course, a known problem; I only wanted to note that until #728 is fixed, doing a "delete bucket" operation with even a single node down - or restarting a node while the delete process is going on - seems to be extremely problematic.

After that, at around 19:02, storage_2 came back up, at which point the queue processing stopped entirely! There was nothing in the error log anymore. I waited for 3 minutes or so, and nothing was going on at all. The queue didn't want to process:

 leo_async_deletion_queue       |   idling    | 121227         | 0              | 1400           | async deletion of objs
 leo_async_deletion_queue       |   running   | 121227         | 0              | 1950           | async deletion of objs     
 leo_async_deletion_queue       |   idling    | 121227         | 0              | 2400           | async deletion of objs
 leo_async_deletion_queue       |   idling    | 121227         | 0              | 2650           | async deletion of objs

This is a problem, of course; no idea if it's yet another one or the same (queue processing froze), but this time it was caused by another node having been down for some time.

I restarted storage_1 at this point (~ at 19:05:03). The queue started to process (I was checking mq-stats every 5-10 seconds):

 leo_async_deletion_queue       |   running   | 120390         | 1600           | 500            | async deletion of objs
 leo_async_deletion_queue       |   running   | 119837         | 1600           | 500            | async deletion of objs
 leo_async_deletion_queue       |   running   | 118829         | 1600           | 500            | async deletion of objs
 leo_async_deletion_queue       |   running   | 115301         | 1280           | 600            | async deletion of objs
 leo_async_deletion_queue       |   running   | 113549         | 1120           | 650            | async deletion of objs
 leo_async_deletion_queue       |   running   | 112752         | 1120           | 650            | async deletion of objs
 leo_async_deletion_queue       |   running   | 110715         | 960            | 700            | async deletion of objs
 leo_async_deletion_queue       |   running   | 107220         | 480            | 850            | async deletion of objs
 leo_async_deletion_queue       |   running   | 106909         | 320            | 900            | async deletion of objs
 leo_async_deletion_queue       |   running   | 106429         | 320            | 900            | async deletion of objs

Around this time "mq-stats" started to freeze (that is, it took 5-7 seconds before I got a reply). A short time after, this appeared in the error log:

[E] storage_1@192.168.3.54  2017-05-30 19:06:45.586377 +0300    1496160405  leo_mq_server:handle_call/3 285 {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:06:45.587920 +0300    1496160405  null:null   0   gen_fsm leo_async_deletion_queue_consumer_2_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:06:45.618785 +0300    1496160405  null:null   0   gen_fsm leo_async_deletion_queue_consumer_3_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:06:46.448213 +0300    1496160406  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-30 19:06:46.450600 +0300    1496160406  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-30 19:06:46.451757 +0300    1496160406  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.313.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-30 19:06:46.453556 +0300    1496160406  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.311.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-30 19:06:55.715273 +0300    1496160415  null:null   0   gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:06:55.715597 +0300    1496160415  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-30 19:06:55.716086 +0300    1496160415  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.315.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated

Something apparently restarted in the node at this point; about 20 seconds later the node started to respond again, and when I got the next mq-stats results, the queue was consuming once again:

 leo_async_deletion_queue       |   running   | 91788          | 1120           | 650            | async deletion of objs
 leo_async_deletion_queue       |   running   | 83837          | 640            | 800            | async deletion of objs
 leo_async_deletion_queue       |   running   | 81511          | 480            | 850            | async deletion of objs
 leo_async_deletion_queue       |   running   | 76937          | 160            | 950            | async deletion of objs

Here mq-stats started to freeze again. A few seconds later this appeared in the log file:

[E] storage_1@192.168.3.54  2017-05-30 19:08:13.853216 +0300    1496160493  leo_mq_server:handle_call/3 285 {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:08:14.133559 +0300    1496160494  null:null   0   gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:08:14.134668 +0300    1496160494  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-30 19:08:14.145193 +0300    1496160494  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.11208.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-05-30 19:08:23.855101 +0300    1496160503  null:null   0   gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-30 19:08:23.855684 +0300    1496160503  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54  2017-05-30 19:08:23.856325 +0300    1496160503  null:null   0   Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.15227.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated

After a short time, something restarted again and the queue started consuming for a short while, but then stopped:

 leo_async_deletion_queue       |   running   | 68152          | 1280           | 600            | async deletion of objs
 leo_async_deletion_queue       |   running   | 59989          | 0              | 2100           | async deletion of objs
 leo_async_deletion_queue       | suspending  | 59989          | 0              | 1250           | async deletion of objs
 leo_async_deletion_queue       | suspending  | 59989          | 0              | 1250           | async deletion of objs
 leo_async_deletion_queue       |   idling    | 59989          | 0              | 1550           | async deletion of objs
 leo_async_deletion_queue       |   idling    | 59989          | 0              | 1550           | async deletion of objs
 leo_async_deletion_queue       |   idling    | 59989          | 0              | 1850           | async deletion of objs
 leo_async_deletion_queue       |   idling    | 59989          | 0              | 2500           | async deletion of objs

This time the queue stopped processing for good, and there were no errors in the logs at all. There is nothing in the error logs of the other nodes either (except for the obvious errors when I restarted storage_1). One thing I'm noticing is that it generally takes roughly the same time (1:10 - 1:30) before the problem appears.

mocchira commented 7 years ago

@vstax Thanks for the additional try. It seems the leo_backend_db used by leo_mq got stuck for some reason. Besides normal operations, one possible reason for leo_backend_db getting stuck is too many orders coming from leo_watchdog.

To confirm that the above assumption is correct, could you do the same test with leo_watchdog disabled? I will vet other possibilities.

Also, https://github.com/leo-project/leofs/issues/746 should mitigate this problem, so please try again after the PR for #746 is merged into develop.

Edit: The PR for #746 has now been merged into develop.

vstax commented 7 years ago

@mocchira Thank you for the advice! The results are amazing: problems 1 and 2 are gone completely once I turned off the disk watchdog. I remember the watchdog causing problems in the past during upload experiments, but once you implemented https://github.com/leo-project/leo_watchdog/commit/8a30a1730ea376439b6764e02d9c875996629d39 they disappeared and I was able to do massive parallel uploads with the watchdog enabled, so I kind of left it like that, thinking it shouldn't create any more problems. Apparently I was wrong; it affects these massive deletes as well.

I should note that the disk utilization watchdog never triggered, but I had the disk capacity watchdog trigger all the time on storage_2 and storage_1 (I tried moving the thresholds before to get rid of it, but it didn't work; I've filed #747 about it now).

With the disk watchdog disabled (note that I haven't tried the very latest leo_mq with the fix for #746 yet), the queues are processed very smoothly and always drain to 0. They only show the "running" and "idling" states now, and the batch size / interval don't fluctuate anymore, staying at 1600 / 500.

As an interesting note, the badarg from eleveldb that previously happened soon after the initial "delete bucket" request - after which the delete queues stopped growing and started consuming - doesn't happen anymore either. But the queues still behave like that: they grow fast to a certain point (70-90K messages on each node), then start consuming. So apparently that badarg wasn't the cause of the "delete bucket" operation ceasing to add messages to the queue; it must be something else. The only errors now are, on the gateway:

[W] gateway_0@192.168.3.52  2017-05-31 18:39:02.156972 +0300    1496245142  leo_gateway_s3_api:delete_bucket_2/3    1798    [{cause,timeout}]
[W] gateway_0@192.168.3.52  2017-05-31 18:39:07.158998 +0300    1496245147  leo_gateway_s3_api:delete_bucket_2/3    1798    [{cause,timeout}]

It always happens twice, the second one 5 seconds after the first, even though I only send the delete request once and then terminate the client.

And typical ones for storage nodes:

[W] storage_0@192.168.3.53  2017-05-31 18:40:44.960991 +0300    1496245244  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/3a/fd/9b/3afd9b09a43527084a2d099b31989faff4755d2e31a99f0823d3d1501592513dedd1defe1b25cb1aa7e061cdb7d5c47e0400000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-05-31 18:40:45.961997 +0300    1496245245  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/3a/fd/9b/3afd9b09a43527084a2d099b31989faff4755d2e31a99f0823d3d1501592513dedd1defe1b25cb1aa7e061cdb7d5c47e0400000000000000.xz">>},{cause,timeout}]

and

[W] storage_1@192.168.3.54  2017-05-31 18:40:37.974512 +0300    1496245237  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-31 18:40:37.981806 +0300    1496245237  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_7,{get,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},30000]}}}]
[W] storage_1@192.168.3.54  2017-05-31 18:40:38.977943 +0300    1496245238  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-31 18:41:08.981477 +0300    1496245268  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/1d/43/9f/1d439fc40911b910064d4eba9f4d8c6f83cb6915a1e92da99170b47c996c200574106e99823b64c0fa15a0f973354cffa420010000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-05-31 18:41:09.981144 +0300    1496245269  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/1d/43/9f/1d439fc40911b910064d4eba9f4d8c6f83cb6915a1e92da99170b47c996c200574106e99823b64c0fa15a0f973354cffa420010000000000.xz">>},{cause,timeout}]

(here the second entry, with "replicate_fun", is a bit different from the others)

mocchira commented 7 years ago

@vstax Good to hear that works for you.

I remember the watchdog causing problems in the past during upload experiments, but once you implemented leo-project/leo_watchdog@8a30a17 they disappeared and I was able to do massive parallel uploads with the watchdog enabled, so I kind of left it like that, thinking it shouldn't create any more problems. Apparently I was wrong; it affects these massive deletes as well.

It's actually good for all of us, because we could find another culprit in leo_watchdog :) I will file a separate issue later about leo_watchdog for disk being able to overload leo_backend_db.

As an interesting note, the badarg from eleveldb that previously happened soon after the initial "delete bucket" request - after which the delete queues stopped growing and started consuming - doesn't happen anymore either. But the queues still behave like that: they grow fast to a certain point (70-90K messages on each node), then start consuming. So apparently that badarg wasn't the cause of the "delete bucket" operation ceasing to add messages to the queue; it must be something else. The only errors now are, on the gateway:

It's still not clear, but I have one assumption that could explain why delete-bucket operations can stop. Since the reason involves how the Erlang runtime behaves when an Erlang process has lots of messages in its mailbox, the detailed/precise explanation is rather complicated, so I'd like to cut a long story short.

In our delete-bucket case, Erlang processes that try to add messages to leo_mq can be blocked for a long time by the Erlang scheduler when the corresponding leo_mq_server process has lots of items in its mailbox. You can think of this mechanism as a kind of back-pressure algorithm that prevents a busy process from being overwhelmed by too many messages sent by others.
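
To make the idea concrete, here is a minimal, self-contained Erlang sketch (module and function names are made up; this is not LeoFS code). It shows explicit back-pressure - a producer backing off while the consumer's mailbox is long - which is the same effect the scheduler's implicit "sender punishment" has on processes that flood a congested mailbox:

%% Toy illustration only: throttle a producer when the consumer's mailbox
%% (message_queue_len) grows beyond a threshold, instead of flooding it.
-module(backpressure_demo).
-export([run/0]).

run() ->
    Consumer = spawn(fun consumer/0),
    produce(Consumer, 100000).

consumer() ->
    receive
        {delete, _Key} ->
            timer:sleep(1),          %% simulate slow deletion work
            consumer()
    end.

produce(_Consumer, 0) ->
    ok;
produce(Consumer, N) ->
    case erlang:process_info(Consumer, message_queue_len) of
        {message_queue_len, Len} when Len > 1000 ->
            timer:sleep(10),         %% back off while the mailbox is congested
            produce(Consumer, N);
        _ ->
            Consumer ! {delete, N},
            produce(Consumer, N - 1)
    end.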

It always happens twice, the second one 5 seconds after the first, even though I only send the delete request once and then terminate the client.

This will be solved once we finish implementing https://github.com/leo-project/leofs/issues/725#issuecomment-304567505.

Any error caused by a timeout can be thought of as the result of the scheduler punishment described above. We will try to make that happen as rarely as possible anyway.

mocchira commented 7 years ago

@vstax I'd like to ask you to do the same test with leo_watchdog_disk enabled after https://github.com/leo-project/leo_watchdog/pull/6 is merged into develop. That PR should fix the delete-bucket problem even with the disk watchdog enabled.

vstax commented 7 years ago

@mocchira Unfortunately, it will take me some time to do this, as it seems I need to wipe the whole cluster and fill it with data before trying again. I've tried removing another bucket, "body" (which has about the same number of objects as bodytest had, around 1M), but ran into a problem: first of all, s3cmd completes instantly:

$ s3cmd rb s3://body
Bucket 's3://body/' removed

HTTP log:

<- DELETE http://body.s3.amazonaws.com/ HTTP/1.1
-> HTTP/1.1 204 No Content

Then I get this error on each storage node:

[E] storage_1@192.168.3.54  2017-06-02 20:48:36.825216 +0300    1496425716  leo_backend_db_eleveldb:prefix_search/3 223 {badrecord,metadata_3}

and that's it. Nothing else happens; I can create the bucket again and it contains all the objects it did before.

E.g.

[root@leo-m0 ~]# /usr/local/bin/leofs-adm whereis body/eb/63/27/eb6327350a33926f4045d7138bec30fac791d4648265a4c387b03858c13a4931696a0058b7071810a4639a01d8de0c8c0079010000000000.xz
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | storage_2@192.168.3.55      | c5a3967eea6e948e4a2cd59a1b5c1866     |        47K |   9b271ea883 | false          |              0 | 550fcb2fa9862  | 2017-06-02 19:32:28 +0300
       | storage_1@192.168.3.54      | c5a3967eea6e948e4a2cd59a1b5c1866     |        47K |   9b271ea883 | false          |              0 | 550fcb2fa9862  | 2017-06-02 19:32:28 +0300

I've tried various settings and versions but always getting the same error.

"bodytest" seems to be no good anymore because I exhausted it, after a few tries there are no more objects in there (though I still get "timeout" twice on gateway and storage nodes produce CPU load for a minute or so - expected, I guess, at least until compaction is performed). Since removing "body" doesn't work at all, I'll need to fill cluster with whole new data.

Technically there should be no difference between the "body" and "bodytest" buckets, except that "bodytest" contains all objects in "subdirectories" like 00/01/01, while "body" contains that AND around 10K objects directly in the bucket without any "subdirectories" - which makes an ls operation on it impossible.

mocchira commented 7 years ago

@vstax

Thanks for trying. It turned out that you hit another issue when deleting the body bucket: objects created by LeoFS <= 1.3.2.1 are NOT removable with LeoFS >= 1.3.3 through a delete-bucket operation. I will file this issue later, so please give it another try once it is fixed.

EDIT: filed the issue on https://github.com/leo-project/leofs/issues/754

vstax commented 7 years ago

@mocchira Nice, thanks. I'll test this fix some time later, as I've moved on to other experiments (but I kept a copy of the data that gave me the problem with removing the bucket).

Regarding the fix from https://github.com/leo-project/leo_watchdog/pull/6 - it doesn't seem to do anything for me. The problem still persists on storage_1 and storage_2, which give the watchdog warning (a warning, not an error - like you described at #747). I have the latest leo_mq and leo_watchdog (0.12.8).

Unfiltered log of storage_1 looks like this:

[W] storage_1@192.168.3.54  2017-06-06 17:05:13.880756 +0300    1496757913  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205360440},{available,39413936},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_1@192.168.3.54  2017-06-06 17:05:23.894801 +0300    1496757923  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205360440},{available,39413936},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_1@192.168.3.54  2017-06-06 17:05:24.708635 +0300    1496757924  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-06-06 17:05:25.708490 +0300    1496757925  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-06-06 17:05:33.913248 +0300    1496757933  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205360440},{available,39413936},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_1@192.168.3.54  2017-06-06 17:05:43.921983 +0300    1496757943  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205360556},{available,39413820},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_1@192.168.3.54  2017-06-06 17:05:53.928851 +0300    1496757953  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205364092},{available,39410284},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]

(and so on)

Same for storage_2:

[W] storage_2@192.168.3.55  2017-06-06 17:04:43.514332 +0300    1496757883  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55  2017-06-06 17:04:53.526332 +0300    1496757893  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55  2017-06-06 17:05:03.536143 +0300    1496757903  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55  2017-06-06 17:05:12.791329 +0300    1496757912  leo_storage_handler_object:replicate_fun/3  1385    [{cause,"Could not get a metadata"}]
[E] storage_2@192.168.3.55  2017-06-06 17:05:12.791688 +0300    1496757912  leo_storage_handler_object:put/4    416 [{from,storage},{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{req_id,0},{cause,"Could not get a metadata"}]
[W] storage_2@192.168.3.55  2017-06-06 17:05:12.802626 +0300    1496757912  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-06-06 17:05:13.547962 +0300    1496757913  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55  2017-06-06 17:05:13.819000 +0300    1496757913  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-06-06 17:05:23.570099 +0300    1496757923  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55  2017-06-06 17:05:33.585550 +0300    1496757933  leo_watchdog_disk:check/4   307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]

On both of these nodes, leo_async_deletion_queue processing freezes soon after these "timeout" messages. On storage_0, for which the watchdog doesn't trigger, the queue processes fine despite the same timeouts:

[W] storage_0@192.168.3.53  2017-06-06 17:05:12.775529 +0300    1496757912  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-06-06 17:05:13.770076 +0300    1496757913  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{cause,timeout}]

This is reproduced every time. If I disable the watchdog on storage_1 and storage_2, they process messages fine, just like storage_0 (for which the disk space watchdog is enabled but does not trigger).

As a side note: the leofs-adm du <storage-node> command, which normally executes instantly, hangs and eventually times out during the delete bucket operation (the initial part, when the queues are filling). When the queues stop filling and start consuming, it works again. A bug / undocumented feature? I kind of thought it was a pretty lightweight operation, since normally you get a result right away even under load, but now I'm not so sure.

mocchira commented 7 years ago

@vstax Thanks for retrying.

Regarding the fix from leo-project/leo_watchdog#6 - it doesn't seem to do anything for me. The problem still persists on storage_1 and storage_2, which give the watchdog warning (a warning, not an error - like you described at #747). I have the latest leo_mq and leo_watchdog (0.12.8).

Got it. It seems there are still the same kind of problems in other places, so I will get back to vetting this.

As a side note: the leofs-adm du command, which normally executes instantly, hangs and eventually times out during the delete bucket operation (the initial part, when the queues are filling). When the queues stop filling and start consuming, it works again. A bug / undocumented feature? I kind of thought it was a pretty lightweight operation, since normally you get a result right away even under load, but now I'm not so sure.

Yes, it's a kind of known problem (any response from leofs-adm can get delayed when there are lots of tasks generated by background jobs), at least among devs. Those problems can be mitigated by fixing https://github.com/leo-project/leofs/issues/753; however, I will file it as another issue for sure. Thanks for reminding me of that.

vstax commented 7 years ago

@mocchira I've tested the fix for #754 and it works perfectly - I was able to delete the original "body" bucket. It took about 8-10 create+delete bucket operations, anyway. Just like before, each delete generates 70-90K delete messages per storage node, so with 3 storage nodes, 2 copies and 1M objects it's (currently) supposed to be like that - the number of messages generated before the operation stops is quite stable.

It's a pretty slow operation overall, so I tried to observe it a bit, and one interesting thing I noticed is that deletes go through the AVS files in strict order: first all objects in 0.avs are found / marked as deleted, then it moves on to 1.avs, and so on. This confused me a bit - is it by design / for simplicity? This operation is not disk-bound anyway, though (I can see it being CPU-bound by the "Eleveldb" threads, even though these threads don't really reach 100% CPU on average. It kind of feels like there are internal locks preventing it from going faster, or not enough processing threads, or something like that).

Though, of course, I understand that there is no real need for the "delete bucket" operation to be high-performance, as long as it's reliable. Just a random observation.

Yes it's kind of known problem (Any response from leofs-adm can get delayed when there are lots of tasks generated by background jobs) at least among devs.

The reason I asked about the du operation is that other responses from leofs-adm are not delayed. This is quite unlike the problem with the watchdog/leo_backend_db, which makes the node use a lot of CPU, doesn't let you get results from the mq-stats operation, and so on. While the first stage of delete-bucket (filling the queues) is going on, the node is very responsive overall; I get instant and precise numbers from the mq-stats operation as well. It's only du that seems to have a problem during this time. I also get the (maybe wrong) feeling that it doesn't work at all until this stage is over, so it's not a simple delay. Right after the queue stops filling, du starts to respond instantly again, even though the node is busy consuming these messages. So I thought this might be something else.

mocchira commented 7 years ago

@vstax

It's a pretty slow operation overall, so I tried to observe it a bit, and one interesting thing I noticed is that deletes go through the AVS files in strict order: first all objects in 0.avs are found / marked as deleted, then it moves on to 1.avs, and so on. This confused me a bit - is it by design / for simplicity? This operation is not disk-bound anyway, though (I can see it being CPU-bound by the "Eleveldb" threads, even though these threads don't really reach 100% CPU on average. It kind of feels like there are internal locks preventing it from going faster, or not enough processing threads, or something like that).

What roughly happens behind the scenes is:

  1. Iterate eleveldb to retrieve the objects that belong to the deleted bucket (Sequential Reads)
  2. Put each object's identity found in step 1 into the async-deletion queue (Appends)
  3. Consume the items stored in the async-deletion queue and delete the corresponding objects (Sequential Reads and Appends)

and since there is a one-to-one relationship between each AVS file and its metadata (managed by eleveldb), deletes end up happening from the first AVS file to the last, one by one. (Yes, it's expected.)
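
For illustration, here is a minimal, self-contained Erlang sketch of that enqueue/consume flow (module and function names are invented, and the metadata scan and the queue are simulated with plain lists and the queue module, so this is not the actual leo_storage code):

%% Toy sketch of the delete-bucket flow described above.
-module(del_bucket_sketch).
-export([run/2]).

%% run(<<"bodytest">>, AllKeys) walks the three phases described above.
run(Bucket, AllKeys) ->
    Prefix = <<Bucket/binary, "/">>,
    %% 1. iterate the "metadata" and keep only keys under the bucket prefix
    Victims = [K || K <- AllKeys,
                    binary:longest_common_prefix([K, Prefix]) =:= byte_size(Prefix)],
    %% 2. append them to the async-deletion queue
    Q = lists:foldl(fun(K, Acc) -> queue:in({delete, K}, Acc) end,
                    queue:new(), Victims),
    %% 3. consume the queue batch by batch (cf. the batch sizes shown by mq-stats)
    consume(Q, 1600).

consume(Q, BatchSize) ->
    case queue:is_empty(Q) of
        true  -> ok;
        false ->
            {Batch, Rest} = take(Q, BatchSize, []),
            [delete_object(K) || {delete, K} <- Batch],
            consume(Rest, BatchSize)
    end.

take(Q, 0, Acc) -> {lists:reverse(Acc), Q};
take(Q, N, Acc) ->
    case queue:out(Q) of
        {{value, Item}, Q2} -> take(Q2, N - 1, [Item | Acc]);
        {empty, Q2}         -> {lists:reverse(Acc), Q2}
    end.

%% stand-in for the real delete against the AVS file / metadata
delete_object(_Key) -> ok.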

I think the reason it's slow might be the sleep that intentionally happens at regular intervals, according to the configuration here https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_storage/priv/leo_storage.conf#L236-L243, in order to reduce the load generated by background jobs.

So could you retry with the configuration tweaked to shorter intervals?

While the first stage of delete-bucket (filling the queues) is going on, the node is very responsive overall; I get instant and precise numbers from the mq-stats operation as well. It's only du that seems to have a problem during this time. I also get the (maybe wrong) feeling that it doesn't work at all until this stage is over, so it's not a simple delay. Right after the queue stops filling, du starts to respond instantly again, even though the node is busy consuming these messages. So I thought this might be something else.

Obviously something is going wrong! I will vet it.

mocchira commented 7 years ago

@vstax I found out why du can get stuck and filed it as https://github.com/leo-project/leofs/issues/758.

vstax commented 7 years ago

@mocchira Nice, thank you. Our monitoring executes du and compact-status to collect detailed node statistics, as this information (ratio of active size, compaction status and compaction start date) is not available over SNMP, so it's good to know that when it doesn't produce any results, nothing is seriously broken and it's a known issue.

Regarding sequential AVS processing: processing AVS files in the same directory in order is hardly a problem. I was more interested in how it processes files on multiple JBOD drives on a real storage node, where each drive has its own directory for AVS files; if it doesn't utilize all drives in parallel, that might be somewhat of a problem under certain conditions (or might not be a problem, since I don't really see any IO load during any stage of the delete-bucket operation). But as we are not running LeoFS in production yet, I just wondered about it beforehand.

Regarding the sleep interval: strangely enough, it doesn't seem to be the reason it's slow. I reduced the interval by a factor of 10 and the speed is the same. I then reduced the intervals by a factor of 500 from the original, setting "regular" to 1 msec, and it's still the same. Example of monitoring the queue once per second with the sleep reduced to 1 msec:

[root@leo-m0 ~]# while sleep 1; do /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_; done
 leo_async_deletion_queue       |   running   | 58261          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 57861          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 57861          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 57461          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 57061          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 57061          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 56661          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 56661          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 56661          | 1600           | 1              | async deletion of objs
 leo_async_deletion_queue       |   running   | 56261          | 1600           | 1              | async deletion of objs

Top output at that time for beam.smp (per-thread):

top - 17:34:48 up 20:45,  1 user,  load average: 1,02, 0,57, 0,29
Threads: 131 total,   0 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu0  :  8,5 us,  2,8 sy,  0,0 ni, 48,4 id,  0,0 wa,  0,0 hi,  0,4 si, 39,9 st
%Cpu1  :  2,3 us,  1,0 sy,  0,0 ni, 89,8 id,  0,0 wa,  0,0 hi,  0,0 si,  6,9 st
%Cpu2  :  8,1 us,  2,4 sy,  0,0 ni, 84,1 id,  0,0 wa,  0,0 hi,  0,0 si,  5,4 st
%Cpu3  :  4,0 us,  1,3 sy,  0,0 ni, 86,5 id,  0,0 wa,  0,0 hi,  0,0 si,  8,3 st
%Cpu4  :  0,3 us,  0,7 sy,  0,0 ni, 96,7 id,  0,0 wa,  0,0 hi,  0,0 si,  2,3 st
%Cpu5  :  7,5 us,  2,4 sy,  0,0 ni, 81,6 id,  0,0 wa,  0,0 hi,  0,0 si,  8,5 st
%Cpu6  :  1,3 us,  0,7 sy,  0,0 ni, 92,1 id,  1,0 wa,  0,0 hi,  0,0 si,  4,9 st
%Cpu7  :  0,0 us,  0,0 sy,  0,0 ni,100,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  8010004 total,   531088 free,   410228 used,  7068688 buff/cache
KiB Swap:  4194300 total,  4194296 free,        4 used.  7255088 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S P %CPU %MEM     TIME+ COMMAND
10550 leofs     20   0 5426228 219172  23572 S 0 39,2  2,7   1:23.18 1_scheduler
10638 leofs     20   0 5426228 219172  23572 S 1 27,2  2,7   1:50.93 Eleveldb
10558 leofs     20   0 5426228 219172  23572 S 1 17,6  2,7   0:37.84 aux
10551 leofs     20   0 5426228 219172  23572 S 6  7,2  2,7   0:44.62 2_scheduler
10526 leofs     20   0 5426228 219172  23572 S 6  4,1  2,7   0:05.02 async_10
10631 leofs     20   0 5426228 219172  23572 S 3  1,6  2,7   0:21.74 Eleveldb
10552 leofs     20   0 5426228 219172  23572 S 5  1,6  2,7   0:24.70 3_scheduler
10639 leofs     20   0 5426228 219172  23572 S 3  1,5  2,7   0:08.08 Eleveldb
10621 leofs     20   0 5426228 219172  23572 S 6  0,0  2,7   0:00.16 Eleveldb
10452 leofs     20   0 5426228 219172  23572 S 2  0,0  2,7   0:00.03 beam.smp
10515 leofs     20   0 5426228 219172  23572 S 3  0,0  2,7   0:00.00 sys_sig_dispatc
10516 leofs     20   0 5426228 219172  23572 S 6  0,0  2,7   0:00.00 sys_msg_dispatc

Another example:

  PID USER      PR  NI    VIRT    RES    SHR S P %CPU %MEM     TIME+ COMMAND                                                                                                                  
12638 leofs     20   0 5349500 212376  57516 R 5 19,9  2,7   0:45.31 aux                                                                                                                      
12631 leofs     20   0 5349500 212376  57516 R 0 19,3  2,7   0:50.50 2_scheduler                                                                                                              
12632 leofs     20   0 5349500 212376  57516 S 4 18,9  2,7   0:11.33 3_scheduler                                                                                                              
12718 leofs     20   0 5349500 212376  57516 S 3 16,3  2,7   0:44.68 Eleveldb                                                                                                                 
12630 leofs     20   0 5349500 212376  57516 S 0 14,6  2,7   1:25.27 1_scheduler                                                                                                              
12711 leofs     20   0 5349500 212376  57516 S 1  9,3  2,7   0:23.23 Eleveldb                                                                                                                 
12616 leofs     20   0 5349500 212376  57516 S 5  7,0  2,7   0:05.72 async_20                                                                                                                 
12704 leofs     20   0 5349500 212376  57516 S 1  5,3  2,7   0:04.01 Eleveldb                                                                                                                 
12712 leofs     20   0 5349500 212376  57516 S 1  1,3  2,7   0:05.32 Eleveldb                                                                                                                 
12719 leofs     20   0 5349500 212376  57516 S 5  1,0  2,7   0:13.96 Eleveldb                                                                                                                 
12633 leofs     20   0 5349500 212376  57516 S 4  0,7  2,7   0:00.56 4_scheduler                                                                                                              

It always looks something like that during queue processing - the 1/2/3_scheduler and aux threads consume most of the CPU. I've also gathered a leo_doctor log here: https://pastebin.com/bBT5mLLA

But, well, like I said, it probably doesn't matter that much for now so I don't think you should worry about it (I'm writing about it in detail just in case you might spot some anomaly caused by some problem related to this ticket).

mocchira commented 7 years ago

@vstax

Thanks for your detailed report. The results gathered by leo_doctor revealed that delete-bucket can slow down due to imbalanced items stored in the async_deletion_queue, which causes queue consumers to get stuck more than necessary. Once the first phase of a delete-bucket is done, the items stored in the async_deletion_queue look like this:

| **ALL** items belonging to metadata_0 | **ALL** items belonging to metadata_1 | ... | **ALL** items belonging to metadata_n|
^ head

ALL items belonging to metadata_N are converged into ONE big chunk. The second phase, in which the queue consumers pop items and delete the corresponding objects, then works like this:

consumer_0            consumer_1            consumer_2            consumer_3
     |                    |                     |                    |
     -----------------------------------------------------------------
                                         | <--- congestion could happen here
                             metadata_N, object_storage_N

To solve this issue, we may have to iterate the metadata in parallel and produce items distributed evenly across each metadata. I will file this as another issue later on. Thanks again.
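
As a rough illustration of "distributed evenly" (a toy sketch under my own assumptions, not the planned implementation): interleaving the per-metadata key lists round-robin would make consecutive queue items hit different metadata/AVS pairs instead of forming one big per-metadata chunk:

-module(interleave_sketch).
-export([interleave/1]).

%% interleave([[a1,a2],[b1],[c1,c2,c3]]) -> [a1,b1,c1,a2,c2,c3]
interleave(Lists) ->
    interleave(Lists, []).

interleave([], Acc) ->
    lists:reverse(Acc);
interleave(Lists, Acc) ->
    %% take one head from every non-empty list per round
    {Heads, Tails} =
        lists:foldr(fun([H | T], {Hs, Ts}) -> {[H | Hs], [T | Ts]};
                       ([],      {Hs, Ts}) -> {Hs, Ts}
                    end, {[], []}, Lists),
    NonEmpty = [L || L <- Tails, L =/= []],
    interleave(NonEmpty, lists:reverse(Heads) ++ Acc).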

mocchira commented 7 years ago

@vstax

This is reproduced every time. If I disable the watchdog on storage_1 and storage_2, they process messages fine, just like storage_0 (for which the disk space watchdog is enabled but does not trigger). Got it. It seems there are still the same kind of problems in other places, so I will get back to vetting this.

It turned out that the overload problem is already gone, and there is another beast that probably causes your problem, filed here: https://github.com/leo-project/leofs/issues/776.

mocchira commented 7 years ago

Let me summarize the remaining issues here.

Please let me know if I'm missing something.

vstax commented 7 years ago

@mocchira Thank you, yes, you are quite right, and it's indeed #776 that happens if the disk watchdog is enabled and the system has >80% used disk space (and it can't be tweaked through config right now, because that hardcoded value creates this problem as well, I think).

I might be nitpicking, but there is still one thing that bothers me: you describe this problem as the batch size being reduced to 0 so that queue processing stops. First question: isn't it bad in general to stop processing completely? What if some other trigger in a different watchdog - say, high CPU usage (maybe caused by something else) - does the same? Might a limit on how much the watchdog can reduce the batch size be a good idea, so it can never reduce it to 0? Instead, if the batch size is already at some defined minimum and the watchdog triggers, it could output a big fat warning in the log files that something seems to be really wrong. I'm just wondering whether this (a safe limit) would be more productive from an operating perspective than stopping all processing.

Second question: there were errors in logs, e.g. from https://github.com/leo-project/leofs/issues/725#issuecomment-304376426

[E] storage_1@192.168.3.54  2017-05-26 22:50:53.757233 +0300    1495828253  null:null   0   gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54  2017-05-26 22:50:54.262824 +0300    1495828254  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]

and

[E] storage_1@192.168.3.54  2017-05-26 22:51:35.367975 +0300    1495828295  leo_watchdog_sub:handle_info/2  165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}

Both of these messages can repeat a lot (the second one, with '-decrease/3-lc$^0/1-0-', endlessly, actually). Isn't that an indication of some problem by itself? I mean, the queue being at 0 and processing stopped is one thing, but some parts of the system seem to try to keep working under these conditions and run into problems, instead of detecting that their actions are impossible.

vstax commented 7 years ago

@mocchira Regarding the issues referenced here: there is (obviously) the main issue that you described at https://github.com/leo-project/leofs/issues/725#issuecomment-302606104, but after it's done there is a need to check that all the related sub-issues are gone, i.e.:

  1. handling the fact that client doesn't get reply to "delete bucket" and tries to repeat this operation a few times
  2. handling the fact that (currently) gateway gets timeouts and issues multiple delete bucket operations as well
  3. checking that "delete bucket" actually deletes everything (as right now it doesn't), including across node restarts and such
  4. creation of a bucket that is being deleted shouldn't be allowed - however! - this can bring its own set of problems; e.g. right now I kind of "continue" a delete-bucket operation by creating the same bucket and deleting it again, and if some other problems still existed but the "create bucket" operation were blocked, it couldn't be done this way.

(you've mentioned these problems before, I'm just writing them here for the checklist that you are creating)

I know of yet another issue, though I haven't reported it separately because I haven't done experiments on it yet. It's the one mentioned in #763:

[I] storage_1@192.168.3.54  2017-06-09 21:56:35.824217 +0300    1497034595  leo_compact_fsm_worker:running/2    392 {leo_compact_worker_7,{timeout,{gen_server,call,[leo_metadata_7,{put_value_to_new_db,<<"bodycopy/6b/e7/57/6be75707dc8329a3d120035c2ef2b28dbdc0a9bca7128d741df9cf7719f1743be6d587aa166409683b37358400c5f8efe814050000000000.xz">>,<<131,104,22,100,0,10,109,101,116,97,100,97,116,97,95,51,109,0,0,0,133,98,111,100,121,99,111,112,121,47,54,98,47,101,55,47,53,55,47,54,98,101,55,53,55,48,55,100,99,56,51,50,57,97,51,100,49,50,48,48,51,53,99,50,101,102,50,98,50,56,100,98,100,99,48,97,57,98,99,97,55,49,50,56,100,55,52,49,100,102,57,99,102,55,55,49,57,102,49,55,52,51,98,101,54,100,53,56,55,97,97,49,54,54,52,48,57,54,56,51,98,51,55,51,53,56,52,48,48,99,53,102,56,101,102,101,56,49,52,48,53,48,48,48,48,48,48,48,48,48,48,46,120,122,110,16,0,11,33,243,74,102,221,228,126,48,67,103,248,244,106,57,13,97,133,98,0,1,147,80,109,0,0,0,0,97,0,97,0,97,0,97,0,110,5,0,169,143,253,173,1,110,7,0,96,173,189,176,103,81,5,110,5,0,32,18,173,210,14,110,16,0,67,255,81,173,159,204,109,0,182,231,93,11,157,210,224,227,97,0,100,0,9,117,110,100,101,102,105,110,101,100,97,0,97,0,97,0,97,0,97,0,97,0>>},30000]}}}

I have reason to believe that when "delete bucket" is processing objects from some AVS file (at least in the current, non-parallel version, until #764 is implemented), compaction for that AVS file, if it is running at that moment, will fail. At the very least I know it happened for me on both nodes that were deleting objects from 7.avs - compaction for that file failed on both of them, with an info (!) message, not an error message like the one above. I don't actually have evidence that this message is a symptom of compaction failing; it's more like:

delete bucket + compaction going through the same file =>
    this timeout happens =>
        compaction fails

but I have no evidence yet that the reason and the cause are actually like that. I think the deletion is in the stage where objects listed in the queue are actually deleted from the bucket, though.

mocchira commented 7 years ago

@vstax

I might be nitpicking, but there is still one thing that bothers me: you describe this problem as the batch size being reduced to 0 so that queue processing stops. First question: isn't it bad in general to stop processing completely? What if some other trigger in a different watchdog - say, high CPU usage (maybe caused by something else) - does the same? Might a limit on how much the watchdog can reduce the batch size be a good idea, so it can never reduce it to 0? Instead, if the batch size is already at some defined minimum and the watchdog triggers, it could output a big fat warning in the log files that something seems to be really wrong. I'm just wondering whether this (a safe limit) would be more productive from an operating perspective than stopping all processing.

Good point. We had configurable minimum settings in the past; however, those went away for some reason (I can't remember why off the top of my head). It might be time to take another look at this idea.
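
For the record, the "safe floor" idea could look roughly like this (a hypothetical Erlang sketch, not existing LeoFS code; names are invented): the watchdog-driven decrease is clamped to a configured minimum, and hitting that minimum is logged loudly instead of dropping the batch size to 0:

%% Hypothetical sketch: never let the watchdog shrink the batch size below
%% a configured minimum, and make hitting the floor visible in the logs.
-module(batch_floor_sketch).
-export([decrease/3]).

%% decrease(CurrentBatch, Step, MinBatch) -> NewBatch
decrease(CurrentBatch, Step, MinBatch) ->
    case CurrentBatch - Step of
        New when New =< MinBatch ->
            error_logger:warning_msg(
              "mq batch size pinned at minimum ~p; watchdog keeps firing~n",
              [MinBatch]),
            MinBatch;
        New ->
            New
    end.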

Both of these messages can repeat a lot (the second one, with '-decrease/3-lc$^0/1-0-', endlessly, actually). Isn't that an indication of some problem by itself? I mean, the queue being at 0 and processing stopped is one thing, but some parts of the system seem to try to keep working under these conditions and run into problems, instead of detecting that their actions are impossible.

The reason those errors happened is https://github.com/leo-project/leofs/issues/764 (the leo_backend_db corresponding to the congested leo_mq_server couldn't respond to requests sent through leo_mq_api), so fixing #764 should make those errors less likely to happen. Regarding decrease/3 being called endlessly, I can't answer precisely as I'm not the original author; however, it seems some parts depend on the current behavior (decrease/3 and increase/3 being called endlessly). (Please correct me if any of this explanation is wrong, @yosukehara)

(you've mentioned these problems before, I'm just writing them here for the checklist that you are creating)

Thanks! that's really helpful to us.

I have reason to believe that when "delete bucket" is processing objects from some AVS file (at least in the current, non-parallel version, until #764 is implemented), compaction for that AVS file, if it is running at that moment, will fail. At the very least I know it happened for me on both nodes that were deleting objects from 7.avs - compaction for that file failed on both of them, with an info (!) message, not an error message like the one above. I don't actually have evidence that this message is a symptom of compaction failing; it's more like:

#764 also causes the failures you faced when invoking compact-start while delete-bucket is in progress. I should have mentioned how #764 affects the system in more detail. In short, ANY operation that needs to access an AVS/metadata pair congested by delete-bucket is highly likely to fail, because the mailbox of the corresponding Erlang process gets filled with lots of delete operations. That means not only compaction but also normal PUT/DELETE operations are likely to fail while delete-bucket is in progress. I'd consider raising its priority.
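
This is also why the crash reports throughout this issue all have the same shape. A minimal, self-contained Erlang sketch (toy module, not LeoFS code) of the pattern: a gen_server stuck on earlier work cannot answer a call within the 30-second window, so the caller exits with {timeout,{gen_server,call,[Name,Request,30000]}}:

%% Toy illustration of the timeout exits seen in the logs above.
-module(call_timeout_demo).
-behaviour(gen_server).
-export([start_link/0, status/0, slow_delete/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Same shape as the "status" call that keeps timing out in the crash reports.
status() ->
    gen_server:call(?MODULE, status, 30000).

%% Stands in for a burst of asynchronous delete messages filling the mailbox.
slow_delete(Key) ->
    gen_server:cast(?MODULE, {delete, Key}).

init([]) -> {ok, #{}}.

handle_call(status, _From, State) ->
    {reply, ok, State}.

handle_cast({delete, _Key}, State) ->
    timer:sleep(60000),   %% simulate a deletion that keeps the server busy
    {noreply, State}.

handle_info(_Msg, State) ->
    {noreply, State}.

Calling slow_delete(<<"k">>) followed by status() from a shell exits with {timeout,{gen_server,call,[call_timeout_demo,status,30000]}}, which is exactly the reason string in the consumer crash reports above.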

vstax commented 7 years ago

@mocchira I wanted to try how this works in the latest develop version (with leo_manager 1.3.5), but there seem to be complications. After restarting the cluster with the latest version I get this:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
[ERROR] Could not get records

Log files on both managers are full of these two errors that appear every 10 seconds:

[E] manager_0@192.168.3.50  2017-07-11 18:52:26.527367 +0300    1499788346  leo_manager_del_bucket_handler:handle_info/2, dequeue   219 [{cause,"Mnesia is not available"}]
[E] manager_0@192.168.3.50  2017-07-11 18:52:26.527879 +0300    1499788346  leo_manager_del_bucket_handler:handle_info/2, dequeue   229 [{cause,"Mnesia is not available"}]
[E] manager_0@192.168.3.50  2017-07-11 18:52:36.529233 +0300    1499788356  leo_manager_del_bucket_handler:handle_info/2, dequeue   219 [{cause,"Mnesia is not available"}]
[E] manager_0@192.168.3.50  2017-07-11 18:52:36.529698 +0300    1499788356  leo_manager_del_bucket_handler:handle_info/2, dequeue   229 [{cause,"Mnesia is not available"}]

I've tried restarting the managers in a different order and doing "start force-load" on the master, but it doesn't seem to change anything. The cluster seems to work fine otherwise. I can see lots of new queues in the "mq-stats" output for the storage nodes as well. There are no other interesting messages even with debug logs enabled.

mocchira commented 7 years ago

@vstax Thanks for trying. It turned out that the mnesia tables for the new delete-bucket implementation were not created in the version-upgrade case. We will push the fix later.

EDIT: https://github.com/leo-project/leofs/pull/785 - the fix for your problem has been merged into develop, so please give it a try.

vstax commented 7 years ago

@mocchira Thank you, this fix helped.

Regarding the delete bucket operation in general: I don't think it quite works on my system. The same test system: 3 nodes, N=2, D=1, 2+ million objects in the cluster (1M in the "bodytest" bucket, 1M in "bodycopy" and a small number in a few other buckets). It means that the delete bucket operation should remove roughly 650,000 objects on each node. The nodes are configured properly (num_of_mq_procs=4, debug logs disabled, disk watchdog disabled).

I execute "leofs-adm delete-bucket bodytest ". The state changes to "enqueuing":

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node                         | state            | timestamp
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55       | enqueuing        | 2017-07-12 22:01:19 +0300
storage_1@192.168.3.54       | enqueuing        | 2017-07-12 22:01:19 +0300
storage_0@192.168.3.53       | enqueuing        | 2017-07-12 22:01:19 +0300

info log on manager_0:

[I] manager_0@192.168.3.50  2017-07-12 20:26:56.909647 +0300    1499880416  leo_manager_del_bucket_handler:handle_call/3 - enqueue  128 [{"bucket_name",<<"bodytest">>},{"node",'storage_0@192.168.3.53'}]
[I] manager_0@192.168.3.50  2017-07-12 20:26:56.910019 +0300    1499880416  leo_manager_del_bucket_handler:handle_call/3 - enqueue  128 [{"bucket_name",<<"bodytest">>},{"node",'storage_1@192.168.3.54'}]
[I] manager_0@192.168.3.50  2017-07-12 20:26:56.910233 +0300    1499880416  leo_manager_del_bucket_handler:handle_call/3 - enqueue  128 [{"bucket_name",<<"bodytest">>},{"node",'storage_2@192.168.3.55'}]
[I] manager_0@192.168.3.50  2017-07-12 20:26:58.848183 +0300    1499880418  leo_manager_del_bucket_handler:notify_fun/3 264 [{"node",'storage_0@192.168.3.53'},{"bucket_name",<<"bodytest">>}]
[I] manager_0@192.168.3.50  2017-07-12 20:26:58.853656 +0300    1499880418  leo_manager_del_bucket_handler:notify_fun/3 264 [{"node",'storage_1@192.168.3.54'},{"bucket_name",<<"bodytest">>}]
[I] manager_0@192.168.3.50  2017-07-12 20:26:58.860125 +0300    1499880418  leo_manager_del_bucket_handler:notify_fun/3 264 [{"node",'storage_2@192.168.3.55'},{"bucket_name",<<"bodytest">>}]

All storage nodes get 120-150% CPU load, rarely peaking at 200-230%, with very small disk load; soon I can see messages appearing in the leo_delete_dir_queue_1 queue on each node (only in that queue). There are the usual timeout errors for delete operations in the storage nodes' log files.

At some point - a few minutes after the start of the operation - the number in leo_delete_dir_queue_1 stops growing. It stays fixed at some value on each node (these are the current numbers for storage_0, storage_1 and storage_2):

 leo_delete_dir_queue_1         |   idling    | 84294          | 1600           | 500            | deletion bucket #1
 leo_delete_dir_queue_1         |   idling    | 93829          | 1600           | 500            | deletion bucket #1
 leo_delete_dir_queue_1         |   idling    | 92810          | 1600           | 500            | deletion bucket #1

The load on the nodes stays about the same, and so do the errors. Then the leo_async_deletion_queue queue starts to grow slowly; 1 message, then 2 messages, at some point 10 messages or so. It usually grows at a rate of one message every few minutes. At some point it started to grow faster, gaining roughly +10 messages every minute or two.

Early part of the error log on storage_1 (all the later parts, and the logs on the other nodes, look about the same):

[E] storage_1@192.168.3.54  2017-07-12 20:27:30.346997 +0300    1499880450  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:27:38.896010 +0300    1499880458  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:27:39.894537 +0300    1499880459  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[E] storage_1@192.168.3.54  2017-07-12 20:28:00.459117 +0300    1499880480  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:09.901206 +0300    1499880489  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/18/0e/37/180e37e1a6f6351bcaf29e1aaa8c5caa0c2c8a41952867f2bf0e0fbfcde5de3f65f42bf33a6d352dd637c28ec640c5ba00a2790000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:10.902269 +0300    1499880490  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/18/0e/37/180e37e1a6f6351bcaf29e1aaa8c5caa0c2c8a41952867f2bf0e0fbfcde5de3f65f42bf33a6d352dd637c28ec640c5ba00a2790000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:19.854854 +0300    1499880499  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:19.857589 +0300    1499880499  leo_storage_replicator:replicate_fun/2243   [{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_0,{get,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},30000]}}}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:20.853112 +0300    1499880500  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[E] storage_1@192.168.3.54  2017-07-12 20:28:30.617920 +0300    1499880510  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:40.913576 +0300    1499880520  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/18/72/63/18726304f172a426fb2262362d53d4e252711bc0adf43fd9ca7a1ee5baeb6631f3828a17041908227b31420fe5ecba670600100000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:41.908985 +0300    1499880521  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/18/72/63/18726304f172a426fb2262362d53d4e252711bc0adf43fd9ca7a1ee5baeb6631f3828a17041908227b31420fe5ecba670600100000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:51.712962 +0300    1499880531  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-12 20:28:52.712510 +0300    1499880532  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]

Here, the message that contains "leo_storage_replicator:replicate_fun/2" is unique and happened only once on a single node. The rest of the messages (from leo_storage_replicator:replicate/5, leo_storage_replicator:loop/6 and leo_storage_handler_del_directory:insert_messages/3) repeat all the time on all nodes.

Info log:

[I] storage_1@192.168.3.54  2017-07-12 20:26:58.857286 +0300    1499880418  leo_storage_handler_del_directory:run/5141  [{"msg: enqueued",<<"bodytest">>}]
[I] storage_1@192.168.3.54  2017-07-12 20:27:30.347326 +0300    1499880450  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_1@192.168.3.54  2017-07-12 20:27:38.894033 +0300    1499880458  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{method,head},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{processing_time,30001}]
[I] storage_1@192.168.3.54  2017-07-12 20:28:00.459587 +0300    1499880480  leo_object_storage_event:handle_event/254   [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]

There are no other types of messages in the info log except these from "leo_object_storage_event:handle_event/2".

The problem: nothing else happens. Executing "du" on a storage node under this load is painful, but it eventually completes:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
 active number of objects: 1502466
  total number of objects: 1509955
   active size of objects: 213408666322
    total size of objects: 213421617475
     ratio of active size: 99.99%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

These are the same numbers as before the "delete bucket" operation, or almost the same. In other words, no objects seem to be getting deleted. With the old implementation the "ratio of active size" started to drop as soon as the delete queue started processing (1-2 minutes after the start of the "delete bucket" operation); here, two hours have passed but the storage nodes show the same object counts. It's the same for all nodes.
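
(For reference, a quick way to track deletion progress across nodes is to sample just the "ratio of active size" line from du in a loop; a minimal shell sketch using the node names from this thread:)

for node in storage_0@192.168.3.53 storage_1@192.168.3.54 storage_2@192.168.3.55; do
    # print the node name followed by its current active-size ratio
    printf '%s ' "${node}"
    /usr/local/bin/leofs-adm du "${node}" | grep 'ratio of active size'
done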

The status for all queues is "idling". Somehow leo_async_deletion_queue managed to get over 230 messages during this time:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep delet
 leo_async_deletion_queue       |   running   | 232            | 1600           | 500            | async deletion of objs
 leo_delete_dir_queue_1         |   idling    | 93828          | 1600           | 500            | deletion bucket #1
 leo_delete_dir_queue_2         |   idling    | 0              | 1600           | 500            | deletion bucket #2
 leo_delete_dir_queue_3         |   idling    | 0              | 1600           | 500            | deletion bucket #3
 leo_delete_dir_queue_4         |   idling    | 0              | 1600           | 500            | deletion bucket #4
 leo_delete_dir_queue_5         |   idling    | 0              | 1600           | 500            | deletion bucket #5
 leo_delete_dir_queue_6         |   idling    | 0              | 1600           | 500            | deletion bucket #6
 leo_delete_dir_queue_7         |   idling    | 0              | 1600           | 500            | deletion bucket #7
 leo_delete_dir_queue_8         |   idling    | 0              | 1600           | 500            | deletion bucket #8
 leo_req_delete_dir_queue       |   idling    | 0              | 1600           | 500            | request removing directories

Here are leo_doctor logs for storage_1: https://pastebin.com/mcm0AphX

mocchira commented 7 years ago

@vstax I was able to find the culprit thanks to your further testing. https://github.com/leo-project/leo_object_storage/pull/10 should fix your problem, so please give it another try after the PR gets merged.

vstax commented 7 years ago

@mocchira Thank you, this made a difference. After restarting the manager & storage nodes with the latest version, things started to move. (btw, shutting down these nodes with ~90K messages in the leo_delete_dir_queue_1 queue took over a minute per node, and starting up after that took 5 minutes or so. I don't think I've seen such shutdown & startup times even when I had problems with the watchdog before and was shutting down nodes with similar amounts in "frozen" leo_async_deletion_queue queues. There is "alarm_handler: {set,{system_memory_high_watermark,[]}}" logged in erlang.log during these 5 minutes of startup.)

Anyhow, after startup an extra 70-80K messages appeared in leo_delete_dir_queue_1, and a few messages in leo_async_deletion_queue, e.g.:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   running   | 9              | 1600           | 500            | async deletion of objs
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
 leo_delete_dir_queue_1         |   idling    | 146698         | 1600           | 500            | deletion bucket #1

(there is something strange, though: leo_async_deletion_queue keeps switching between running and idling here, but the number of messages in it stayed at 9 for the whole 10 minutes)

I was able to execute "du" and the ratio of active size was dropping. However, ten minutes later I ended up here:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
 leo_delete_dir_queue_1         |   idling    | 0              | 1600           | 500            | deletion bucket #1

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
 active number of objects: 1161526
  total number of objects: 1509955
   active size of objects: 165918882826
    total size of objects: 213421617475
     ratio of active size: 77.74%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node                         | node             | state                       
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55       | enqueuing        | 2017-07-13 18:22:59 +0300               
storage_1@192.168.3.54       | enqueuing        | 2017-07-13 18:22:59 +0300               
storage_0@192.168.3.53       | enqueuing        | 2017-07-13 18:22:59 +0300               

All queues on all nodes are at 0. No more objects are being removed - however, the final ratio of active size should end up in the 50-52% range (the "bodytest" bucket holds roughly half of the data in the cluster), so not all objects were removed from the bucket.

And... nothing else happens. There is nothing in the error / info logs of the manager nodes.

Error log from storage node:

[W] storage_1@192.168.3.54  2017-07-13 16:22:03.473647 +0300    1499952123  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/61/57/98/61579844256bbe76f6f05e7655b486b9cb5df369f15a28a980389f71eacd8ec9afa42c36451a504ee3e09322352ffce1388c170100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-13 16:22:03.677581 +0300    1499952123  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-13 16:22:04.472227 +0300    1499952124  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/61/57/98/61579844256bbe76f6f05e7655b486b9cb5df369f15a28a980389f71eacd8ec9afa42c36451a504ee3e09322352ffce1388c170100000000.xz">>},{cause,timeout}]

[skipped]

[W] storage_1@192.168.3.54  2017-07-13 16:33:49.576822 +0300    1499952829  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/a0/e7/ea/a0e7eafe1849e039b8e210c8dad8a8ea1272b2464926cc2091e711e47ec9af8e72ef4d462d2dd36a5e424f0fb19ff5bf1c62050000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-13 16:34:01.228851 +0300    1499952841  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/0b/cc/6a/0bcc6aa1cd433eecf3ce9cd7fde3c0055a52aa53359ebfebd01816f69b80e9fcf43b03bf4ddb452ed6d6082b6ade0e7108d2000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-13 16:34:02.226010 +0300    1499952842  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/0b/cc/6a/0bcc6aa1cd433eecf3ce9cd7fde3c0055a52aa53359ebfebd01816f69b80e9fcf43b03bf4ddb452ed6d6082b6ade0e7108d2000000000000.xz">>},{cause,timeout}]

All the skipped messages look just like these two types. It's the same on all other nodes. There is nothing in the info logs after the startup messages. Here, 16:21 is when the nodes finished starting up and 16:34 is around the time the processing stopped.

Looking at top, I can see CPU usage going from near-zero to 20-30% or so for 1-2 seconds from time to time, then back at 0 for a few seconds. Just in case, leo_doctor report: https://pastebin.com/yk5SXS1N

I waited for 2 hours after that, with no changes at all. Then I restarted one storage node (storage_1). I can see these short spikes in CPU usage again, but they are different - shorter (0.5-1 sec) and with higher usage, like 60-100% (yes, I know "top" can show quite unreliable values when updating too fast, but I have another working LeoFS cluster to compare with, so I can see that it's higher usage than normal). Here is the leo_doctor report for this state: https://pastebin.com/iiiW8yWV

I waited for 20 more minutes and nothing changed; I then restarted both manager nodes, and after that nothing changed either.

mocchira commented 7 years ago

@vstax Thanks for the further testing.

btw, shutting down these nodes with ~90K messages in the leo_delete_dir_queue_1 queue took over a minute per node, and starting up after that took 5 minutes or so. I don't think I've seen such shutdown & startup times even when I had problems with the watchdog before and was shutting down nodes with similar amounts in "frozen" leo_async_deletion_queue queues.

Since the previous version without my patch caused leo_delete_dir_queue_1 to keep generating lots of items as long as leo_storage was running (to be precise, fetching all deleted objects and inserting them into leo_delete_dir_queue_1 happened many times behind the scenes), lots of tombstones were generated in leveldb, so the leveldb compaction process got triggered and caused shutdown/startup to take that much time, I guess.

Also, since the other weird things you faced might have been caused by the previous bad behavior, please give it another try from a clean state if possible.

vstax commented 7 years ago

@mocchira sure, I will (I'll roll back just the storage nodes to an older snapshot first; if that doesn't work, the whole cluster). But isn't the manager supposed to retry deletion of objects from the bucket now, including across storage node restarts and even a storage node losing its queue? Is there some simple way to diagnose why it doesn't happen (or happens, but the nodes refuse to accept the job)? I was pretty sure that restarting either the storage or the manager nodes - or, in the worst case, both, like I did - should at the very least make it try to continue the deletion.

Also, a somewhat random question about the (quite rare, but possible) case of a new node being introduced during deletion of a large bucket: a new node is added to the cluster, then rebalance is launched. Will the objects that weren't yet deleted on the other nodes get pushed to this node as part of the rebalance operation and never be deleted, or will the "delete bucket" job be pushed to this node as well, so that even if it temporarily receives some of the objects, they will be removed in the end anyway?

EDIT: Rolling back the storage nodes (then installing the latest version and launching them) with the current version of the manager node (which has the bucket deletion in the "enqueuing" state) did not help; I tried restarting everything and removing all queues on the storage nodes, including the "delete bucket" queue, but still no changes. I think there is either some bug here (as I understand the intended implementation, the manager is supposed to re-queue the bucket deletion request since it wasn't completed and the storage nodes currently aren't doing the deletion), or maybe I'm misunderstanding the logic? If a storage node isn't supposed to continue deletion like that, shouldn't there be some knob on the manager node, like a "delete-deleted-bucket" command or something :) Because currently, in this situation, I can't re-create the bucket (it's forbidden) in order to delete it again, so it's not obvious how to get out of this state. I think - if the aim is a really reliable "delete bucket" operation - more experiments and testing are needed here, but first things first.

Rolling back everything and repeating the delete-bucket command: it works (well, mostly). Here is the error log from storage_0 - I filtered out all the numerous "Replicate failure" and "cause,timeout" messages:

[E] storage_0@192.168.3.53  2017-07-14 21:19:55.575122 +0300    1500056395  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53  2017-07-14 21:20:49.777239 +0300    1500056449  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53  2017-07-14 21:22:03.488293 +0300    1500056523  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53  2017-07-14 21:23:46.42101 +0300 1500056626  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53  2017-07-14 21:24:17.690158 +0300    1500056657  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_0@192.168.3.53  2017-07-14 21:24:17.698517 +0300    1500056657  leo_storage_handler_object:replicate_fun/3  1399    [{cause,"Could not get a metadata"}]
[E] storage_0@192.168.3.53  2017-07-14 21:24:17.702573 +0300    1500056657  leo_storage_handler_object:put/4    416 [{from,storage},{method,delete},{key,<<"bodytest/4b/46/38/4b463858a715ee23a484a29121097c6a9caa3b65af7f1c901de4850123545fb5fa3845145ead1229a748607308f2867628b8000000000000.xz">>},{req_id,0},{cause,"Could not get a metadata"}]
[E] storage_0@192.168.3.53  2017-07-14 21:26:20.829152 +0300    1500056780  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_0@192.168.3.53  2017-07-14 21:26:28.113795 +0300    1500056788  leo_storage_replicator:replicate_fun/2243   [{key,<<"bodytest/05/2b/30/052b30c599682a3b8b328114610e406a5458b05102cfdda2656f4095589ae588ad3773646d0c78d843b30a441e2f58409815010000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_5,{get,<<"bodytest/05/2b/30/052b30c599682a3b8b328114610e406a5458b05102cfdda2656f4095589ae588ad3773646d0c78d843b30a441e2f58409815010000000000.xz">>},30000]}}}]
[E] storage_0@192.168.3.53  2017-07-14 21:28:51.495146 +0300    1500056931  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53  2017-07-14 21:31:40.922056 +0300    1500057100  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53  2017-07-14 21:34:25.719290 +0300    1500057265  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,40,178,86,218,64,6,172,208,7,135,201,36,182,30,23,89,109,0,0,0,133,98,111,100,121,116,101,115,116,47,99,53,47,56,54,47,54,100,47,99,53,56,54,54,100,102,51,49,102,56,53,97,54,53,50,99,54,54,99,49,101,101,49,98,49,55,99,54,48,98,54,55,57,54,56,52,100,53,49,51,57,51,54,102,48,102,102,51,101,101,52,50,98,101,97,54,52,100,98,52,57,98,53,53,102,53,49,98,100,99,56,54,49,101,53,55,98,101,99,55,54,98,48,101,52,50,100,53,53,101,54,99,101,100,52,48,48,51,50,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]

The "leo_storage_handler_object:put/4" line is put of 0-byte object before deletion? So it happens during new delete-bucket operation as well, is this by design?

Similar log from storage_1:

[E] storage_1@192.168.3.54  2017-07-14 21:19:23.319297 +0300    1500056363  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54  2017-07-14 21:20:06.136898 +0300    1500056406  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54  2017-07-14 21:21:13.620240 +0300    1500056473  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54  2017-07-14 21:22:42.229278 +0300    1500056562  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54  2017-07-14 21:24:35.12173 +0300 1500056675  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54  2017-07-14 21:25:06.536219 +0300    1500056706  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54  2017-07-14 21:27:22.907614 +0300    1500056842  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_1@192.168.3.54  2017-07-14 21:27:27.993717 +0300    1500056847  leo_storage_replicator:replicate_fun/2243   [{key,<<"bodytest/6b/10/b9/6b10b9777d3084b24f98defbd4260844227241b532cfba26a3ea0e4172bef61ff6dcd44a6f77434f434d926565e0b7c9d97e580000000000.xz\n2">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_5,{get,<<"bodytest/6b/10/b9/6b10b9777d3084b24f98defbd4260844227241b532cfba26a3ea0e4172bef61ff6dcd44a6f77434f434d926565e0b7c9d97e580000000000.xz\n2">>},30000]}}}]
[W] storage_1@192.168.3.54  2017-07-14 21:27:38.231337 +0300    1500056858  leo_storage_handler_object:replicate_fun/3  1399    [{cause,"Could not get a metadata"}]
[E] storage_1@192.168.3.54  2017-07-14 21:27:38.231656 +0300    1500056858  leo_storage_handler_object:put/4    416 [{from,storage},{method,delete},{key,<<"bodytest/59/e9/af/59e9afbd96a2a8e107c7fcabf679a6b353c3c1c2d365567161cefef081bb390a684367eb4ecb1b8e6cd7a7e77a21f14f087d380100000000.xz\n1">>},{req_id,0},{cause,"Could not get a metadata"}]
[E] storage_1@192.168.3.54  2017-07-14 21:30:07.153217 +0300    1500057007  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54  2017-07-14 21:33:12.536375 +0300    1500057192  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_1@192.168.3.54  2017-07-14 21:33:12.718884 +0300    1500057192  leo_storage_replicator:replicate_fun/2243   [{key,<<"bodytest/00/06/94/0006943d1713c0a231bcefe412a8dd9287bfde37f5177f02d13f35b8cd6507b4b074d7e8143b7e9b7ecc8e9ded4043dba013010000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_7,{get,<<"bodytest/00/06/94/0006943d1713c0a231bcefe412a8dd9287bfde37f5177f02d13f35b8cd6507b4b074d7e8143b7e9b7ecc8e9ded4043dba013010000000000.xz">>},30000]}}}]
[W] storage_1@192.168.3.54  2017-07-14 21:35:16.998265 +0300    1500057316  leo_storage_handler_directory:find_by_parent_dir/4  78  [{errors,[]},{bad_nodes,['storage_2@192.168.3.55','storage_1@192.168.3.54']},{cause,"Could not get metadatas"}]
[W] storage_1@192.168.3.54  2017-07-14 21:35:59.718192 +0300    1500057359  leo_storage_handler_directory:find_by_parent_dir/4  78  [{errors,[]},{bad_nodes,['storage_2@192.168.3.55']},{cause,"Could not get metadatas"}]
[E] storage_1@192.168.3.54  2017-07-14 21:37:59.412479 +0300    1500057479  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_1@192.168.3.54  2017-07-14 21:37:59.414564 +0300    1500057479  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,57,245,188,171,29,125,219,95,10,67,244,173,201,222,171,75,109,0,0,0,133,98,111,100,121,116,101,115,116,47,52,99,47,49,48,47,53,49,47,52,99,49,48,53,49,97,101,97,54,53,55,98,100,51,102,100,99,51,54,98,52,54,56,50,51,102,101,54,52,101,102,48,56,99,101,97,101,50,98,52,49,54,51,56,48,98,55,51,97,52,55,54,52,97,49,99,98,56,49,102,57,100,48,101,101,98,102,100,50,57,55,53,52,49,98,99,55,100,54,52,48,51,99,98,54,52,56,48,50,50,48,101,53,98,97,48,48,53,52,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_1@192.168.3.54  2017-07-14 21:37:59.420628 +0300    1500057479  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_1@192.168.3.54  2017-07-14 21:37:59.421313 +0300    1500057479  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.538.0> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_1@192.168.3.54  2017-07-14 21:40:38.343404 +0300    1500057638  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_1@192.168.3.54  2017-07-14 21:40:38.344038 +0300    1500057638  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_1@192.168.3.54  2017-07-14 21:40:38.344458 +0300    1500057638  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.14616.18> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated

This one looks worse. Is it bad?

For storage_2:

[E] storage_2@192.168.3.55  2017-07-14 21:19:25.76735 +0300 1500056365  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:20:08.735703 +0300    1500056408  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:21:16.562799 +0300    1500056476  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:22:46.88858 +0300 1500056566  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:24:41.374639 +0300    1500056681  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:25:11.410660 +0300    1500056711  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:25:41.491085 +0300    1500056741  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:27:54.329783 +0300    1500056874  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:30:43.399117 +0300    1500057043  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55  2017-07-14 21:33:48.193631 +0300    1500057228  leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_2@192.168.3.55  2017-07-14 21:34:46.951687 +0300    1500057286  leo_storage_handler_directory:find_by_parent_dir/4  78  [{errors,[]},{bad_nodes,['storage_2@192.168.3.55','storage_1@192.168.3.54']},{cause,"Could not get metadatas"}]
[W] storage_2@192.168.3.55  2017-07-14 21:35:29.724027 +0300    1500057329  leo_storage_handler_directory:find_by_parent_dir/4  78  [{errors,[]},{bad_nodes,['storage_2@192.168.3.55','storage_1@192.168.3.54']},{cause,"Could not get metadatas"}]
[E] storage_2@192.168.3.55  2017-07-14 21:39:56.487206 +0300    1500057596  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_2@192.168.3.55  2017-07-14 21:39:56.493595 +0300    1500057596  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_2@192.168.3.55  2017-07-14 21:39:56.494257 +0300    1500057596  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.545.0> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated

"Replicate failure" errors look like this:

error.20170714.21.2:[E] storage_0@192.168.3.53  2017-07-14 21:21:23.29522 +0300 1500056483  leo_storage_handler_object:delete/3 569 [{from,gateway},{method,del},{key,<<"bodytest/c0/cc/45/c0cc45da65c101e4ab7e910895dc21c00cbdfbb7e0e4dc95ddab44809a28ee37185a026604138ee5a7c0cade881826d1c8035c0000000000.xz\n2">>},{req_id,0},{cause,"Replicate failure"}]
error.20170714.21.2:[E] storage_0@192.168.3.53  2017-07-14 21:22:53.353052 +0300    1500056573  leo_storage_handler_object:delete/3 569 [{from,gateway},{method,del},{key,<<"bodytest/57/55/49/575549c0cc19ff922b960336f0961070f272a57b8529e56b03b5efe1767805838d3b3a3c78f88d03f58af214c3244ece5ba62b0100000000.xz\n4">>},{req_id,0},{cause,"Replicate failure"}]
error.20170714.21.2:[E] storage_0@192.168.3.53  2017-07-14 21:22:55.907665 +0300    1500056575  leo_storage_handler_object:delete/3 569 [{from,gateway},{method,del},{key,<<"bodytest/fc/9c/02/fc9c02b27fee332a60f0e83b1d694a519af941484a0c177187843589c37986893b5a2bef48f99b38f2756285d09251cccff3040100000000.xz\n4">>},{req_id,0},{cause,"Replicate failure"}]

Is that ("from,gateway") part of the message supposed to be there? I did the bucket deletion directly from the manager, so the gateway wasn't involved, I think?

Now about problems:

  1. For two nodes - storage_0 and storage_1 - the bucket deletion stats went "enqueuing->monitoring->finished" fine, however it took a really long time to switch from "monitoring" to "finished". E.g. on storage_0 the deletion queue was empty at around 21:40, but the state switched from "monitoring" to "finished" only about 25 minutes later. Also, the message "dequeued and removed" never appeared in the log file. Not saying this is a serious problem, just mentioning it in case.
  2. For storage_1, the message appears in the log:
    [I] storage_1@192.168.3.54  2017-07-14 22:07:50.544929 +0300    1500059270  leo_storage_handler_del_directory:run/5558  [{"msg: dequeued and removed (bucket)",<<"bodytest">>}]

    however it happened only about 20 minutes after the bucket deletion seemingly finished. Again, for most of this time the state was "monitoring" (not sure whether there is a correlation and it was like that the whole time, though).

  3. For storage_2, the message appeared
    [I] storage_2@192.168.3.55  2017-07-14 22:07:53.719258 +0300    1500059273  leo_storage_handler_del_directory:run/5558  [{"msg: dequeued and removed (bucket)",<<"bodytest">>}]

    however the state never changed from "enqueuing" (which is shown with the current time)! The deletion process completed a long time ago, but the manager doesn't seem to get that information:

    
    [root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
    - Bucket: bodytest
    node                         | node             | state
    -----------------------------+------------------+-----------------------------
    storage_2@192.168.3.55       | finished         | 2017-07-14 22:07:53 +0300
    storage_1@192.168.3.54       | finished         | 2017-07-14 22:07:50 +0300
    storage_0@192.168.3.53       | enqueuing        | 2017-07-14 22:30:02 +0300               

    [root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|head -5
                  id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
    --------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
     leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs
     leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
     leo_delete_dir_queue_1         |   idling    | 0              | 1600           | 500            | deletion bucket #1

    [root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_2@192.168.3.55
     active number of objects: 742982
      total number of objects: 1493689
       active size of objects: 105167655058
        total size of objects: 210052283999
         ratio of active size: 50.07%
        last compaction start: ____-__-__ __:__:__
          last compaction end: ____-__-__ __:__:__


This is the most serious problem here, I think.

Because of this, I'm unable to check whether 100% of the objects were removed or whether the errors and timeouts caused some to remain (the easiest way to check would be to re-create the bucket, and "diagnose-start" will show me all these removed objects, so I don't want to rely on it yet).
mocchira commented 7 years ago

@vstax Thanks for trying.

But isn't the manager supposed to retry deletion of objects from the bucket now, including across storage node restarts and even a storage node losing its queue? Is there some simple way to diagnose why it doesn't happen (or happens, but the nodes refuse to accept the job)? I was pretty sure that restarting either the storage or the manager nodes - or, in the worst case, both, like I did - should at the very least make it try to continue the deletion.

Yes, it's supposed to retry. However the first 1.3.5-rc3 (without my patch) had a serious problem that could leave the internal state inconsistent, so I'd recommend trying from a clean state to be safe. Please let me know if you face the same problem when using 1.3.5-rc3 with my patch and data created with >= that version.

Also, a somewhat random question about the (quite rare, but possible) case of a new node being introduced during deletion of a large bucket: a new node is added to the cluster, then rebalance is launched. Will the objects that weren't yet deleted on the other nodes get pushed to this node as part of the rebalance operation and never be deleted, or will the "delete bucket" job be pushed to this node as well, so that even if it temporarily receives some of the objects, they will be removed in the end anyway?

Good question. When a rebalance is launched while delete-bucket is ongoing, objects that belonged to the deleted bucket are not transferred to the new node, so there is no need to delete anything on the new node.

Rolling back the storage nodes (then installing the latest version and launching them) with the current version of the manager node (which has the bucket deletion in the "enqueuing" state) did not help; I tried restarting everything and removing all queues on the storage nodes, including the "delete bucket" queue, but still no changes. I think there is either some bug here (as I understand the intended implementation, the manager is supposed to re-queue the bucket deletion request since it wasn't completed and the storage nodes currently aren't doing the deletion), or maybe I'm misunderstanding the logic? If a storage node isn't supposed to continue deletion like that, shouldn't there be some knob on the manager node, like a "delete-deleted-bucket" command or something :) Because currently, in this situation, I can't re-create the bucket (it's forbidden) in order to delete it again, so it's not obvious how to get out of this state. I think - if the aim is a really reliable "delete bucket" operation - more experiments and testing are needed here, but first things first.

Got it. Since there are records in mnesia on the manager(s) managing the delete-bucket stats (pending, enqueuing, monitoring, finished), the usual rollback strategy doesn't work if there is any difference between the manager(s) and the storage node(s). So, as you said, providing a command like the (delete-deleted-bucket) you suggested might be needed to be safe. We'll consider adding such a command with an appropriate name.

Rolling back everything and repeating the delete-bucket command: it works (well, mostly).

Could you elaborate on what "repeating" means?

The "leo_storage_handler_object:put/4" line is put of 0-byte object before deletion? So it happens during new delete-bucket operation as well, is this by design?

put/4 can be called with the delete flag set to true when replicating a delete request on a remote node, so yes, it is by design.

This one looks worse. Is it bad?

Yes, it seems to be bad. What really matters is that leo_storage_handler_del_directory:insert_messages failed many times. (This function should not fail under normal circumstances, so I will vet how/when it can happen.)

Is it supposed to be like that ("from,gateway") message? I did bucket deletion directly from manager, so gateway wasn't involved, I think?

Yes, that kind of error should not appear in the log; I will vet it.

For two nodes - storage_0 and storage_1 - the bucket deletion stats went "enqueuing->monitoring->finished" fine, however it took a really long time to switch from "monitoring" to "finished". E.g. on storage_0 the deletion queue was empty at around 21:40, but the state switched from "monitoring" to "finished" only about 25 minutes later. Also, the message "dequeued and removed" never appeared in the log file. Not saying this is a serious problem, just mentioning it in case.

Taking time to switch from "monitoring" to "finished" is expected. Let me explain what each state means.

Also, the message "dequeued and removed" never appeared in the log file. Not saying this is a serious problem, just mentioning it in case.

It's problematic.

however the state never changed from "enqueuing" (which is shown with the current time)! The deletion process completed a long time ago, but the manager doesn't seem to get that information:

The same root problem could cause both this behavior and the above case where "dequeued and removed" never appeared. I will vet it in depth.

mocchira commented 7 years ago

@vstax I found another problem that causes leo_storage_handler_del_directory:insert_messages to fail. Please give it another try from a clean state once https://github.com/leo-project/leo_backend_db/pull/11 gets merged.

Note: the other fixes I mentioned in the above comment are WIP (I will send a PR tomorrow).

mocchira commented 7 years ago

@vstax the other issues described in the above comment will be fixed once https://github.com/leo-project/leofs/pull/786 gets merged.

Got it. Since there are records in mnesia on the manager(s) managing the delete-bucket stats (pending, enqueuing, monitoring, finished), the usual rollback strategy doesn't work if there is any difference between the manager(s) and the storage node(s). So, as you said, providing a command like the (delete-deleted-bucket) you suggested might be needed to be safe. We'll consider adding such a command with an appropriate name.

The reset-delete-bucket-stats command has been added.
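
A rough usage sketch - assuming the command takes the bucket name as its argument, like delete-bucket does; the exact syntax may differ:

# hypothetical invocation; check the leofs-adm usage output for the actual arguments
[root@leo-m0 ~]# /usr/local/bin/leofs-adm reset-delete-bucket-stats bodytest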

Is that ("from,gateway") part of the message supposed to be there? I did the bucket deletion directly from the manager, so the gateway wasn't involved, I think?

Yes, that kind of error should not appear in the log; I will vet it.

{from, storage} or {from, leo_mq} is now output instead of {from, gateway}.

The same root problem could cause both this behavior and the above case where "dequeued and removed" never appeared. I will vet it in depth.

Fixed so that the finish notification to the managers always happens (a strict retry mechanism has been implemented).

Please check these improvements along with the fix on leo_backend_db.

vstax commented 7 years ago

@mocchira Thank you for your support.

Rolling back everything and repeating the delete-bucket command: it works (well, mostly).

Could you elaborate on what "repeating" means?

Just doing "delete-bucket" here; "repeating" as in repeating the experiment because I rolled back managers as well to the state when "delete-bucket" was never executed.

I've tried deleting the same bucket with these changes, and I don't think the fix for the "finish notification" works; the state remains "enqueuing" even 1 hour after the delete operation has finished. This happens on all storage nodes:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node                         | node             | state                       
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55       | enqueuing        | 2017-07-19 20:27:27 +0300               
storage_1@192.168.3.54       | enqueuing        | 2017-07-19 20:27:26 +0300               
storage_0@192.168.3.53       | enqueuing        | 2017-07-19 20:27:28 +0300               

Error log on storage_0:

[W] storage_0@192.168.3.53  2017-07-19 19:10:49.513796 +0300    1500480649  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/00/e4/ce/00e4ce45fd1c5de4122221d44289f4ace93dc0d046fead4f4d3549b7b756af04621a4ab684a1c7db7b8d5f017555484d90d7000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:10:50.491836 +0300    1500480650  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/00/e4/ce/00e4ce45fd1c5de4122221d44289f4ace93dc0d046fead4f4d3549b7b756af04621a4ab684a1c7db7b8d5f017555484d90d7000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:11:29.514890 +0300    1500480689  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/01/42/6d/01426d00bc6a92b42e0399efbced9661579b45cc7001cd89db32d95c9930a2dd0b8f95122da45bf9be35de95dd816c5df2f6000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:11:30.479497 +0300    1500480690  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/01/42/6d/01426d00bc6a92b42e0399efbced9661579b45cc7001cd89db32d95c9930a2dd0b8f95122da45bf9be35de95dd816c5df2f6000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:11:52.677341 +0300    1500480712  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz\n4">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:11:52.686279 +0300    1500480712  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:11:53.675244 +0300    1500480713  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:11:53.677522 +0300    1500480713  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz\n4">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:11:53.677857 +0300    1500480713  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:01.484527 +0300    1500480721  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/02/a2/84/02a28429564d4d1430fdfb6603516bfdae6ad7af6b5c53115dca766df28cb4e1c950bf84d35890e1bda40b396833b50dbc76010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:07.455874 +0300    1500480727  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/02/a2/84/02a28429564d4d1430fdfb6603516bfdae6ad7af6b5c53115dca766df28cb4e1c950bf84d35890e1bda40b396833b50dbc76010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:08.404983 +0300    1500480728  leo_storage_handler_object:replicate_fun/31406  [{cause,"Could not get a metadata"}]
[E] storage_0@192.168.3.53  2017-07-19 19:12:08.405934 +0300    1500480728  leo_storage_handler_object:put/4    423 [{from,storage},{method,delete},{key,<<"bodytest/0f/1b/2f/0f1b2fbb7a99d5d05ab57604781b3b9aef44f62a7666b293c97bc1814c23025174f50dfdbdb7ae67118dbb3054c4dd71e87ca30000000000.xz\n2">>},{req_id,0},{cause,"Could not get a metadata"}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:08.656845 +0300    1500480728  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/01/a7/e7/01a7e77059f793397d743aa7d6564c1bd8455087dfa98d81d37c3c68f32a33157665a8c7a2a83006b51ae7774bc387d40cec000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:09.633137 +0300    1500480729  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/01/a7/e7/01a7e77059f793397d743aa7d6564c1bd8455087dfa98d81d37c3c68f32a33157665a8c7a2a83006b51ae7774bc387d40cec000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:40.635781 +0300    1500480760  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/01/79/02/0179021507a19bc291d357e2af7c8b384ed202399013a013cc4fe4ffdcba45c0ce4330af290e19ff07c382a74d0d403628fe010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:52.98035 +0300 1500480772  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/01/79/02/0179021507a19bc291d357e2af7c8b384ed202399013a013cc4fe4ffdcba45c0ce4330af290e19ff07c382a74d0d403628fe010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:53.43253 +0300 1500480773  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_7,{get,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},30000]}}}]
[W] storage_0@192.168.3.53  2017-07-19 19:12:54.43895 +0300 1500480774  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:13:23.49232 +0300 1500480803  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:13:24.113435 +0300    1500480804  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:13:25.50324 +0300 1500480805  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/07/53/51/0753515f78df7f271a5e61c20bcd36a1a8d600cd0c592dfb875de2d4f1aedb207b80a43cf724051b6552bb6e539e9afc0027020000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:13:34.913529 +0300    1500480814  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/07/53/51/0753515f78df7f271a5e61c20bcd36a1a8d600cd0c592dfb875de2d4f1aedb207b80a43cf724051b6552bb6e539e9afc0027020000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53  2017-07-19 19:13:34.914408 +0300    1500480814  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{cause,timeout}]

on storage_1:

[W] storage_1@192.168.3.54  2017-07-19 19:09:50.480513 +0300    1500480590  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:09:50.486326 +0300    1500480590  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:09:51.476814 +0300    1500480591  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:09:51.477865 +0300    1500480591  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:09:51.478555 +0300    1500480591  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-19 19:10:31.563548 +0300    1500480631  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/06/be/2c/06be2cb88c9fb144797158d9b4dceaa6a7985bb629e4fbb6eda7ef96916aee78fc2463cc1b1f8cd8f78790919fae115314ac050000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:10:47.178852 +0300    1500480647  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/06/be/2c/06be2cb88c9fb144797158d9b4dceaa6a7985bb629e4fbb6eda7ef96916aee78fc2463cc1b1f8cd8f78790919fae115314ac050000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:07.928271 +0300    1500480667  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:12.860699 +0300    1500480672  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:12.865019 +0300    1500480672  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:13.859764 +0300    1500480673  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:13.860610 +0300    1500480673  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:13.861225 +0300    1500480673  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:28.692642 +0300    1500480688  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:47.549966 +0300    1500480707  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/05/e1/67/05e1671b6b5a22a1d3d19f6b635298859fb4ba66a08b31ff251c9004acc8096ea20b01c55bb8fe68fac3f9a1b41cd5dec580000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:48.547618 +0300    1500480708  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/05/e1/67/05e1671b6b5a22a1d3d19f6b635298859fb4ba66a08b31ff251c9004acc8096ea20b01c55bb8fe68fac3f9a1b41cd5dec580000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:51.430967 +0300    1500480711  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz\n1">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:51.431480 +0300    1500480711  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:52.418610 +0300    1500480712  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:52.430601 +0300    1500480712  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz\n1">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:52.430914 +0300    1500480712  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:55.542012 +0300    1500480715  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:55.543962 +0300    1500480715  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:56.542039 +0300    1500480716  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:56.542573 +0300    1500480716  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:11:56.542915 +0300    1500480716  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:34.366773 +0300    1500480754  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/02/c4/1b/02c41b78d3d44275fd5d817387b5c21bb1f5ffb907af90c20042428c1671a36357253eafe70aaaa0bb64876d52c7b1e3580b040000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:34.737725 +0300    1500480754  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:34.742791 +0300    1500480754  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:35.738158 +0300    1500480755  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:35.738436 +0300    1500480755  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:35.738732 +0300    1500480755  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:44.985651 +0300    1500480764  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz\n4">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:44.987551 +0300    1500480764  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:45.985933 +0300    1500480765  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:45.986515 +0300    1500480765  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz\n4">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:45.986925 +0300    1500480765  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-19 19:12:52.102327 +0300    1500480772  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/02/c4/1b/02c41b78d3d44275fd5d817387b5c21bb1f5ffb907af90c20042428c1671a36357253eafe70aaaa0bb64876d52c7b1e3580b040000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:13:19.736093 +0300    1500480799  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:13:27.36886 +0300 1500480807  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:13:27.38654 +0300 1500480807  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:13:28.36973 +0300 1500480808  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:13:28.37329 +0300 1500480808  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54  2017-07-19 19:13:28.37599 +0300 1500480808  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-19 19:13:34.918759 +0300    1500480814  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{cause,timeout}]
[E] storage_1@192.168.3.54  2017-07-19 19:13:42.151582 +0300    1500480822  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,17,166,100,223,94,65,242,145,215,154,235,43,100,217,227,139,109,0,0,0,133,98,111,100,121,116,101,115,116,47,53,53,47,57,102,47,97,48,47,53,53,57,102,97,48,51,51,98,52,49,54,99,57,97,48,57,56,57,50,52,53,102,102,99,52,48,48,57,52,57,54,53,49,50,54,99,55,51,51,102,102,49,50,57,98,57,97,50,97,99,51,50,52,48,52,98,100,50,55,98,49,48,52,53,49,102,48,57,54,55,54,48,49,49,55,50,99,53,55,48,48,97,57,48,50,98,54,102,52,53,53,50,102,100,51,56,56,52,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]

on storage_2:

[W] storage_2@192.168.3.55  2017-07-19 19:10:28.575244 +0300    1500480628  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz\n2">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:28.579190 +0300    1500480628  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:29.526649 +0300    1500480629  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:29.528093 +0300    1500480629  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz\n3">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:29.571610 +0300    1500480629  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:29.574350 +0300    1500480629  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz\n2">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:29.574791 +0300    1500480629  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:30.514094 +0300    1500480630  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:30.527017 +0300    1500480630  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz\n3">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:30.527362 +0300    1500480630  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:32.575664 +0300    1500480632  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/1a/47/31/1a47310303359a0d6341c0486f945c79977aee816b95ee2abb54eb94aaa130a3695993e4c22be6d7954b49fd01eb5523004a020000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:10:47.179175 +0300    1500480647  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/1a/47/31/1a47310303359a0d6341c0486f945c79977aee816b95ee2abb54eb94aaa130a3695993e4c22be6d7954b49fd01eb5523004a020000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:11:06.923457 +0300    1500480666  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/01/03/fe/0103fef5e9c71fd48065caa8f376e6c4920b8396c87019151db19479c313e13fc7e50b8856c59d674221b5d2aeade507b04c010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:11:28.693260 +0300    1500480688  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/01/03/fe/0103fef5e9c71fd48065caa8f376e6c4920b8396c87019151db19479c313e13fc7e50b8856c59d674221b5d2aeade507b04c010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:11:48.595501 +0300    1500480708  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/05/2b/30/052b30c599682a3b8b328114610e406a5458b05102cfdda2656f4095589ae588ad3773646d0c78d843b30a441e2f58409815010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:12:07.456616 +0300    1500480727  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/05/2b/30/052b30c599682a3b8b328114610e406a5458b05102cfdda2656f4095589ae588ad3773646d0c78d843b30a441e2f58409815010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:12:34.88932 +0300 1500480754  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/01/79/02/0179021507a19bc291d357e2af7c8b384ed202399013a013cc4fe4ffdcba45c0ce4330af290e19ff07c382a74d0d403628fe010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:12:34.595173 +0300    1500480754  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:12:34.605214 +0300    1500480754  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz\n1">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:12:35.591132 +0300    1500480755  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:12:35.605678 +0300    1500480755  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz\n1">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:12:35.606239 +0300    1500480755  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55  2017-07-19 19:12:52.98514 +0300 1500480772  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/01/79/02/0179021507a19bc291d357e2af7c8b384ed202399013a013cc4fe4ffdcba45c0ce4330af290e19ff07c382a74d0d403628fe010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:20.645636 +0300    1500480800  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/0a/58/4d/0a584d8973d607dec93b1e71b4c3586c3fb15746de8a08dfc70a5bc4993c094629cf1077d7a1ecb6406f4f2ba8ca514d0600100000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:26.602468 +0300    1500480806  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz\n3">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:26.603372 +0300    1500480806  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:26.716346 +0300    1500480806  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz\n1">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:26.717134 +0300    1500480806  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:27.596297 +0300    1500480807  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:27.601163 +0300    1500480807  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz\n3">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:27.601912 +0300    1500480807  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:27.715564 +0300    1500480807  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:27.716051 +0300    1500480807  leo_storage_replicator:replicate/5  123 [{method,delete},{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz\n1">>},{cause,timeout}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:27.716423 +0300    1500480807  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55  2017-07-19 19:13:34.914364 +0300    1500480814  leo_storage_replicator:loop/6   216 [{method,delete},{key,<<"bodytest/0a/58/4d/0a584d8973d607dec93b1e71b4c3586c3fb15746de8a08dfc70a5bc4993c094629cf1077d7a1ecb6406f4f2ba8ca514d0600100000000000.xz">>},{cause,timeout}]
[E] storage_2@192.168.3.55  2017-07-19 19:13:37.717090 +0300    1500480817  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,16,177,15,204,176,30,198,228,202,120,84,124,144,181,60,18,109,0,0,0,133,98,111,100,121,116,101,115,116,47,102,97,47,53,50,47,55,53,47,102,97,53,50,55,53,102,54,98,101,52,100,50,54,49,53,98,55,53,52,102,48,54,101,55,99,102,50,50,56,100,98,55,101,57,55,100,99,48,53,97,49,50,51,97,57,53,51,98,102,101,54,99,50,97,49,100,48,49,51,55,99,55,99,56,49,98,101,49,55,102,57,56,51,56,102,56,100,52,49,100,48,101,99,57,57,101,49,99,52,52,49,48,54,101,54,48,48,53,56,48,97,48,49,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]

(This leo_delete_dir_queue_1_1 line is present only on storage_1 and storage_2, but all nodes failed to switch from "enqueuing" to "monitoring".)

There is nothing in the info logs except for the original "msg: enqueued" line at the start and lots of {cause,"slow operation"},{method,head}. The "dequeued and removed" message is not present on any node.

I deleted the second bucket after that and the result was the same: no "dequeued and removed" message on any node, and the state at the manager is stuck at "enqueuing".
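For anyone reproducing this: a simple way to watch whether the state ever leaves "enqueuing" is to poll the same two commands used above in a loop. A rough sketch (the node names and the 30-second interval are from this cluster, and the grep pattern may need adjusting for your queue names):

#!/bin/sh
# Poll the manager's delete-bucket state and each node's deletion queues,
# so it is easy to see if/when a node ever leaves "enqueuing".
NODES="storage_0@192.168.3.53 storage_1@192.168.3.54 storage_2@192.168.3.55"
while true; do
    date
    /usr/local/bin/leofs-adm delete-bucket-stats
    for n in $NODES; do
        echo "== $n =="
        /usr/local/bin/leofs-adm mq-stats "$n" | grep -E 'deletion|delete_dir'
    done
    sleep 30
done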

There is another problem which I'm trying to confirm (maybe you could look at it from the code side as well?); it's quite minor, but still. This is the output of the "du" command right after the deletion of the second bucket finished:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
 active number of objects: 0
  total number of objects: 1354898
   active size of objects: 0
    total size of objects: 191667623380
     ratio of active size: 0.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
 active number of objects: 2035
  total number of objects: 1509955
   active size of objects: 235607379
    total size of objects: 213421617475
     ratio of active size: 0.11%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

This is after compaction:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
 active number of objects: 0
  total number of objects: 0
   active size of objects: 0
    total size of objects: 0
     ratio of active size: 0%
    last compaction start: 2017-07-19 21:15:41 +0300
      last compaction end: 2017-07-19 21:24:38 +0300
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
 active number of objects: 0
  total number of objects: 0
   active size of objects: 0
    total size of objects: 0
     ratio of active size: 0%
    last compaction start: 2017-07-19 21:17:59 +0300
      last compaction end: 2017-07-19 21:27:16 +0300

It's as if the counters for storage_1 got broken during the deletion. I'm trying to verify right now whether they were correct from the start.
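A quick way to compare the counters across nodes over the course of a delete is to flatten the "du" output to one line per node and snapshot it before the delete, after the delete, and after compaction. A rough sketch (the node list is this cluster's):

#!/bin/sh
# Summarize the object counters from "du" as one line per node,
# so snapshots taken at different points in time are easy to diff.
for n in storage_0@192.168.3.53 storage_1@192.168.3.54 storage_2@192.168.3.55; do
    /usr/local/bin/leofs-adm du "$n" | awk -v node="$n" '
        /active number of objects/ {a=$NF}
        /total number of objects/  {t=$NF}
        END {printf "%-28s active=%s total=%s\n", node, a, t}'
done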

On the positive side: the performance is much higher than with the old implementation, e.g. it takes less than 10 minutes to enqueue 600,000 deletes and 20 minutes or so to actually delete that data; the deletes are mostly consumed at a steady 400-500 messages per second (which doesn't seem to depend on the "mq interval" parameter). Also, 100% of the objects were removed from both buckets.
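(A quick sanity check on those numbers, assuming a roughly steady rate: 600,000 deletes consumed over about 20 minutes is 600,000 / 1,200 s, i.e. roughly 500 messages per second, which lines up with the observed 400-500 msg/s.)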

EDIT: as an experiment, on the same cluster with no objects at all (the deletes have finished, compaction has been performed, /mnt/avs occupies less than 1 MB, and all queues are empty as well), I executed

[root@leo-m0 ~]# /usr/local/bin/leofs-adm reset-delete-bucket-stats bodytest
[root@leo-m0 ~]# /usr/local/bin/leofs-adm reset-delete-bucket-stats body

then on my system

$ s3cmd mb s3://bodytest
$ s3cmd rb s3://bodytest

The manager now shows

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node                         | state            | timestamp                   
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55       | enqueuing        | 2017-07-19 21:49:34 +0300               
storage_1@192.168.3.54       | enqueuing        | 2017-07-19 21:49:34 +0300               
storage_0@192.168.3.53       | enqueuing        | 2017-07-19 21:49:34 +0300               

and that's it. Nothing in the logs on the storage nodes at all, not even a "msg: enqueued" line! Also, I see minor spikes in CPU usage on the nodes from time to time; here is the leo_doctor log: https://pastebin.com/1KNpix5n

mocchira commented 7 years ago

@vstax Thanks for testing.

Just doing "delete-bucket" here; "repeating" as in repeating the experiment because I rolled back managers as well to the state when "delete-bucket" was never executed.

Got it.

I've tried deleting the same bucket with these changes, and I don't think that the fix for "finish notification" works; the state remains "enqueuing" even 1 hour after the delete operation has finished. This happens for all storage nodes: and that's it. Nothing in the logs on the storage nodes at all, not even a "msg: enqueued" line! Also, I see minor spikes in CPU usage on the nodes from time to time; here is the leo_doctor log: https://pastebin.com/1KNpix5n

Since there is a non-backward-compatible change around delete-stats handling on leo_storage, please also wipe out del_dir_queue on every leo_storage and try again from a clean state. That should work for you; if not, please let me know.
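For anyone following along, the clean-state procedure boils down to stopping each storage node, removing only the del_dir queue directory, and starting the node again. Roughly like this (the install prefix and the queue path under the work directory are assumptions, adjust them to wherever your queue data actually lives):

# on each storage node (example paths, not the only valid layout)
/usr/local/leofs/current/leo_storage/bin/leo_storage stop
rm -rf /usr/local/leofs/current/leo_storage/work/queue/del_dir
/usr/local/leofs/current/leo_storage/bin/leo_storage start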

On the positive side: the performance is much higher than with the old implementation, e.g. it takes less than 10 minutes to enqueue 600,000 deletes and 20 minutes or so to actually delete that data; the deletes are mostly consumed at a steady 400-500 messages per second (which doesn't seem to depend on the "mq interval" parameter). Also, 100% of the objects were removed from both buckets.

Great to hear that :)

vstax commented 7 years ago

@mocchira Thank you for the suggestion. However, it looks like I'm still having the same problem. Actually, deleting queue/del_dir definitely does help: for example, in the last experiment, when I was trying to create and remove a bucket on nodes without data, I started to get "enqueued" and "dequeued" messages and the state at the manager changes reliably. However, in the original experiment, where I delete the bigger bucket, the problem remains. I removed queue/del_dir before starting the nodes and deleted the bucket; here are the details.

Logs for storage_0, info and error (here and for the other nodes I removed all "slow operation" lines from the info logs and all "{method,delete} .. {cause,timeout}" lines from the error logs, as there were too many of them):

[I] storage_0@192.168.3.53  2017-07-20 19:26:43.86057 +0300 1500568003  leo_storage_handler_del_directory:run/5 141 [{"msg: enqueued",<<"bodytest">>}]
[I] storage_0@192.168.3.53  2017-07-20 20:05:34.652878 +0300    1500570334  leo_storage_handler_del_directory:run/5 575 [{"msg: dequeued and removed (bucket)",<<"bodytest">>}]
[W] storage_0@192.168.3.53  2017-07-20 19:30:18.378932 +0300    1500568218  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/00/da/65/00da6530e7432014585caaeae4af62590edfa53204fa5a0d9bc93fc61012eef837b7a62455faa9482aba94bce7b5012a805a5d0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_0@192.168.3.53  2017-07-20 19:30:27.509573 +0300    1500568227  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[E] storage_0@192.168.3.53  2017-07-20 19:33:21.198370 +0300    1500568401  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:33:21.218434 +0300    1500568401  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:33:21.219006 +0300    1500568401  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.526.0> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:33:21.247327 +0300    1500568401  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_3,{remove,<<131,104,2,110,16,0,10,212,194,135,40,132,128,5,96,246,93,208,19,74,201,203,109,0,0,0,133,98,111,100,121,116,101,115,116,47,50,57,47,97,55,47,57,98,47,50,57,97,55,57,98,101,50,52,55,102,49,52,50,97,50,55,49,51,98,98,56,101,101,57,54,57,98,57,55,52,98,56,55,98,101,53,53,48,100,54,102,55,101,100,52,101,57,101,49,50,97,100,56,54,100,56,50,102,54,98,51,50,48,51,50,54,50,102,100,48,54,98,49,100,97,99,49,52,51,100,98,97,52,57,99,57,51,52,52,100,52,57,57,50,57,102,56,50,101,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:33:42.271389 +0300    1500568422  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:33:42.272357 +0300    1500568422  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:33:42.272940 +0300    1500568422  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.24218.8> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:34:03.565223 +0300    1500568443  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:34:03.565692 +0300    1500568443  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:34:03.566404 +0300    1500568443  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.7324.9> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:34:03.661557 +0300    1500568443  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_2,{remove,<<131,104,2,110,16,0,16,156,221,202,70,24,61,7,88,75,118,163,245,145,8,86,109,0,0,0,133,98,111,100,121,116,101,115,116,47,49,57,47,49,49,47,102,50,47,49,57,49,49,102,50,99,53,53,100,101,102,54,97,54,50,99,53,97,51,102,97,49,98,98,56,100,98,102,97,101,54,56,54,56,99,101,99,53,100,97,53,54,57,98,99,49,99,99,51,53,97,97,50,102,49,50,49,50,98,53,56,48,97,101,56,49,54,98,52,101,56,51,97,100,100,51,100,56,99,101,55,99,56,56,57,57,97,48,48,49,101,51,56,97,53,52,52,57,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:34:16.569535 +0300    1500568456  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:34:16.570040 +0300    1500568456  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:34:16.570461 +0300    1500568456  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.22497.9> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:34:16.576021 +0300    1500568456  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,21,138,133,115,192,150,177,167,10,16,74,198,96,170,11,205,109,0,0,0,133,98,111,100,121,116,101,115,116,47,100,100,47,98,55,47,97,100,47,100,100,98,55,97,100,49,98,97,48,49,48,48,53,53,50,49,48,49,48,100,98,51,101,51,48,97,100,99,57,55,50,102,49,98,55,51,55,54,102,49,97,51,52,56,100,55,48,55,51,97,98,51,101,55,99,52,54,50,57,102,49,51,101,49,48,97,97,102,50,98,51,55,101,54,56,102,48,49,100,49,52,53,99,56,99,48,49,99,97,49,98,102,100,99,98,48,48,102,48,48,51,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:34:47.99576 +0300 1500568487  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:34:47.100200 +0300    1500568487  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:34:47.100717 +0300    1500568487  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.31476.9> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:35:00.102687 +0300    1500568500  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:35:00.103131 +0300    1500568500  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:35:00.103849 +0300    1500568500  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.21760.10> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:35:00.104430 +0300    1500568500  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,24,7,173,240,79,26,10,158,192,167,84,162,9,108,241,135,109,0,0,0,133,98,111,100,121,116,101,115,116,47,100,48,47,50,51,47,48,99,47,100,48,50,51,48,99,97,49,54,53,56,50,99,100,50,57,102,101,55,99,52,97,50,98,54,97,55,56,101,49,57,54,102,102,49,55,56,99,98,55,52,49,99,50,56,99,53,99,48,99,50,52,98,102,99,52,98,52,98,100,100,100,50,100,102,53,100,52,50,52,99,54,98,99,50,99,99,100,52,99,99,102,52,50,52,57,52,55,99,48,52,54,48,97,52,51,53,56,55,52,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:35:41.305520 +0300    1500568541  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:35:41.306059 +0300    1500568541  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:35:41.306489 +0300    1500568541  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.32632.10> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:35:41.381384 +0300    1500568541  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_4,{remove,<<131,104,2,110,16,0,19,5,174,44,231,161,153,57,126,104,105,200,4,187,126,79,109,0,0,0,133,98,111,100,121,116,101,115,116,47,101,101,47,52,49,47,100,97,47,101,101,52,49,100,97,101,97,99,53,57,50,54,55,101,51,49,52,98,53,54,48,49,51,55,98,49,55,56,57,55,100,48,56,49,54,53,54,52,54,53,53,52,56,97,48,51,52,100,55,56,52,57,100,49,50,55,50,56,98,99,57,99,55,55,97,97,97,101,101,48,49,52,52,101,98,54,51,51,100,102,54,50,101,53,98,49,51,101,49,97,99,50,102,97,98,99,48,50,102,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:36:20.311567 +0300    1500568580  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:36:20.312132 +0300    1500568580  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:36:20.312568 +0300    1500568580  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.378.12> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:36:41.793761 +0300    1500568601  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:36:41.795603 +0300    1500568601  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:36:41.795910 +0300    1500568601  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.29776.12> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:36:54.805977 +0300    1500568614  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:36:54.806443 +0300    1500568614  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:36:54.807702 +0300    1500568614  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.11474.13> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:36:54.808481 +0300    1500568614  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,29,18,190,9,50,26,84,237,156,114,100,101,24,63,18,136,109,0,0,0,133,98,111,100,121,116,101,115,116,47,50,51,47,99,98,47,56,50,47,50,51,99,98,56,50,102,57,100,50,53,52,101,102,48,98,98,48,53,53,53,54,50,53,57,101,97,53,55,97,51,50,55,102,55,51,97,102,54,54,49,52,101,57,53,57,50,98,101,99,57,54,100,52,56,57,50,53,53,52,99,49,102,52,53,52,99,97,48,53,57,100,48,100,53,48,55,98,56,51,100,54,102,101,102,101,100,53,52,48,50,53,55,49,97,99,48,99,57,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:37:23.723342 +0300    1500568643  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:37:23.723937 +0300    1500568643  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:37:23.724411 +0300    1500568643  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.19939.13> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:37:43.835582 +0300    1500568663  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:37:43.836148 +0300    1500568663  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:37:43.837542 +0300    1500568663  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.9223.14> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:38:04.918906 +0300    1500568684  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:38:04.920371 +0300    1500568684  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:38:04.921132 +0300    1500568684  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.22232.14> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:38:26.438963 +0300    1500568706  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:38:26.441240 +0300    1500568706  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:38:26.441819 +0300    1500568706  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.5686.15> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:39:31.659653 +0300    1500568771  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:39:31.660134 +0300    1500568771  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:39:31.661283 +0300    1500568771  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.20661.15> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:40:01.773351 +0300    1500568801  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:40:01.773860 +0300    1500568801  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:40:01.774190 +0300    1500568801  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.4586.17> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:40:22.745567 +0300    1500568822  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:40:22.746069 +0300    1500568822  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:40:22.746673 +0300    1500568822  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.28351.17> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:40:42.852335 +0300    1500568842  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:40:42.852770 +0300    1500568842  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:40:42.853120 +0300    1500568842  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.11759.18> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:40:55.854837 +0300    1500568855  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:40:55.855802 +0300    1500568855  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:40:55.856360 +0300    1500568855  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.28083.18> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:40:55.858391 +0300    1500568855  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,42,18,96,104,151,33,21,195,99,36,241,255,203,123,187,233,109,0,0,0,133,98,111,100,121,116,101,115,116,47,54,100,47,54,53,47,51,56,47,54,100,54,53,51,56,102,54,98,49,48,49,97,102,97,57,102,51,54,54,98,98,49,99,97,101,51,49,99,102,52,102,53,101,99,51,102,53,54,55,57,101,49,50,100,97,50,101,55,49,102,49,52,48,51,51,101,102,54,51,48,98,50,52,56,50,52,51,51,53,56,53,55,98,52,51,102,51,98,53,97,98,102,99,102,101,101,48,54,98,56,48,49,48,53,48,100,99,53,97,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:41:08.860290 +0300    1500568868  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:41:08.860750 +0300    1500568868  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:41:08.861189 +0300    1500568868  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.3751.19> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:41:08.864488 +0300    1500568868  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,42,93,84,246,102,162,36,191,248,134,50,209,160,246,12,89,109,0,0,0,133,98,111,100,121,116,101,115,116,47,57,56,47,51,56,47,98,101,47,57,56,51,56,98,101,51,97,101,100,98,51,48,55,97,56,53,48,100,53,97,100,57,52,99,101,102,57,102,101,50,55,101,97,56,51,56,50,54,52,51,101,54,100,101,99,53,54,49,57,54,55,56,57,49,50,50,100,101,48,57,99,55,100,48,52,100,51,53,102,54,54,48,56,53,55,54,49,56,99,55,99,53,99,56,100,55,53,102,102,50,55,50,49,51,99,48,48,53,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:41:21.862490 +0300    1500568881  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:41:21.865461 +0300    1500568881  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:41:21.866002 +0300    1500568881  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.11986.19> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:41:21.868401 +0300    1500568881  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,42,253,124,3,146,58,153,107,170,73,196,200,107,124,70,12,109,0,0,0,133,98,111,100,121,116,101,115,116,47,53,97,47,53,98,47,56,49,47,53,97,53,98,56,49,56,50,99,98,102,53,101,53,98,50,49,99,49,57,102,49,49,57,55,50,52,99,50,48,55,98,97,56,54,97,48,52,101,55,97,97,55,102,49,97,98,55,102,52,56,100,99,101,53,97,48,50,98,48,99,55,52,50,51,48,50,53,99,49,57,54,98,97,53,97,101,48,57,97,99,50,49,54,56,53,100,101,55,97,49,51,101,52,54,53,48,48,56,50,100,97,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:41:43.318645 +0300    1500568903  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:41:43.319131 +0300    1500568903  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:41:43.319606 +0300    1500568903  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.20867.19> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:42:13.168277 +0300    1500568933  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:42:13.168686 +0300    1500568933  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:42:13.169027 +0300    1500568933  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.2797.20> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:42:44.42270 +0300 1500568964  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:42:44.42725 +0300 1500568964  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:42:44.43206 +0300 1500568964  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.22965.20> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:42:44.116878 +0300    1500568964  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_3,{remove,<<131,104,2,110,16,0,41,233,152,250,25,234,185,3,159,249,195,75,6,193,234,149,109,0,0,0,133,98,111,100,121,116,101,115,116,47,53,99,47,55,99,47,48,101,47,53,99,55,99,48,101,102,56,52,98,97,51,55,57,53,56,101,57,102,52,55,101,54,56,54,54,55,51,48,51,54,50,100,53,52,99,57,55,48,98,99,49,100,55,97,56,52,102,55,99,99,55,55,102,57,53,56,98,99,53,49,98,100,48,51,52,53,54,52,49,53,49,49,100,100,52,99,97,101,48,102,55,49,55,53,56,49,97,52,51,50,57,48,54,51,97,49,56,99,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:43:43.107190 +0300    1500569023  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:43:43.107555 +0300    1500569023  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:43:43.107844 +0300    1500569023  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.12801.21> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:44:55.487350 +0300    1500569095  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:44:55.488374 +0300    1500569095  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:44:55.490281 +0300    1500569095  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.27006.22> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:45:44.440394 +0300    1500569144  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:45:44.446359 +0300    1500569144  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:45:44.450574 +0300    1500569144  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.16498.24> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:45:44.454278 +0300    1500569144  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,57,247,169,22,234,112,135,7,148,188,119,50,58,181,122,117,109,0,0,0,133,98,111,100,121,116,101,115,116,47,53,53,47,54,54,47,101,102,47,53,53,54,54,101,102,97,53,53,56,52,100,97,48,97,51,50,53,57,102,101,57,98,48,99,57,100,57,102,97,98,57,52,57,55,98,99,49,57,99,102,49,53,50,98,100,56,50,51,53,101,102,49,55,53,50,54,56,51,53,99,49,97,102,101,57,55,55,56,100,50,101,102,97,57,54,48,101,98,50,56,97,97,97,54,49,56,102,55,50,54,101,97,51,52,99,57,52,52,56,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53  2017-07-20 19:49:44.541355 +0300    1500569384  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:49:44.542521 +0300    1500569384  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:49:44.543099 +0300    1500569384  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.23966.25> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53  2017-07-20 19:52:08.383527 +0300    1500569528  null:null   0   gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53  2017-07-20 19:52:08.384080 +0300    1500569528  null:null   0   ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53  2017-07-20 19:52:08.384527 +0300    1500569528  null:null   0   Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.15232.31> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated

Logs for storage_1:

[I] storage_1@192.168.3.54  2017-07-20 19:26:43.56260 +0300 1500568003  leo_storage_handler_del_directory:run/5 141 [{"msg: enqueued",<<"bodytest">>}]
[I] storage_1@192.168.3.54  2017-07-20 19:29:40.188688 +0300    1500568180  null:null   0   ["alarm_handler",58,32,"{set,{system_memory_high_watermark,[]}}"]
[W] storage_1@192.168.3.54  2017-07-20 19:28:18.878858 +0300    1500568098  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-20 19:28:57.779793 +0300    1500568137  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-20 19:29:36.167953 +0300    1500568176  leo_storage_handler_object:replicate_fun/3  1406    [{cause,"Could not get a metadata"}]
[E] storage_1@192.168.3.54  2017-07-20 19:29:36.169588 +0300    1500568176  leo_storage_handler_object:put/4    423 [{from,storage},{method,delete},{key,<<"bodytest/f7/68/1f/f7681f320d1c516981326302016b6fcabf63e441cf19cc373d482bb7a4cb531d7d76e7b7fbfa1079b96749123d43c85808daa40000000000.xz\n1">>},{req_id,0},{cause,"Could not get a metadata"}]
[W] storage_1@192.168.3.54  2017-07-20 19:29:37.760044 +0300    1500568177  leo_storage_handler_object:replicate_fun/3  1406    [{cause,"Could not get a metadata"}]
[E] storage_1@192.168.3.54  2017-07-20 19:29:37.760725 +0300    1500568177  leo_storage_handler_object:put/4    423 [{from,storage},{method,delete},{key,<<"bodytest/de/68/50/de6850f34b4e62d52496f47186e20a4932c0c570c961c24262543031d73275707da1710e2475f1b0c1f9c3cbbb273ca8d3dc020100000000.xz\n4">>},{req_id,0},{cause,"Could not get a metadata"}]
[W] storage_1@192.168.3.54  2017-07-20 19:29:40.642703 +0300    1500568180  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-20 19:30:21.26429 +0300 1500568221  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-20 19:30:23.334339 +0300    1500568223  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-20 19:31:12.979853 +0300    1500568272  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-20 19:31:22.553110 +0300    1500568282  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54  2017-07-20 19:32:04.918653 +0300    1500568324  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[E] storage_1@192.168.3.54  2017-07-20 19:32:20.672358 +0300    1500568340  leo_mq_consumer:consume/4   526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,18,38,11,47,88,189,190,69,27,89,90,85,231,238,224,105,109,0,0,0,133,98,111,100,121,116,101,115,116,47,100,51,47,98,50,47,50,53,47,100,51,98,50,50,53,100,50,100,52,101,57,100,51,48,102,55,100,50,51,98,98,53,99,49,100,51,102,98,102,48,49,50,102,50,98,51,56,56,100,49,57,50,101,49,97,101,56,98,51,98,48,55,48,100,101,97,52,98,56,97,48,53,57,102,98,53,100,53,48,56,53,50,51,50,100,49,56,51,57,97,98,98,51,97,49,54,56,99,101,51,100,50,51,53,97,100,99,50,98,48,50,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]

Logs for storage_2:

[I] storage_2@192.168.3.55  2017-07-20 19:26:43.65430 +0300 1500568003  leo_storage_handler_del_directory:run/5 141 [{"msg: enqueued",<<"bodytest">>}]
[W] storage_2@192.168.3.55  2017-07-20 19:28:56.735586 +0300    1500568136  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz\n2">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_3,{get,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz\n2">>},30000]}}}]
[W] storage_2@192.168.3.55  2017-07-20 19:28:57.727733 +0300    1500568137  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55  2017-07-20 19:31:17.784960 +0300    1500568277  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55  2017-07-20 19:31:56.121550 +0300    1500568316  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/ef/98/4b/ef984b782734a9509889ccc4e318767bdcdeb8509e47612d87f175836c9be96d63e1a9624a05662379b356b486203430b99e640000000000.xz\n2">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_7,{get,<<"bodytest/ef/98/4b/ef984b782734a9509889ccc4e318767bdcdeb8509e47612d87f175836c9be96d63e1a9624a05662379b356b486203430b99e640000000000.xz\n2">>},30000]}}}]
[W] storage_2@192.168.3.55  2017-07-20 19:32:03.179147 +0300    1500568323  leo_storage_replicator:replicate_fun/2  243 [{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]

The problem: on storage_1 and storage_2 the state never progressed past "enqueuing". For example, here are the stats taken when bucket deletion was essentially done on storage_1 and storage_2 (their queues were empty, and "du" showed an active-size ratio of around 50%), while storage_0 still had ~150K messages in its queue:

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node                         | state            | timestamp
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55       | enqueuing        | 2017-07-20 19:58:13 +0300
storage_1@192.168.3.54       | enqueuing        | 2017-07-20 19:58:11 +0300
storage_0@192.168.3.53       | monitoring       | 2017-07-20 19:58:07 +0300

This is the current state (all queues are empty):

[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node                         | state            | timestamp
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55       | enqueuing        | 2017-07-20 20:34:54 +0300
storage_1@192.168.3.54       | enqueuing        | 2017-07-20 20:34:55 +0300
storage_0@192.168.3.53       | finished         | 2017-07-20 20:05:34 +0300

EDIT: I deleted another bucket and it's the same situation: storage_0 reaches "finished" while storage_1 and storage_2 stay stuck in "enqueuing". I packed the "del_dir" contents on storage_1 after that and uploaded them to https://www.dropbox.com/s/5qckg363tuuhcow/del_dir.tar.gz?dl=0 - will that be helpful? (Do you use some tool to dump the contents of these queues? I tried https://github.com/tgulacsi/leveldb-tools but it complains about a corrupted "zero header" in the MANIFEST file.)
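
In case another way to inspect them is useful, something like the quick escript sketch below might work to dump a copy of such a queue directory (no idea why the Go tool chokes on the MANIFEST, so this goes through eleveldb instead). This is only an idea, not an official LeoFS tool: it assumes the del_dir queue directory is a plain LevelDB database that eleveldb can open, that the values are term_to_binary-encoded (the <<131,...>> binaries in the errors above suggest that), and the -pa path to eleveldb is a placeholder. It should only be run against a copy of the directory while the storage node is stopped.

#!/usr/bin/env escript
%%! -pa /path/to/eleveldb/ebin
%% Rough sketch, not an official tool: dump a *copy* of a del_dir LevelDB queue.
%% Usage: escript dump_del_dir.escript /path/to/copy/of/queue/dir
main([Path]) ->
    {ok, DB} = eleveldb:open(Path, [{create_if_missing, false}]),
    Total = eleveldb:fold(DB,
                fun({Key, Value}, Acc) ->
                    %% Values look like Erlang external term format (<<131, ...>>),
                    %% so try to decode them; fall back to the raw binary otherwise.
                    Decoded = try binary_to_term(Value) catch _:_ -> Value end,
                    io:format("~p =>~n  ~p~n", [Key, Decoded]),
                    Acc + 1
                end, 0, []),
    io:format("total records: ~p~n", [Total]),
    eleveldb:close(DB).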

I've also confirmed that my assumption from the previous comment (https://github.com/leo-project/leofs/issues/725#issuecomment-316479835) is correct: the counters on storage_1 really did go wrong during that bucket deletion experiment, while they stayed correct on storage_0. I rolled back to the original state and ran compaction on storage_1 to verify the counters, and the values were the same. That means there was no error before the experiment; the discrepancy (2035 objects remaining according to the counters, while in reality 0 were left) appeared during the course of the bucket deletion.
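
For reference, the verification was basically just comparing the per-node counters before and after a full data compaction, along these lines (the usual leofs-adm commands, nothing special):

[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
[root@leo-m0 ~]# /usr/local/bin/leofs-adm compact-start storage_1@192.168.3.54 all
[root@leo-m0 ~]# /usr/local/bin/leofs-adm compact-status storage_1@192.168.3.54
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54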

Another note: it looks like delete-bucket-stats automatically wipes the stats some time after the state is "finished" for all nodes (not sure exactly when, an hour or two maybe?). That is fine, but it should probably be mentioned in the documentation for that command so that no one gets confused.