vstax opened this issue 7 years ago
WIP
@vstax Thanks for reporting in detail.
Timeouts on the gateway - not really a problem, as long as the operation goes on asynchronously
As I commented above, there are some problems.
Typos "mehtod,delete", "mehtod,head", "mehtod,fetch" in info log. Note that it's correct in error log :)
These are not typos (the head/fetch methods are used internally during a delete-bucket operation).
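To illustrate why fetch and head show up here, this is a rough, simplified sketch of the delete-bucket flow (not the actual LeoFS code; fetch_keys/1, head/1 and enqueue_delete/1 are hypothetical placeholders): the node fetches the keys under the bucket prefix, heads each object, and enqueues an asynchronous deletion for it.
-module(delete_bucket_sketch).
-export([delete_bucket/1]).
delete_bucket(Bucket) ->
    Keys = fetch_keys(Bucket),                  %% the "fetch" seen in the info log
    lists:foreach(
      fun(Key) ->
              case head(Key) of                 %% the "head" seen in the info log
                  {ok, _Metadata}    -> enqueue_delete(Key);
                  {error, not_found} -> ok      %% nothing left to delete
              end
      end, Keys),
    ok.
%% Hypothetical placeholders, for illustration only.
fetch_keys(_Bucket) -> [].
head(_Key) -> {error, not_found}.
enqueue_delete(_Key) -> ok.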
The fact that the delete operation did not complete (I picked ~4300 random object names and executed "whereis" for them; around 1750 of them were marked as "deleted" on all nodes and around 2500 weren't deleted on any of them).
As I answered in the question above, a restart can cause delete operations to stop in the middle.
The fact that the delete queues got stuck. How do I "unfreeze" them? Reboot the storage nodes? (Not a problem, I'm just keeping them like that for now in case there is something else to try.) There are no errors or anything right now (however, debug logs are not enabled); the state of all nodes is "running", but the delete queue is not being processed on storage_1 and storage_2.
It seems the delete queues are actually freed up; however, the number mq-stats displays is non-zero. This kind of inconsistency between the actual items in a queue and the number mq-stats displays can happen when a restart happens on leo_storage. We will get over this inconsistency problem somehow.
EDIT: filed this inconsistency problem on https://github.com/leo-project/leofs/issues/731
These lines in the log of storage_1
WIP.
I've made a diagram of the delete-bucket processing to clarify how to fix this issue; the diagram also covers #150.
@mocchira @yosukehara Thank you for analyzing.
This seems like a complicated issue; when looking at #150 and #701 I thought this was supposed to work as long as I don't create a bucket with the same name again, but apparently I had hoped for too much.
Too many retries going on in parallel behind the scenes
I can't do anything about the retries from leo_gateway to leo_storage, but I can try it with a different S3 client that will only issue the "delete bucket" operation once, without retries, and share whether it works any better. However, I've stumbled into something else regarding the queues, so I'll leave everything be for now.
These are not typos (the head/fetch methods are used internally during a delete-bucket operation).
No, not that one. The "mehtod" part is a typo. From here: https://github.com/leo-project/leo_object_storage/blob/develop/src/leo_object_storage_event.erl#L55
It seems the delete queues are actually freed up; however, the number mq-stats displays is non-zero. This kind of inconsistency between the actual items in a queue and the number mq-stats displays can happen when a restart happens on leo_storage. We will get over this inconsistency problem somehow.
I've restarted the managers; the queue size didn't change. I've restarted storage_1 - and the queue started to shrink. I got ~100% CPU usage on that node.
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | idling | 53556 | 1440 | 550 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | idling | 53556 | 1280 | 600 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | idling | 53556 | 800 | 750 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | running | 52980 | 480 | 850 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | running | 52800 | 480 | 850 | async deletion of objs
After a minute or two I got two errors in error.log of storage_1:
[W] storage_1@192.168.3.54 2017-05-10 13:17:43.733429 +0300 1494411463 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-10 13:17:44.732449 +0300 1494411464 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{cause,timeout}]
The CPU usage went to 0 and the queue "froze" again. But half a minute later I see the same 100% CPU usage again. Then it goes to 0 again. Then high again. All this time, the "number of msgs" in the queue doesn't change, but the "interval" number changes:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | idling | 52440 | 0 | 1450 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | idling | 52440 | 0 | 1700 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | idling | 52440 | 0 | 2200 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | idling | 52440 | 0 | 2900 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | running | 52440 | 0 | 2950 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | idling | 52440 | 0 | 3000 | async deletion of objs
After this, at some point the mq-stats command for this node started to work really slowly, taking 7-10 seconds to respond if executed during a period of 100% CPU usage. Nothing else in the error logs all this time. I see the same values (52400 / 0 / 3000 for the async deletion queue, 0 / 0 / 0 for all others), but it takes 10 seconds to respond. It's still fast during the 0% CPU usage periods, but since the node switches between these all the time now, it's pretty random.
I had debug logs enabled and saw lots of lines in the storage_1 debug log during this time. At first it was like this:
[D] storage_1@192.168.3.54 2017-05-10 13:17:13.74131 +0300 1494411433 leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/72/80/4f/72804f11dd276935ff759f28e4363761b6b2311ab33ffb969a41d33610c17a78e56971eeaa283bc5724ebff74c9797a27822010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54 2017-05-10 13:17:13.74432 +0300 1494411433 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/5b/e0/39/5be039360a4f0050e39c44eafde1ba847bd54593885605f22e06f4ee351e081cf75e5820483bbb11e6350d7cd2853542c495000000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-10 13:17:13.74707 +0300 1494411433 leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/5b/e0/39/5be039360a4f0050e39c44eafde1ba847bd54593885605f22e06f4ee351e081cf75e5820483bbb11e6350d7cd2853542c495000000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54 2017-05-10 13:17:13.74915 +0300 1494411433 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-10 13:17:13.75166 +0300 1494411433 leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54 2017-05-10 13:17:13.75400 +0300 1494411433 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{req_id,0}]
Then (note the gap in time! This - 13:25 - is a few minutes after the queue got "stuck" at the 52404 number. Could it be that something restarted and the queue got "unstuck" for a moment here?):
[D] storage_1@192.168.3.54 2017-05-10 13:18:02.921132 +0300 1494411482 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/11/f3/aa/11f3aafb5d279afbcbb0ad9ff76a24f806c5fa1bd64eb54691629363dd0771394f81e4eb216e489d5169395736e80d992078020000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-10 13:18:02.922308 +0300 1494411482 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/7a/e0/82/7ae0820cb42d3224fc9ac54b86e6f4c21ea567c81c91d65f524cd27e4777cb5fd3ff4d415ec8b2529c4da616f58b830ec844010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-10 13:27:18.952873 +0300 1494412038 null:null 0 Supervisor inet_gethost_native_sup started undefined at pid <0.10159.0>
[D] storage_1@192.168.3.54 2017-05-10 13:27:18.953587 +0300 1494412038 null:null 0 Supervisor kernel_safe_sup started inet_gethost_native:start_link() at pid <0.10158.0>
[D] storage_1@192.168.3.54 2017-05-10 13:27:52.990768 +0300 1494412072 leo_storage_handler_object:put/4 404 [{from,storage},{method,delete},{key,<<"bodytest/b1/28/81/b12881f64bd8bb9e7382dc33bad442cdc91b0372bcdbbf1dcbd9bacda421e9a2ee24d479dba47d346c0b89bc06e74dc62540010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-10 13:27:52.995161 +0300 1494412072 leo_storage_handler_object:put/4 404 [{from,storage},{method,delete},{key,<<"bodytest/8a/7a/71/8a7a715855dabae364d61c1c05a5872079a5ca82588e894fdc83c647530c50cb0c910981b2b4cf62ac9625983fee7661d840010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-10 13:27:52.998699 +0300 1494412072 leo_storage_handler_object:put/4 404 [{from,storage},{method,delete},{key,<<"bodytest/96/35/56/963556c85b8a97d1d6d6b3a5f33f649dcdd6c9d89729c7c517d364f8c498eb5e214c1af2d694299d50f504f42f31fd60a816010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-10 13:27:53.294 +0300 1494412073 leo_storage_handler_object:put/4 404 [{from,storage},{method,delete},{key,<<"bodytest/5a/3a/e0/5a3ae0c07352fdf97d3720e4afdec76ba4c3e2f60ede654f675ce68e9b5f749fd40e6bc1b3f5855c1c085402c0b3ece9a0ef000000000000.xz">>},{req_id,0}]
At some point (13:28:40, to be precise), the messages stopped appearing.
I've repeated the experiment with storage_2 and the situation at first was exactly the same, just with different numbers. However, unlike storage_1, there are other messages in the error log:
[E] storage_2@192.168.3.55 2017-05-10 13:30:04.679350 +0300 1494412204 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_2@192.168.3.55 2017-05-10 13:30:06.182672 +0300 1494412206 null:null 0 Error in process <0.23852.0> on node 'storage_2@192.168.3.55' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_2@192.168.3.55 2017-05-10 13:30:06.232671 +0300 1494412206 null:null 0 Error in process <0.23853.0> on node 'storage_2@192.168.3.55' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_2@192.168.3.55 2017-05-10 13:30:09.680281 +0300 1494412209 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_2@192.168.3.55 2017-05-10 13:30:14.681474 +0300 1494412214 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
The last line repeats endlessly. I can't execute "mq-stats" for this node anymore: it returns instantly without any results (like it does when a node isn't running). However, its status is indeed "running":
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55
[root@leo-m0 ~]# /usr/local/bin/leofs-adm status |grep storage_2
S | storage_2@192.168.3.55 | running | c1d863d0 | c1d863d0 | 2017-05-10 13:27:51 +0300
[root@leo-m0 ~]# /usr/local/bin/leofs-adm status storage_2@192.168.3.55
--------------------------------------+--------------------------------------
Item | Value
--------------------------------------+--------------------------------------
Config-1: basic
--------------------------------------+--------------------------------------
version | 1.3.4
number of vnodes | 168
object containers | - path:[/mnt/avs], # of containers:8
log directory | /var/log/leofs/leo_storage/erlang
log level | debug
--------------------------------------+--------------------------------------
Config-2: watchdog
--------------------------------------+--------------------------------------
[rex(rpc-proc)] |
check interval(s) | 10
threshold mem capacity | 33554432
--------------------------------------+--------------------------------------
[cpu] |
enabled/disabled | disabled
check interval(s) | 10
threshold cpu load avg | 5.0
threshold cpu util(%) | 90
--------------------------------------+--------------------------------------
[disk] |
enabled/disalbed | enabled
check interval(s) | 10
threshold disk use(%) | 85
threshold disk util(%) | 90
threshold rkb(kb) | 98304
threshold wkb(kb) | 98304
--------------------------------------+--------------------------------------
Config-3: message-queue
--------------------------------------+--------------------------------------
number of procs/mq | 8
number of batch-procs of msgs | max:3000, regular:1600
interval between batch-procs (ms) | max:3000, regular:500
--------------------------------------+--------------------------------------
Config-4: autonomic operation
--------------------------------------+--------------------------------------
[auto-compaction] |
enabled/disabled | disabled
warning active size ratio (%) | 70
threshold active size ratio (%) | 60
number of parallel procs | 1
exec interval | 3600
--------------------------------------+--------------------------------------
Config-5: data-compaction
--------------------------------------+--------------------------------------
limit of number of compaction procs | 4
number of batch-procs of objs | max:1500, regular:1000
interval between batch-procs (ms) | max:3000, regular:500
--------------------------------------+--------------------------------------
Status-1: RING hash
--------------------------------------+--------------------------------------
current ring hash | c1d863d0
previous ring hash | c1d863d0
--------------------------------------+--------------------------------------
Status-2: Erlang VM
--------------------------------------+--------------------------------------
vm version | 7.3
total mem usage | 158420648
system mem usage | 107431240
procs mem usage | 50978800
ets mem usage | 5926016
procs | 428/1048576
kernel_poll | true
thread_pool_size | 32
--------------------------------------+--------------------------------------
Status-3: Number of messages in MQ
--------------------------------------+--------------------------------------
replication messages | 0
vnode-sync messages | 0
rebalance messages | 0
--------------------------------------+--------------------------------------
To conclude: it seems to me that it's not that the queue numbers are fake and the queues are empty - there is stuff in the queues. Restarting makes them process again for a while, but pretty soon they get stuck once again. Plus, the situation seems different between storage_1 and storage_2.
@vstax thanks for the detailed info.
No, not that one. The "mehtod" part is a typo. From here: https://github.com/leo-project/leo_object_storage/blob/develop/src/leo_object_storage_event.erl#L55
Oops. Got it :)
I've restarted the managers; the queue size didn't change. I've restarted storage_1 - and the queue started to shrink. I got ~100% CPU usage on that node.
The queue size didn't change and displays an invalid number that differs from the actual one, caused by https://github.com/leo-project/leofs/issues/731.
The CPU usage went to 0 and the queue "froze" again. But half a minute later I see the same 100% CPU usage again. Then it goes to 0 again. Then high again. All this time, the "number of msgs" in the queue doesn't change, but the "interval" number changes:
It seems the half-a-minute delay is caused by https://github.com/leo-project/leofs/issues/728. The CPU usage fluctuating between 100% and 0% back and forth repeatedly might imply there are some items that can't be consumed and keep existing for some reason. I will vet it in detail.
EDIT: found the fault here: https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_storage/src/leo_storage_mq.erl#L342-L363
After all, there are some items that can't be consumed in QUEUE_ID_ASYNC_DELETION. They keep existing if the target object was already deleted.
After this, at some point the mq-stats command for this node started to work really slowly, taking 7-10 seconds to respond if executed during a period of 100% CPU usage.
Getting a slow response from any command through leofs-adm is one of the symptoms of an overloaded Erlang runtime. If you can reproduce it, would you mind executing https://github.com/leo-project/leofs_doctor against the overloaded node? The output would make it easier for us to debug in detail.
To conclude: it seems to me that it's not that the queue numbers are fake and the queues are empty - there is stuff in the queues. Restarting makes them process again for a while, but pretty soon they get stuck once again. Plus, the situation seems different between storage_1 and storage_2.
It seems the number is fake and there is also stuff in the queues.
@vstax we will ask you to do the same test after we finish all TODOs described above.
@mocchira
Fix QUEUE_ID_ASYNC_DELETION to consume items properly even if the item was already deleted.
I've recognized that leo_storage_mq has a bug here when receiving {error, not_found}.
I'll send a PR and its fix will be included in v1.3.5.
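The fix direction, as a simplified sketch (not the actual leo_storage_mq code; delete_object/1 is a hypothetical placeholder): when the consumer gets {error, not_found} because the object is already gone, treat the message as consumed so it is removed from the queue instead of being kept and retried forever.
-module(async_deletion_consumer_sketch).
-export([handle_message/1]).
%% Returning ok means the message counts as consumed and is removed from
%% leo_async_deletion_queue; returning an error keeps it around for retry.
handle_message(Key) ->
    case delete_object(Key) of
        ok ->
            ok;                 %% deleted: consume the message
        {error, not_found} ->
            ok;                 %% already deleted: consume it as well (the fix)
        {error, Reason} ->
            {error, Reason}     %% real failure: keep the message for retry
    end.
%% Hypothetical placeholder, for illustration only.
delete_object(_Key) -> {error, not_found}.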
@mocchira
It seems the half-a-minute delay is caused by #728.
Well, it started to happen before I stopped storage_2, so all the nodes were running; also, 30 seconds was just a very rough estimate from looking at top; it could have been something in the 10-20 second range as well, as I wasn't paying strict attention (it wasn't anything less than 10 seconds, for sure). I might be misunderstanding #728, though.
After all, there are some items that can't be consumed in QUEUE_ID_ASYNC_DELETION. They keep existing if the target object was already deleted.
Interesting find! I did tests with double deletes and deletes of non-existent objects before, but that was on 1.3.2.1, before the changes to the queue mechanism.
If you can reproduce it, would you mind executing https://github.com/leo-project/leofs_doctor against the overloaded node?
I will. That load (100-0-100-...) on storage_1 ended around 13:28, when the errors and messages in the debug log stopped appearing. The queue isn't being consumed, but the node itself is fine.
storage_2 is still in bad shape: it doesn't respond to the mq-stats command and spits out
[E] storage_2@192.168.3.55 2017-05-11 10:43:35.310528 +0300 1494488615 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
every 5 seconds. Also, the errors that I saw on storage_1 never appeared in the storage_2 log.
However, when I restart nodes I'll probably see something else.
A question: you've found a case where messages in the queue can stop being processed. But after I restarted the nodes, I clearly had > 1000 messages disappear from the queue on each node. Besides lots of cases of double messages like
info.20170510.13.1:[D] storage_2@192.168.3.55 2017-05-10 13:27:50.400569 +0300 1494412070 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/6e/15/f6/6e15f6d4febdf823f6f8af7e1f0947ee05a5a905875c3748d12f472831421ce00eefc659d884cc998dadd2bc3d4fc1fd30cc000000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D] storage_2@192.168.3.55 2017-05-10 13:27:50.400833 +0300 1494412070 leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/6e/15/f6/6e15f6d4febdf823f6f8af7e1f0947ee05a5a905875c3748d12f472831421ce00eefc659d884cc998dadd2bc3d4fc1fd30cc000000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
there were quite a few successful deletes like
info.20170510.13.1:[D] storage_2@192.168.3.55 2017-05-10 13:28:40.717438 +0300 1494412120 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/fc/e3/a3/fce3a3f19655893ef1113627be71afe416987e6770337940e7d533662d7821fa8e74463d4c41ca1fdcd526c6ffb3a14e00ea090000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D] storage_2@192.168.3.55 2017-05-10 13:28:40.719168 +0300 1494412120 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/f5/6b/01/f56b019f9b473ccb07efbf5091d3ce257b1dcfce862669b2684be231c4f028ce92e8b4fc2dd1ac58248210ac99744ea60018000000000000.xz">>},{req_id,0}]
info.20170510.13.1:[D] storage_2@192.168.3.55 2017-05-10 13:28:40.723881 +0300 1494412120 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/c4/3c/46/c43c46dd688723e79858c0af76107cc370ad7aebbac60c604de7a8bee450b9b78f3c8222272aefd3bc66579cf3fb12ca10c4000000000000.xz">>},{req_id,0}]
on both storage_1 and storage_2. So somehow a node restart makes part of the queue process, even though it wasn't processing while the node was running.
I could upload work/queue/4 (around 60 MB for both nodes) somewhere, then restart the nodes and see if this (successful processing of part of the queue) happens again. Would the queue contents help you in debugging?
@yosukehara https://github.com/leo-project/leofs/issues/732
@vstax
Well, it started to happen before I stopped storage_2, so all the nodes were running; also, 30 seconds was just a very rough estimate from looking at top; it could have been something in the 10-20 second range as well, as I wasn't paying strict attention (it wasn't anything less than 10 seconds, for sure). I might be misunderstanding #728, though.
Since there are multiple consumer processes/files (IIRC, 4 or 8 by default) per queue (in this case ASYNC_DELETION), the period can vary below 30 seconds.
Interesting find! I did tests with double deletes and deletes of non-existent objects before, but that was on 1.3.2.1, before the changes to the queue mechanism.
Makes sense!
I will. That load (100-0-100-...) on storage_1 ended around 13:28, when the errors and messages in the debug log stopped appearing. The queue isn't being consumed, but the node itself is fine.
Thanks.
A question: you've found a case where messages in the queue can stop being processed. But after I restarted the nodes, I clearly had > 1000 messages disappear from the queue on each node. Besides lots of cases of double messages like ... on both storage_1 and storage_2. So somehow a node restart makes part of the queue process, even though it wasn't processing while the node was running.
It seems there is still something I haven't noticed. Your queue files might help me debug further.
I could upload work/queue/4 (around 60 MB for both nodes) somewhere, then restart the nodes and see if this (successful processing of part of the queue) happens again. Would the queue contents help you in debugging?
Yes! Please share via anything you like. (Off topic: previously you shared some stuff via https://cloud.mail.ru/ - that was amazingly fast, the fastest one I've ever used :)
@mocchira I've packed the queues (the nodes were running but there was no processing) and uploaded them to https://www.dropbox.com/s/78uitcmohhuq3mq/storage-queues.tar.gz?dl=0 (it's not such a big file so dropbox should work, I think?).
Now, after I restarted storage_1... a miracle! The queue was fully consumed, without any errors in the logs or anything. The debug log was as usual:
[D] storage_1@192.168.3.54 2017-05-11 22:44:09.681558 +0300 1494531849 leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/76/74/02/767402b5880aa54206793cb197e3fccf4bacf4e516444cd6c88eeea8c9d25af461bb30bcb513041ac033c8db12e7e67e4c09010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54 2017-05-11 22:44:09.681905 +0300 1494531849 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/2a/2e/88/2a2e88feb2ed55c266961a2fcfd80b9f5f02d48fd757e79e3ac9268d1c45139334492579bc98db2e8d53338097239f4e28fe010000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-11 22:44:09.682166 +0300 1494531849 leo_storage_handler_object:delete/4 596 [{from,leo_mq},{method,del},{key,<<"bodytest/2a/2e/88/2a2e88feb2ed55c266961a2fcfd80b9f5f02d48fd757e79e3ac9268d1c45139334492579bc98db2e8d53338097239f4e28fe010000000000.xz">>},{req_id,0},{cause,"Could not get redundancy"}]
[D] storage_1@192.168.3.54 2017-05-11 22:44:09.682426 +0300 1494531849 leo_storage_handler_object:delete/3 582 [{from,leo_mq},{method,del},{key,<<"bodytest/3d/e8/88/3de888009faa04a6860550b94d4bb2f19fe01958ad28229a38bf4eeafd399d5a569d4130b008b48ab6d51889add0aa2e2570010000000000.xz">>},{req_id,0}]
[..skipped..]
[D] storage_1@192.168.3.54 2017-05-11 22:48:41.454128 +0300 1494532121 leo_storage_handler_object:put/4 404 [{from,storage},{method,delete},{key,<<"bodytest/58/15/b6/5815b6600a1d5aa3c46b00dffa3e0a9da7c50f7c75dc4058bbc503f6aca8c74396ce93889a7864ad14207c98445b914da443000000000000.xz">>},{req_id,0}]
[D] storage_1@192.168.3.54 2017-05-11 22:48:41.455928 +0300 1494532121 leo_storage_handler_object:put/4 404 [{from,storage},{method,delete},{key,<<"bodytest/f0/d8/4d/f0d84d4f4b6cb071fb88f3107a00d87be6a849dc304ec7a738c9d7ac4f7e97f7e5ff30a6beff3536fe6267f8af26e57b3ce9000000000000.xz">>},{req_id,0}]
Error log - nothing to show. Queue state during these 4 minutes (removed some extra lines):
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_
leo_async_deletion_queue | running | 51355 | 1600 | 500 | async deletion of objs
leo_async_deletion_queue | running | 46380 | 320 | 900 | async deletion of objs
leo_async_deletion_queue | running | 46200 | 0 | 1000 | async deletion of objs
leo_async_deletion_queue | idling | 37377 | 0 | 1400 | async deletion of objs
leo_async_deletion_queue | idling | 23740 | 0 | 1550 | async deletion of objs
leo_async_deletion_queue | idling | 13480 | 0 | 1750 | async deletion of objs
leo_async_deletion_queue | idling | 1814 | 0 | 2000 | async deletion of objs
leo_async_deletion_queue | idling | 0 | 0 | 2050 | async deletion of objs
I've restarted storage_2 as well. I got no '-decrease/3-lc$^0/1-0-' errors this time! Like, at all. At first the queue was processing, then eventually it got stuck:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_
leo_async_deletion_queue | running | 135142 | 1440 | 550 | async deletion of objs
leo_async_deletion_queue | running | 131602 | 800 | 750 | async deletion of objs
leo_async_deletion_queue | running | 129353 | 0 | 1000 | async deletion of objs
leo_async_deletion_queue | idling | 129353 | 0 | 1700 | async deletion of objs
leo_async_deletion_queue | idling | 129353 | 0 | 1700 | async deletion of objs
leo_async_deletion_queue | running | 129353 | 0 | 2200 | async deletion of objs
leo_async_deletion_queue | idling | 129353 | 0 | 3000 | async deletion of objs
I got just this in error log around the time it froze:
[W] storage_2@192.168.3.55 2017-05-11 22:48:21.736404 +0300 1494532101 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/b3/64/07/b36407b89a66f64c10e6298af4ba894cd6c2dc501dfd1b65f4567b182777f58c6485e8dc435e19af5b08960aceb946ed289e7d0000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-05-11 22:48:22.733389 +0300 1494532102 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/b3/64/07/b36407b89a66f64c10e6298af4ba894cd6c2dc501dfd1b65f4567b182777f58c6485e8dc435e19af5b08960aceb946ed289e7d0000000000.xz">>},{cause,timeout}]
Now it spends 10-20 seconds in the 100% CPU state, then switches back to 0, then 100% again, and so on, just like storage_1 did during the last experiment. And like last time, "mq-stats" makes me wait if executed during a 100% CPU usage period. The whole situation seems quite random...
EDIT: 1 hour after the experiment, everything is still the same: 100% CPU usage alternating with 0% CPU usage. Nothing in the error logs (besides some disk watchdog messages, as I'm over 80% disk usage on the volume with the AVS files). This is unlike the last experiment (https://github.com/leo-project/leofs/issues/725#issuecomment-300446740), when storage_1 wrote something about "Supervisor started" in its logs at some point and stopped consuming CPU soon after.
leofs_doctor logs: https://pastebin.com/y9RgXtEK https://pastebin.com/rsxLCwDN and https://pastebin.com/PMFeRxFH. The first one, I think, was executed entirely during a 100% CPU usage period. The second one started during 100% CPU usage, and its last 3 seconds or so were during a near-0% CPU usage period. The third one was run without that "expected_svt" option (I don't know the difference, so I'm not sure which one you need); it started during 100% CPU usage, and its last 4-5 seconds were during a near-0% usage period.
EDIT: 21 hours after the experiment, the 100% CPU usage alternating with 0% CPU on storage_2 has stopped. Nothing related in the logs, really; not in error.log nor in erlang.log. No mention of restarts or anything; just, according to sar, at 19:40 on May 12 the load was there, and from 19:50 onward it wasn't. The leo_async_deletion_queue queue is unchanged, 129353 / 0 / 3000 messages, just like it was at the moment it stopped processing. Just in case, leofs_doctor logs from the current moment (note that there might be a very light load on this node, plus the disk space watchdog triggered): https://pastebin.com/iUMn6uLX
@vstax Still WIP, although I'd like to share what I've got at the moment.
Since the second one can be mitigated by reducing the number of consumers of leo_mq, I will add this workaround to https://github.com/leo-project/leofs/issues/725#issuecomment-300706776.
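As a sketch of that workaround (the key name is the one quoted later in this thread; the default noted above is 4 or 8), the consumer count can be lowered in leo_storage.conf, for example:
## number of consumer processes per queue (lower it to reduce the load)
mq.num_of_mq_procs = 4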
Design considerations for https://github.com/leo-project/leofs/issues/725#issuecomment-300412054.
I've repeated - or, rather, tried to complete - this experiment by re-adding the "bodytest" bucket and removing it again on the latest dev version. I don't expect it to work perfectly, but I wanted to check how the issues that were already fixed have helped. Debug logs are disabled to make sure the leo_logger problems won't affect anything, and mq.num_of_mq_procs = 4 is set.
This time, I made sure to abort the s3cmd rb s3://bodytest command after it sent the "remove bucket" request once, so that it didn't try to repeat the request or anything. It's exactly the same system, but I estimate that over 60% of the original number of objects (1M) were still present in the "bodytest" bucket, so there was a lot of stuff to remove.
gateway logs:
[W] gateway_0@192.168.3.52 2017-05-26 22:00:26.733769 +0300 1495825226 leo_gateway_s3_api:delete_bucket_2/3 1798 [{cause,timeout}]
[W] gateway_0@192.168.3.52 2017-05-26 22:00:31.734812 +0300 1495825231 leo_gateway_s3_api:delete_bucket_2/3 1798 [{cause,timeout}]
storage_0 info log:
[I] storage_0@192.168.3.53 2017-05-26 22:00:27.162670 +0300 1495825227 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,5354}]
[I] storage_0@192.168.3.53 2017-05-26 22:00:34.240077 +0300 1495825234 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7504}]
[I] storage_0@192.168.3.53 2017-05-26 22:00:34.324375 +0300 1495825234 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7162}]
[I] storage_0@192.168.3.53 2017-05-26 22:00:42.957679 +0300 1495825242 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8633}]
[I] storage_0@192.168.3.53 2017-05-26 22:00:43.469667 +0300 1495825243 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,9229}]
[I] storage_0@192.168.3.53 2017-05-26 22:00:50.241744 +0300 1495825250 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7284}]
[I] storage_0@192.168.3.53 2017-05-26 22:00:51.136573 +0300 1495825251 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7667}]
[I] storage_0@192.168.3.53 2017-05-26 22:00:59.20997 +0300 1495825259 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7884}]
[I] storage_0@192.168.3.53 2017-05-26 22:00:59.21352 +0300 1495825259 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz\n1">>},{processing_time,5700}]
[I] storage_0@192.168.3.53 2017-05-26 22:01:20.242104 +0300 1495825280 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_0@192.168.3.53 2017-05-26 22:01:21.264304 +0300 1495825281 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{processing_time,26450}]
[I] storage_0@192.168.3.53 2017-05-26 22:01:38.156285 +0300 1495825298 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,39136}]
[I] storage_0@192.168.3.53 2017-05-26 22:01:38.156745 +0300 1495825298 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/00/e4/ce/00e4ce45fd1c5de4122221d44289f4ace93dc0d046fead4f4d3549b7b756af04621a4ab684a1c7db7b8d5f017555484d90d7000000000000.xz">>},{processing_time,12339}]
[I] storage_0@192.168.3.53 2017-05-26 22:01:38.157114 +0300 1495825298 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/59/49/0f/59490fdb17b17ce75b31909675e7262db9b01a84f04792cbe2f7858d114c48efc5d2f1cf98190dcf9af96a12679cbdccf8e89a0000000000.xz\n2">>},{processing_time,10976}]
[I] storage_0@192.168.3.53 2017-05-26 22:01:38.157429 +0300 1495825298 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/63/29/b9/6329b983cb8e8ea323181e34d2d3b64403ff79671f6850a268406daab8fcf772009d994e804b5fc9611f0773a96d6cde94e3020100000000.xz\n3">>},{processing_time,10018}]
[I] storage_0@192.168.3.53 2017-05-26 22:01:38.158711 +0300 1495825298 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{processing_time,16894}]
Error log:
[E] storage_0@192.168.3.53 2017-05-26 21:58:54.581809 +0300 1495825134 leo_backend_db_eleveldb:first_n/2 282 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23172>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53 2017-05-26 21:58:54.582525 +0300 1495825134 leo_mq_server:handle_call/3 287 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23172>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53 2017-05-26 21:58:54.924670 +0300 1495825134 leo_backend_db_eleveldb:first_n/2 282 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23850>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53 2017-05-26 21:58:54.927313 +0300 1495825134 leo_mq_server:handle_call/3 287 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.23850>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53 2017-05-26 21:58:55.42756 +0300 1495825135 leo_backend_db_eleveldb:first_n/2 282 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.24459>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[E] storage_0@192.168.3.53 2017-05-26 21:58:55.43297 +0300 1495825135 leo_mq_server:handle_call/3 287 {badarg,[{eleveldb,async_iterator,[#Ref<0.0.6029313.24459>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,373}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,273}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,335}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
[W] storage_0@192.168.3.53 2017-05-26 22:01:24.864263 +0300 1495825284 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-05-26 22:01:25.816594 +0300 1495825285 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/18/3e/75/183e754dac629b6fef37d66e7f76d2dd361b88683b7dc20ff4a798864659c49fb7a09161fc9c6a7c7a26ee0b1912391e8466000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-05-26 22:02:08.375387 +0300 1495825328 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-05-26 22:02:09.382327 +0300 1495825329 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]
storage_1 info log:
[I] storage_1@192.168.3.54 2017-05-26 22:00:26.993464 +0300 1495825226 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,5221}]
[I] storage_1@192.168.3.54 2017-05-26 22:00:34.450344 +0300 1495825234 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,7456}]
[I] storage_1@192.168.3.54 2017-05-26 22:00:34.899198 +0300 1495825234 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8167}]
[I] storage_1@192.168.3.54 2017-05-26 22:00:34.900451 +0300 1495825234 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/9f/c1/62/9fc1621aacf06357ccd85ce7e43f4dc17eafe60bce3cb8cf864487f61ea4667ac7eded91411ee9e1fc0b7180119f29670400000000000000.xz">>},{processing_time,5424}]
[I] storage_1@192.168.3.54 2017-05-26 22:00:46.351992 +0300 1495825246 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,11453}]
[I] storage_1@192.168.3.54 2017-05-26 22:00:46.352702 +0300 1495825246 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/92/e4/fc/92e4fcc551dc03361f41f59e37eca7161d4dfb23fff803200bf1b990a2e81e0ed909d74c2613e90259c48c4a385702b1dc51010000000000.xz">>},{processing_time,8778}]
[I] storage_1@192.168.3.54 2017-05-26 22:01:00.258646 +0300 1495825260 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,25808}]
[I] storage_1@192.168.3.54 2017-05-26 22:01:00.259186 +0300 1495825260 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,15039}]
[I] storage_1@192.168.3.54 2017-05-26 22:01:21.291575 +0300 1495825281 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,34940}]
[I] storage_1@192.168.3.54 2017-05-26 22:01:21.292084 +0300 1495825281 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,29112}]
[I] storage_1@192.168.3.54 2017-05-26 22:01:21.292789 +0300 1495825281 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/2b/5c/f3/2b5cf31eaffd8e884937240a026abec1c6a48f66b042c08cca9b80250e9a58dd2216871bdb0dddbbaae4d6e7eb0896538498000000000000.xz">>},{processing_time,5069}]
[I] storage_1@192.168.3.54 2017-05-26 22:01:21.294835 +0300 1495825281 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,21036}]
[I] storage_1@192.168.3.54 2017-05-26 22:01:30.189080 +0300 1495825290 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,29930}]
[I] storage_1@192.168.3.54 2017-05-26 22:01:30.189895 +0300 1495825290 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/01/9c/aa/019caaf22c84f6e77c5f5597810faa55ef57c71a38a133cbe9d38c631e40d11434ff449f989d77d408571af4c06e11aeb475000000000000.xz">>},{processing_time,28628}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:00.189674 +0300 1495825320 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:08.370447 +0300 1495825328 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{processing_time,30001}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.574818 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/0f/df/87/0fdf870a9a237805c0282ba71c737966f2630124921b5c8709b6f470754b3e187eebdd30e80d404ccb700be646bc3c03bfa6020100000000.xz\n1">>},{processing_time,29259}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.575370 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/12/ed/21/12ed21380e8cb085deb10aa161feb131b581553ab1ead52e24ed88619b2ec7709d59b9e69b3d7bb0febc5930048bb1a0d8a2020100000000.xz\n2">>},{processing_time,17599}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.575744 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/29/95/03/2995035be6f7fbe86d6f4f76eba845bfc50338bd40535d9947e473779538a5ba6de5534672c3b5146fb5768b9e905a4318fa7b0000000000.xz\n1">>},{processing_time,40674}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.576122 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/4d/93/47/4d934795bb3006b7e35d99ca7485bfaa1b9cc1b8878fe11f260e0ffedb8e1d97f66221bfbb048ac5ce8298ae93e922be46e8020100000000.xz\n1">>},{processing_time,24915}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.576518 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/59/89/e7/5989e7825beeb82933706f559ab737cfe0eb88156471a29e0c6f6ae04c00576f0b0c5462f6714d2387a1856f99cdf3fc89ab040100000000.xz\n1">>},{processing_time,9456}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.576883 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/2f/99/d1/2f99d1ffa377ceda4341d1c0a85647f17fade7e8e375eafb1b8e1a17bd794fa9683a0546ed594ce2a18944c3e817498f00821a0100000000.xz\n1">>},{processing_time,38070}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.578804 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/63/7f/75/637f7568ee27aa13f0ccabc34d68faac4500535cb4c3f34b4b5d4349d80a6a96de46bcc04522f76debd1060647083a4850955c0000000000.xz\n1">>},{processing_time,8954}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.579637 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/07/7c/11/077c11796ee67c7a15027cf21b749ffbfd244c06980bf98a945acdd92b3404feb56609b8a0b177cd205d309e0d8310a6b0df5b0000000000.xz\n1">>},{processing_time,37267}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.580231 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/49/c8/3f/49c83ff8341d50259f4138707688613860802327ebb2e75d9019bda193c8ab82a3b66b4f7e92d4d9dc0f3d39c082010e5694370100000000.xz\n2">>},{processing_time,35610}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.581187 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/1a/9b/07/1a9b073aafa182620e4bb145507a097320bb4097ebec0dfddee3936a96e0cb83fc10ed7a7bfcd3f20456a3cdf0a373be3026700000000000.xz\n1">>},{processing_time,35500}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.581593 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/22/8c/64/228c648d769e51472f79670cbb804f9bee23d8d9ea6612ee4a21ea11b901ef60732e3657a2e4fb68ce26b745525ada7ab0b5790000000000.xz\n1">>},{processing_time,34513}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.582178 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/42/29/fb/4229fb92dac335eb214e0eef2d2bd59d25685ae9ace816f44eb4d37147921ad66b5be7ccc97938aacfdfc64c1e721f1ed2a1020100000000.xz\n2">>},{processing_time,20930}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.582963 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/14/d7/9b/14d79b2fd3b666cf511d1c4e55dddf2b44f998312dc0103cd26dd7227dba14ce0ddfe0e8e87a64d30e49f788081cd75a39bc000000000000.xz">>},{processing_time,14189}]
[I] storage_1@192.168.3.54 2017-05-26 22:02:23.583762 +0300 1495825343 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/39/b5/e3/39b5e371ed1f857e881725e5b491810862a291268efb395e948da6d83934bc19d3ef8fc7c5a9584bcd18bd174c3e080dfba2020100000000.xz\n1">>},{processing_time,13991}]
Error log:
[W] storage_1@192.168.3.54 2017-05-26 22:01:15.220528 +0300 1495825275 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-26 22:01:16.221833 +0300 1495825276 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{cause,timeout}]
storage_2 info log:
[I] storage_2@192.168.3.55 2017-05-26 22:00:34.873903 +0300 1495825234 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8110}]
[I] storage_2@192.168.3.55 2017-05-26 22:00:35.352063 +0300 1495825235 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,8615}]
[I] storage_2@192.168.3.55 2017-05-26 22:00:35.359634 +0300 1495825235 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/9f/c1/62/9fc1621aacf06357ccd85ce7e43f4dc17eafe60bce3cb8cf864487f61ea4667ac7eded91411ee9e1fc0b7180119f29670400000000000000.xz">>},{processing_time,5868}]
[I] storage_2@192.168.3.55 2017-05-26 22:00:46.957075 +0300 1495825246 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,11605}]
[I] storage_2@192.168.3.55 2017-05-26 22:00:46.958526 +0300 1495825246 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/92/e4/fc/92e4fcc551dc03361f41f59e37eca7161d4dfb23fff803200bf1b990a2e81e0ed909d74c2613e90259c48c4a385702b1dc51010000000000.xz">>},{processing_time,9393}]
[I] storage_2@192.168.3.55 2017-05-26 22:00:46.958917 +0300 1495825246 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz\n2">>},{processing_time,7222}]
[I] storage_2@192.168.3.55 2017-05-26 22:01:04.874732 +0300 1495825264 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_2@192.168.3.55 2017-05-26 22:01:05.757004 +0300 1495825265 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82disk_usage/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,20530}]
[I] storage_2@192.168.3.55 2017-05-26 22:01:18.153498 +0300 1495825278 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,31196}]
[I] storage_2@192.168.3.55 2017-05-26 22:01:18.154052 +0300 1495825278 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,25974}]
[I] storage_2@192.168.3.55 2017-05-26 22:01:18.159729 +0300 1495825278 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,delete},{key,<<"bodytest/15/2f/82/152f825e8fe6bfe5546cd463005871b5aa45abdcfd37b3457d58fbf9c0da8a1f993665cb6a68db8d624cd8a47f0d5e078656000000000000.xz">>},{processing_time,12403}]
Error log - empty.
Queue states: for storage_0, within 30 seconds of the delete bucket operation, the queue had reached this number:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53|grep leo_async_deletion
leo_async_deletion_queue | idling | 80439 | 1600 | 500 | async deletion of objs
which was dropping pretty fast:
leo_async_deletion_queue | running | 25950 | 3000 | 0 | async deletion of objs
and reached 0 about 2-3 minutes after the start of the operation:
leo_async_deletion_queue | idling | 0 | 1600 | 500 | async deletion of objs
For storage_1, the queue likewise reached this number within 30 seconds, but its status was "suspending":
leo_async_deletion_queue | suspending | 171957 | 0 | 1700 | async deletion of objs
It was "suspending" the whole time during the experiment. It barely dropped and stays at this number even now:
leo_async_deletion_queue | suspending | 170963 | 0 | 1500 | async deletion of objs
For storage_2, the queue reached this number within 30 seconds of the start:
leo_async_deletion_queue | idling | 34734 | 0 | 2400 | async deletion of objs
It was dropping slowly (quite unlike storage_0) and had reached this number when it stopped reducing:
leo_async_deletion_queue | idling | 29448 | 0 | 3000 | async deletion of objs
At this point the system is stable: nothing is going on, there is no load, most objects from "bodytest" still aren't removed, and the queues for storage_1 and storage_2 are stalled at the numbers shown above. There is nothing else in the log files.
I stop storage_2 and make a backup of its queues (just in case). I start it; the number in the queue is the same at first. 20-30 seconds after the node is started, it starts to reduce. There are new messages in the error log:
[W] storage_2@192.168.3.55 2017-05-26 22:36:53.397898 +0300 1495827413 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-05-26 22:36:54.377776 +0300 1495827414 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]
The number in the queue eventually reduces to 0.
I stop storage_1, make a backup of its queues, and start it again. Right after the start, the queue begins processing:
leo_async_deletion_queue | running | 168948 | 1600 | 500 | async deletion of objs
Then the CPU load on the node goes very high, the "mq-stats" command starts to hang, and I see this in the error log:
[W] storage_1@192.168.3.54 2017-05-26 22:49:59.603367 +0300 1495828199 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/3d/8c/78/3d8c78839ebc79cba43a1b57f138e1e3d4c422269f8aa522a55242b49cdc2ffca756d4d799f58dc0b6009f0f2e7a4638482a680000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-26 22:50:00.600085 +0300 1495828200 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/3d/8c/78/3d8c78839ebc79cba43a1b57f138e1e3d4c422269f8aa522a55242b49cdc2ffca756d4d799f58dc0b6009f0f2e7a4638482a680000000000.xz">>},{cause,timeout}]
[E] storage_1@192.168.3.54 2017-05-26 22:50:52.705543 +0300 1495828252 leo_mq_server:handle_call/3 287 {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:50:53.757233 +0300 1495828253 null:null 0 gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:50:54.262824 +0300 1495828254 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:50:54.264280 +0300 1495828254 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.316.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:51:03.461439 +0300 1495828263 null:null 0 gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:51:03.461919 +0300 1495828263 null:null 0 gen_fsm leo_async_deletion_queue_consumer_4_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:51:03.462275 +0300 1495828263 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:51:03.462926 +0300 1495828263 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.318.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:51:03.481700 +0300 1495828263 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:51:03.482332 +0300 1495828263 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.312.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:51:24.823088 +0300 1495828284 null:null 0 gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:51:24.823534 +0300 1495828284 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:51:24.825905 +0300 1495828284 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.20880.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.85988 +0300 1495828294 null:null 0 gen_fsm leo_async_deletion_queue_consumer_4_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.87305 +0300 1495828294 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.95578 +0300 1495828294 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.27909.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.522235 +0300 1495828294 null:null 0 gen_fsm leo_async_deletion_queue_consumer_1_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.525223 +0300 1495828294 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.539198 +0300 1495828294 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.27892.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.539694 +0300 1495828294 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.27892.1> exit with reason reached_max_restart_intensity in context shutdown
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.541076 +0300 1495828294 null:null 0 Supervisor leo_redundant_manager_sup had child undefined started with leo_mq_sup:start_link() at <0.220.0> exit with reason shutdown in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:51:34.976730 +0300 1495828294 null:null 0 Error in process <0.20995.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:35.122748 +0300 1495828295 null:null 0 Error in process <0.20996.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:35.140676 +0300 1495828295 null:null 0 Error in process <0.20997.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:35.211716 +0300 1495828295 null:null 0 Error in process <0.20998.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:35.367975 +0300 1495828295 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:36.17706 +0300 1495828296 null:null 0 Error in process <0.21002.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:36.68751 +0300 1495828296 null:null 0 Error in process <0.21005.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:36.273259 +0300 1495828296 null:null 0 Error in process <0.21011.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:37.246142 +0300 1495828297 null:null 0 Error in process <0.21018.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:37.625651 +0300 1495828297 null:null 0 Error in process <0.21022.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:38.192580 +0300 1495828298 null:null 0 Error in process <0.21024.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:38.461708 +0300 1495828298 null:null 0 Error in process <0.21025.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:38.462431 +0300 1495828298 null:null 0 Error in process <0.21026.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:39.324727 +0300 1495828299 null:null 0 Error in process <0.21033.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:39.851241 +0300 1495828299 null:null 0 Error in process <0.21043.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:40.5627 +0300 1495828300 null:null 0 Error in process <0.21049.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:40.369284 +0300 1495828300 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:40.523795 +0300 1495828300 null:null 0 Error in process <0.21050.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:41.56663 +0300 1495828301 null:null 0 Error in process <0.21052.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:41.317741 +0300 1495828301 null:null 0 Error in process <0.21057.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:42.785978 +0300 1495828302 null:null 0 Error in process <0.21069.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:42.812650 +0300 1495828302 null:null 0 Error in process <0.21070.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:42.984686 +0300 1495828302 null:null 0 Error in process <0.21071.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:43.815766 +0300 1495828303 null:null 0 Error in process <0.21078.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:44.817129 +0300 1495828304 null:null 0 Error in process <0.21085.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:45.370117 +0300 1495828305 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:46.199487 +0300 1495828306 null:null 0 Error in process <0.21097.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:46.502452 +0300 1495828306 null:null 0 Error in process <0.21099.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:47.770769 +0300 1495828307 null:null 0 Error in process <0.21103.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:47.987768 +0300 1495828307 null:null 0 Error in process <0.21108.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:48.516769 +0300 1495828308 null:null 0 Error in process <0.21112.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:48.524799 +0300 1495828308 null:null 0 Error in process <0.21113.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:48.813618 +0300 1495828308 null:null 0 Error in process <0.21114.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:50.370898 +0300 1495828310 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:50.872671 +0300 1495828310 null:null 0 Error in process <0.21136.2> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:51:55.372095 +0300 1495828315 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:00.373178 +0300 1495828320 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:05.373913 +0300 1495828325 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:10.375174 +0300 1495828330 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:15.375872 +0300 1495828335 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:20.376915 +0300 1495828340 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:25.377929 +0300 1495828345 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:30.378945 +0300 1495828350 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:35.379846 +0300 1495828355 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:40.381247 +0300 1495828360 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:45.381901 +0300 1495828365 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:52:50.383154 +0300 1495828370 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
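My reading of this badarg flood (an assumption on my part, not something confirmed by the LeoFS code or docs): once the consumers hit reached_max_restart_intensity, the gen_fsm processes behind leo_async_deletion_queue are gone for good, so their locally registered names disappear; leo_mq_api:decrease/3 then presumably does a gen_fsm:send_event/2 to a name that no longer exists, which matches the badarg at gen_fsm.erl line 215 in the stack traces above. A minimal reproduction of that shape in a plain Erlang shell (the FSM name below is made up):

```erlang
%% Sketch only -- reproduces the badarg shape seen in the log above,
%% assuming the registered consumer name has vanished.
%% 'no_such_registered_fsm' is a hypothetical name, not a real LeoFS process.
catch gen_fsm:send_event(no_such_registered_fsm, run).
%% => {'EXIT', {badarg, [{gen_fsm, send_event, 2, ...}, ...]}}
```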
The node isn't working at this point. I restart it, and the queue starts to process:
leo_async_deletion_queue | running | 122351 | 160 | 950 | async deletion of objs
Error log is typical at first:
[W] storage_1@192.168.3.54 2017-05-26 22:54:44.83565 +0300 1495828484 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/d3/71/30/d37130689a5bb04e1270e85a0442d9944112eb84949360e0732c7313b91eaaf1ccbce0e74a0b9f88917377fe9d08127c38935f0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-26 22:54:44.690582 +0300 1495828484 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/dc/ad/01/dcad01a27ba985514931ae379940fcd8021ecaab7d47e948eb41b3bdeac6808c305a2fd15fbe015dc4c2a542be000846107f830000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-26 22:54:45.79657 +0300 1495828485 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/d3/71/30/d37130689a5bb04e1270e85a0442d9944112eb84949360e0732c7313b91eaaf1ccbce0e74a0b9f88917377fe9d08127c38935f0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-26 22:54:45.689791 +0300 1495828485 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/dc/ad/01/dcad01a27ba985514931ae379940fcd8021ecaab7d47e948eb41b3bdeac6808c305a2fd15fbe015dc4c2a542be000846107f830000000000.xz">>},{cause,timeout}]
but then the mq-stats command starts to freeze, and I get this:
[E] storage_1@192.168.3.54 2017-05-26 22:55:35.421877 +0300 1495828535 null:null 0 gen_fsm leo_async_deletion_queue_consumer_4_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:55:35.670420 +0300 1495828535 leo_mq_server:handle_call/3 287 {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:55:35.851418 +0300 1495828535 null:null 0 gen_fsm leo_async_deletion_queue_consumer_3_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:55:35.964534 +0300 1495828535 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:55:35.966858 +0300 1495828535 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:55:35.967659 +0300 1495828535 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.331.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:55:35.968591 +0300 1495828535 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.329.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:55:40.252705 +0300 1495828540 null:null 0 gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:55:40.273471 +0300 1495828540 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:55:40.274015 +0300 1495828540 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.333.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:55:45.382167 +0300 1495828545 null:null 0 gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:55:45.383698 +0300 1495828545 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:55:45.384491 +0300 1495828545 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.335.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:56:06.248006 +0300 1495828566 null:null 0 gen_fsm leo_async_deletion_queue_consumer_4_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:56:06.248610 +0300 1495828566 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_4_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:56:06.249397 +0300 1495828566 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.14436.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:56:06.618153 +0300 1495828566 null:null 0 gen_fsm leo_async_deletion_queue_consumer_3_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:56:06.618743 +0300 1495828566 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-26 22:56:06.619501 +0300 1495828566 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.14435.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:56:06.619996 +0300 1495828566 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.14435.0> exit with reason reached_max_restart_intensity in context shutdown
[E] storage_1@192.168.3.54 2017-05-26 22:56:06.620377 +0300 1495828566 null:null 0 Supervisor leo_redundant_manager_sup had child undefined started with leo_mq_sup:start_link() at <0.222.0> exit with reason shutdown in context child_terminated
[E] storage_1@192.168.3.54 2017-05-26 22:56:07.236507 +0300 1495828567 null:null 0 Error in process <0.14718.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:07.395666 +0300 1495828567 null:null 0 Error in process <0.14719.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:07.589406 +0300 1495828567 null:null 0 Error in process <0.14721.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:08.34491 +0300 1495828568 null:null 0 Error in process <0.14722.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:08.553459 +0300 1495828568 null:null 0 Error in process <0.14724.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:08.699552 +0300 1495828568 null:null 0 Error in process <0.14726.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:08.750870 +0300 1495828568 null:null 0 Error in process <0.14727.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:09.395709 +0300 1495828569 null:null 0 Error in process <0.14741.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:09.429783 +0300 1495828569 null:null 0 Error in process <0.14742.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:09.536674 +0300 1495828569 null:null 0 Error in process <0.14743.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:09.670552 +0300 1495828569 null:null 0 Error in process <0.14748.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:10.239008 +0300 1495828570 null:null 0 Error in process <0.14754.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:10.395451 +0300 1495828570 null:null 0 Error in process <0.14755.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:10.872669 +0300 1495828570 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:11.79527 +0300 1495828571 null:null 0 Error in process <0.14758.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:11.89153 +0300 1495828571 null:null 0 Error in process <0.14760.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:11.93206 +0300 1495828571 null:null 0 Error in process <0.14761.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:11.291948 +0300 1495828571 null:null 0 Error in process <0.14762.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:11.336069 +0300 1495828571 null:null 0 Error in process <0.14763.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:11.608531 +0300 1495828571 null:null 0 Error in process <0.14769.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:12.78531 +0300 1495828572 null:null 0 Error in process <0.14770.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:12.461563 +0300 1495828572 null:null 0 Error in process <0.14772.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:12.689473 +0300 1495828572 null:null 0 Error in process <0.14773.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:12.812491 +0300 1495828572 null:null 0 Error in process <0.14774.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:14.902513 +0300 1495828574 null:null 0 Error in process <0.14793.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:15.250434 +0300 1495828575 null:null 0 Error in process <0.14800.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:15.266418 +0300 1495828575 null:null 0 Error in process <0.14801.0> on node 'storage_1@192.168.3.54' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] storage_1@192.168.3.54 2017-05-26 22:56:15.873589 +0300 1495828575 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
(the last line starts to repeat at this point)
Note that the empty lines in the log file are really there.
In other words, the current problems are:
1) Queue processing is still freezing without any directly related errors (as shown by storage_2)
2) Something scary is happening on storage_1 (EDIT: fixed names of nodes).
3) The delete process isn't finished; there are still objects in the bucket (however, I assume this is expected, given the badargs in eleveldb:async_iterator early on).
Problem 3) is probably fine for now, I suppose, given that https://github.com/leo-project/leofs/issues/725#issuecomment-302606104 isn't implemented yet, but 1) and 2) worry me, as I don't see any currently open issues related to these problems...
I've updated the deletion bucket processing diagram, which covers https://github.com/leo-project/leofs/issues/725#issuecomment-302606104
@yosukehara Thanks for updating and taking my comments into account.
Some comments.
Check the state of the deletion bucket
Let me confirm the "How to handle multiple delete-bucket requests in parallel" point mentioned in the above comment.
Let me also confirm: is Q3 (async_deletion) the queue that already exists?
How about the "Priority on background jobs (BJ)" point mentioned in the above comment?
It seems the other concerns have been covered by the above diagram. Thanks for your hard work.
@vstax thanks for testing.
As you suspected, problems 1 and 2 give me the impression there is something we have not covered yet. I will dig in further later (I now have one hypothesis that could explain problems 1 and 2).
Note: if you find reached_max_restart_intensity in an error.log, it means something has gone pretty bad (some Erlang processes that are supposed to exist go down permanently because the number of restarts reached a certain threshold within a specific time window). Please restart the server if you face such a case in production. We'd like to tackle this problem somehow (e.g. restarting automatically without human intervention) as a separate issue.
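To illustrate what "restart intensity" means here, a minimal OTP supervisor sketch (the module name, the trivial child, and the 5-restarts-in-10-seconds values are examples of mine, not LeoFS's actual configuration):

```erlang
%% Example only: a one_for_one supervisor that tolerates at most
%% 5 child restarts within 10 seconds. One more crash inside that
%% window makes the supervisor give up: the child is logged with
%% reached_max_restart_intensity and the supervisor itself exits
%% with reason shutdown -- the same pattern as in the error log above.
-module(restart_intensity_example).
-behaviour(supervisor).
-export([start_link/0, init/1, start_sleepy_worker/0]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Trivial permanent worker: a linked process that just sleeps.
start_sleepy_worker() ->
    {ok, spawn_link(fun() -> timer:sleep(infinity) end)}.

init([]) ->
    MaxRestarts = 5,
    MaxSeconds  = 10,
    SupFlags    = {one_for_one, MaxRestarts, MaxSeconds},
    ChildSpec   = {sleepy_worker,
                   {?MODULE, start_sleepy_worker, []},
                   permanent, 1000, worker, [?MODULE]},
    {ok, {SupFlags, [ChildSpec]}}.
```

Killing the child more than 5 times within 10 seconds (e.g. `[{_, Pid, _, _}] = supervisor:which_children(restart_intensity_example), exit(Pid, kill).` repeated in a shell) reproduces the reached_max_restart_intensity / shutdown pair.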
@vstax I guess #744 could be the root cause of problems 1 and 2 here, so could you try the same test with the latest leo_mq if you can spare the time?
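A rough sketch of how the leo_mq dependency might be pointed at its development branch when rebuilding leo_storage (the rebar.config layout, repo URL, and branch name here are assumptions, not an official upgrade procedure):

```erlang
%% Hypothetical rebar.config fragment -- pin leo_mq to its development
%% branch before rebuilding; adjust the URL/branch to whatever the
%% LeoFS build actually uses.
{deps, [
    {leo_mq, ".*",
     {git, "https://github.com/leo-project/leo_mq.git", {branch, "develop"}}}
]}.
```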
@mocchira I did the tests with the leo_mq "devel" version. Unfortunately, the results are still not good (maybe somewhat better than before? At least the logs seem less scary, but that might be due to something else).
At first I was restarting the cluster. I upgraded and restarted storage_2 after storage_1, which was already running the new version. So as storage_1 tried to process its queue, it ran into problems because of #728:
[W] storage_1@192.168.3.54 2017-05-30 18:59:24.508983 +0300 1496159964 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',{'EXIT',{badarg,[{ets,lookup,[leo_env_values,{env,leo_redundant_manager,server_type}],[]},{leo_misc,get_env,3,[{file,"src/leo_misc.erl"},{line,127}]},{leo_redundant_manager_api,table_info,1,[{file,"src/leo_redundant_manager_api.erl"},{line,1218}]},{leo_redundant_manager_api,checksum,1,[{file,"src/leo_redundant_manager_api.erl"},{line,368}]},{rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,206}]}]}}}
[W] storage_1@192.168.3.54 2017-05-30 18:59:29.546031 +0300 1496159969 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 18:59:34.549199 +0300 1496159974 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 18:59:39.566360 +0300 1496159979 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 18:59:44.571165 +0300 1496159984 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 18:59:49.576512 +0300 1496159989 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 18:59:51.140253 +0300 1496159991 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/34/33/8b/34338b4623dbdc08681b9d4c2697835cc8d5dba2046342060b1acdbf3c90308c98f3adb2de4d65a467a9d75591cd3e6db0c27b0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-30 18:59:52.137261 +0300 1496159992 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/34/33/8b/34338b4623dbdc08681b9d4c2697835cc8d5dba2046342060b1acdbf3c90308c98f3adb2de4d65a467a9d75591cd3e6db0c27b0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-30 18:59:53.287177 +0300 1496159993 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/10/0f/24/100f2430737d73d12a8c26fdb3cf2cae1fd5c2d3a72e11889842ccaa62ee59be0ddb3a9634110c65672b230966f898d47ca7000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-30 18:59:54.272404 +0300 1496159994 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/10/0f/24/100f2430737d73d12a8c26fdb3cf2cae1fd5c2d3a72e11889842ccaa62ee59be0ddb3a9634110c65672b230966f898d47ca7000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-30 19:00:04.629192 +0300 1496160004 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 19:00:09.635931 +0300 1496160009 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[E] storage_1@192.168.3.54 2017-05-30 19:00:42.642626 +0300 1496160042 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.323.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-30 19:00:42.649184 +0300 1496160042 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_4_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 4) at <0.319.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-30 19:00:45.46415 +0300 1496160045 null:null 0 gen_fsm leo_async_deletion_queue_consumer_3_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:00:45.46836 +0300 1496160045 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-30 19:00:45.47366 +0300 1496160045 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.321.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[W] storage_1@192.168.3.54 2017-05-30 19:00:47.401982 +0300 1496160047 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[E] storage_1@192.168.3.54 2017-05-30 19:00:52.385800 +0300 1496160052 null:null 0 gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:00:52.386242 +0300 1496160052 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-30 19:00:52.386837 +0300 1496160052 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.325.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[W] storage_1@192.168.3.54 2017-05-30 19:01:07.572285 +0300 1496160067 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 19:01:12.584661 +0300 1496160072 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 19:01:17.630653 +0300 1496160077 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 19:01:22.636858 +0300 1496160082 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[W] storage_1@192.168.3.54 2017-05-30 19:01:37.729682 +0300 1496160097 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
[skipped]
[W] storage_1@192.168.3.54 2017-05-30 19:02:12.840406 +0300 1496160132 leo_membership_cluster_local:compare_with_remote_chksum/3 405 {'storage_2@192.168.3.55',nodedown}
During this period the queue was barely consuming; this change took over a minute:
leo_async_deletion_queue | running | 121462 | 1600 | 500 | async deletion of objs
leo_async_deletion_queue | running | 121462 | 1600 | 500 | async deletion of objs
leo_async_deletion_queue | running | 121462 | 1600 | 500 | async deletion of objs
leo_async_deletion_queue | running | 121283 | 640 | 800 | async deletion of objs
leo_async_deletion_queue | running | 121283 | 320 | 900 | async deletion of objs
leo_async_deletion_queue | running | 121283 | 160 | 950 | async deletion of objs
This is, of course, a known problem; I only wanted to note that until #728 is fixed, doing a "delete bucket" operation with even a single node down - or restarting a node while the delete process is going on - seems extremely problematic.
After that, at around 19:02, storage_2 came back up, at which point queue processing stopped entirely! There was nothing in the error log anymore. I waited for 3 minutes or so and nothing was happening at all. The queue didn't want to process:
leo_async_deletion_queue | idling | 121227 | 0 | 1400 | async deletion of objs
leo_async_deletion_queue | running | 121227 | 0 | 1950 | async deletion of objs
leo_async_deletion_queue | idling | 121227 | 0 | 2400 | async deletion of objs
leo_async_deletion_queue | idling | 121227 | 0 | 2650 | async deletion of objs
This is a problem, of course; no idea if it's yet another one or the same (queue processing frozen), but this time it was caused by another node being down for some time.
I restarted storage_1 at this point (~ at 19:05:03). The queue started to process (I was checking mq-stats every 5-10 seconds):
leo_async_deletion_queue | running | 120390 | 1600 | 500 | async deletion of objs
leo_async_deletion_queue | running | 119837 | 1600 | 500 | async deletion of objs
leo_async_deletion_queue | running | 118829 | 1600 | 500 | async deletion of objs
leo_async_deletion_queue | running | 115301 | 1280 | 600 | async deletion of objs
leo_async_deletion_queue | running | 113549 | 1120 | 650 | async deletion of objs
leo_async_deletion_queue | running | 112752 | 1120 | 650 | async deletion of objs
leo_async_deletion_queue | running | 110715 | 960 | 700 | async deletion of objs
leo_async_deletion_queue | running | 107220 | 480 | 850 | async deletion of objs
leo_async_deletion_queue | running | 106909 | 320 | 900 | async deletion of objs
leo_async_deletion_queue | running | 106429 | 320 | 900 | async deletion of objs
Around this time "mq-stats" started to freeze (that is, it took 5-7 seconds before I got a reply). A short time after, this appeared in the error log:
[E] storage_1@192.168.3.54 2017-05-30 19:06:45.586377 +0300 1496160405 leo_mq_server:handle_call/3 285 {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:06:45.587920 +0300 1496160405 null:null 0 gen_fsm leo_async_deletion_queue_consumer_2_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:06:45.618785 +0300 1496160405 null:null 0 gen_fsm leo_async_deletion_queue_consumer_3_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:06:46.448213 +0300 1496160406 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-30 19:06:46.450600 +0300 1496160406 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_3_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-30 19:06:46.451757 +0300 1496160406 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.313.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-30 19:06:46.453556 +0300 1496160406 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_3_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 3) at <0.311.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-30 19:06:55.715273 +0300 1496160415 null:null 0 gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:06:55.715597 +0300 1496160415 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-30 19:06:55.716086 +0300 1496160415 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.315.0> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
Something apparently restarted in the node at this point; about 20 seconds later the node started to respond again, and when I got the next mq-stats results the queue was consuming once again:
leo_async_deletion_queue | running | 91788 | 1120 | 650 | async deletion of objs
leo_async_deletion_queue | running | 83837 | 640 | 800 | async deletion of objs
leo_async_deletion_queue | running | 81511 | 480 | 850 | async deletion of objs
leo_async_deletion_queue | running | 76937 | 160 | 950 | async deletion of objs
Here mq-stats started to freeze again. A few seconds later this appeared in the log file:
[E] storage_1@192.168.3.54 2017-05-30 19:08:13.853216 +0300 1496160493 leo_mq_server:handle_call/3 285 {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,{first_n,0},30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:08:14.133559 +0300 1496160494 null:null 0 gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:08:14.134668 +0300 1496160494 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-30 19:08:14.145193 +0300 1496160494 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_2_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 2) at <0.11208.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-05-30 19:08:23.855101 +0300 1496160503 null:null 0 gen_fsm leo_async_deletion_queue_consumer_1_1 in state running terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-30 19:08:23.855684 +0300 1496160503 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_1_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
[E] storage_1@192.168.3.54 2017-05-30 19:08:23.856325 +0300 1496160503 null:null 0 Supervisor leo_mq_sup had child undefined started with leo_mq_consumer:start_link(leo_async_deletion_queue_consumer_1_1, leo_async_deletion_queue, {mq_properties,leo_async_deletion_queue,undefined,leo_storage_mq,leveldb,4,1,"/usr/local/leofs/w...",...}, 1) at <0.15227.1> exit with reason {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}} in context child_terminated
After a short time, something restarted again and the queue started to consume for a short while, but then stopped:
leo_async_deletion_queue | running | 68152 | 1280 | 600 | async deletion of objs
leo_async_deletion_queue | running | 59989 | 0 | 2100 | async deletion of objs
leo_async_deletion_queue | suspending | 59989 | 0 | 1250 | async deletion of objs
leo_async_deletion_queue | suspending | 59989 | 0 | 1250 | async deletion of objs
leo_async_deletion_queue | idling | 59989 | 0 | 1550 | async deletion of objs
leo_async_deletion_queue | idling | 59989 | 0 | 1550 | async deletion of objs
leo_async_deletion_queue | idling | 59989 | 0 | 1850 | async deletion of objs
leo_async_deletion_queue | idling | 59989 | 0 | 2500 | async deletion of objs
This time the queue stopped processing for good, and there were no errors in the logs at all. There is nothing in the error logs of the other nodes either (except for the obvious errors when I restarted storage_1). One thing I'm noticing is that it generally takes roughly the same time (1:10 - 1:30) before the problem appears.
@vstax Thanks for the additional try. It seems the leo_backend_db used by leo_mq got stuck for some reason. Besides normal operations, one possible cause of leo_backend_db getting stuck is too many orders coming from leo_watchdog like
In order to confirm that the above assumption is correct, could you do the same test with leo_watchdog disabled? I will vet other possibilities.
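For reference, this is roughly what that looks like in leo_storage.conf - a sketch only; the key name below is written from memory of the 1.3.x config template, so please verify it against your own leo_storage.conf before applying, and restart the storage node afterwards:

# Turn the disk watchdog off for the test (assumed key name - check your config template)
watchdog.disk.is_enabled = false
# Flip it back to true later for the follow-up test with the watchdog enabled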
Also, https://github.com/leo-project/leofs/issues/746 should mitigate this problem, so please try again after the PR for #746 is merged into develop.
Edit: The PR for #746 has now been merged into develop.
@mocchira Thank you for the advice! The results are amazing: problems 1 and 2 are gone completely once I turned off the disk watchdog. I remember the watchdog causing problems in the past during upload experiments, but once you implemented https://github.com/leo-project/leo_watchdog/commit/8a30a1730ea376439b6764e02d9c875996629d39 they disappeared and I was able to do massive parallel uploads with the watchdog enabled, so I left it like that, thinking it shouldn't create any more problems. Apparently I was wrong; it affects these massive deletes as well.
I should note that the disk utilization watchdog never triggered, but the disk capacity watchdog triggered all the time on storage_2 and storage_1 (I tried moving the thresholds before to get rid of it, but it didn't work; I've filed #747 about it now).
With the disk watchdog disabled (note that I haven't tried the very latest leo_mq with the fix for #746 yet) the queues are processed very smoothly and always down to 0. They only show the states "running" and "idling" now, and the batch size / interval don't fluctuate anymore, staying at 1600 / 500.
As an interesting note, the badarg from eleveldb which used to happen soon after the initial "delete bucket" request - after which the delete queues stopped growing and started to consume - doesn't happen anymore either. But the queues still behave the same way: they grow quickly to a certain point (70-90K messages on each node), then start consuming. So apparently that badarg wasn't the reason why the "delete bucket" operation stops adding messages to the queue; it must be something else. The only errors now are, on the gateway:
[W] gateway_0@192.168.3.52 2017-05-31 18:39:02.156972 +0300 1496245142 leo_gateway_s3_api:delete_bucket_2/3 1798 [{cause,timeout}]
[W] gateway_0@192.168.3.52 2017-05-31 18:39:07.158998 +0300 1496245147 leo_gateway_s3_api:delete_bucket_2/3 1798 [{cause,timeout}]
It always happens twice, the second one 5 seconds after the first, even though I only send the delete request once and then terminate the client.
And typical ones for storage nodes:
[W] storage_0@192.168.3.53 2017-05-31 18:40:44.960991 +0300 1496245244 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/3a/fd/9b/3afd9b09a43527084a2d099b31989faff4755d2e31a99f0823d3d1501592513dedd1defe1b25cb1aa7e061cdb7d5c47e0400000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-05-31 18:40:45.961997 +0300 1496245245 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/3a/fd/9b/3afd9b09a43527084a2d099b31989faff4755d2e31a99f0823d3d1501592513dedd1defe1b25cb1aa7e061cdb7d5c47e0400000000000000.xz">>},{cause,timeout}]
and
[W] storage_1@192.168.3.54 2017-05-31 18:40:37.974512 +0300 1496245237 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-31 18:40:37.981806 +0300 1496245237 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_7,{get,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},30000]}}}]
[W] storage_1@192.168.3.54 2017-05-31 18:40:38.977943 +0300 1496245238 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-31 18:41:08.981477 +0300 1496245268 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/1d/43/9f/1d439fc40911b910064d4eba9f4d8c6f83cb6915a1e92da99170b47c996c200574106e99823b64c0fa15a0f973354cffa420010000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-05-31 18:41:09.981144 +0300 1496245269 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/1d/43/9f/1d439fc40911b910064d4eba9f4d8c6f83cb6915a1e92da99170b47c996c200574106e99823b64c0fa15a0f973354cffa420010000000000.xz">>},{cause,timeout}]
(here the second one, with "replicate_fun", is a bit different from the others)
@vstax Good to hear that works for you.
I remember the watchdog causing problems in the past during upload experiments, but once you implemented leo-project/leo_watchdog@8a30a17 they disappeared and I was able to do massive parallel uploads with the watchdog enabled, so I left it like that, thinking it shouldn't create any more problems. Apparently I was wrong; it affects these massive deletes as well.
That's actually good for all of us, because we could find another culprit in leo_watchdog :) I will file the issue that leo_watchdog for the disk can overload leo_backend_db as a separate one later.
As an interesting note, the badarg from eleveldb which used to happen soon after the initial "delete bucket" request - after which the delete queues stopped growing and started to consume - doesn't happen anymore either. But the queues still behave the same way: they grow quickly to a certain point (70-90K messages on each node), then start consuming. So apparently that badarg wasn't the reason why the "delete bucket" operation stops adding messages to the queue; it must be something else. The only errors now are, on the gateway:
It's still not clear, but I have one assumption that could explain why delete-bucket operations can stop. Since the reason involves how the Erlang runtime behaves when an Erlang process has lots of messages in its mailbox, the detailed/correct explanation is too complicated, so to cut a long story short:
In our delete-bucket case, Erlang processes that try to add messages to leo_mq can be blocked for a long time by the Erlang scheduler when the corresponding leo_mq_server process has lots of items in its mailbox. You can think of this mechanism as a kind of back-pressure algorithm to prevent a busy process from getting stuck with too many messages sent by others.
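If you want to confirm that this kind of mailbox congestion is actually happening, one option (just a sketch - LeoFS doesn't expose this directly) is to attach a remote Erlang shell to the storage node and inspect the mailbox of the queue process that shows up in the timeouts above:

%% In an Erlang shell attached to the storage node (e.g. via erl -remsh).
%% leo_async_deletion_queue_message_0 is the registered name seen in your logs;
%% whereis/1 returns undefined if no such process is registered on this node.
Pid = whereis(leo_async_deletion_queue_message_0),
erlang:process_info(Pid, [message_queue_len, current_function, status]).
%% A message_queue_len in the tens of thousands would confirm the congestion.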
It always happens twice, the second one 5 seconds after the first, even though I only send the delete request once and then terminate the client.
This will be solved when we finish implementing https://github.com/leo-project/leofs/issues/725#issuecomment-304567505.
Any error caused by a timeout can be considered the result of the Erlang scheduler punishment described above. We will try to make that happen as rarely as possible anyway.
@vstax I'd like to ask you to do the same test with leo_watchdog_disk enabled after https://github.com/leo-project/leo_watchdog/pull/6 is merged into develop. That PR should fix the delete-bucket problem even with the disk watchdog enabled.
@mocchira Unfortunately, it will take me some time to do this, as it seems I need to wipe the whole cluster and fill it with data before trying again. I've tried removing another bucket, "body" (which has about the same number of objects as bodytest had, around 1M), but I get a problem: first of all, s3cmd completes instantly:
$ s3cmd rb s3://body
Bucket 's3://body/' removed
HTTP log:
<- DELETE http://body.s3.amazonaws.com/ HTTP/1.1
-> HTTP/1.1 204 No Content
Then I get this error on each storage node:
[E] storage_1@192.168.3.54 2017-06-02 20:48:36.825216 +0300 1496425716 leo_backend_db_eleveldb:prefix_search/3 223 {badrecord,metadata_3}
and that's it. Nothing else happens; I can create the bucket again and it contains all the objects it did before.
E.g.
[root@leo-m0 ~]# /usr/local/bin/leofs-adm whereis body/eb/63/27/eb6327350a33926f4045d7138bec30fac791d4648265a4c387b03858c13a4931696a0058b7071810a4639a01d8de0c8c0079010000000000.xz
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
del? | node | ring address | size | checksum | has children | total chunks | clock | when
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
| storage_2@192.168.3.55 | c5a3967eea6e948e4a2cd59a1b5c1866 | 47K | 9b271ea883 | false | 0 | 550fcb2fa9862 | 2017-06-02 19:32:28 +0300
| storage_1@192.168.3.54 | c5a3967eea6e948e4a2cd59a1b5c1866 | 47K | 9b271ea883 | false | 0 | 550fcb2fa9862 | 2017-06-02 19:32:28 +0300
I've tried various settings and versions, but I always get the same error.
"bodytest" seems to be no good anymore because I exhausted it; after a few tries there are no more objects in there (though I still get "timeout" twice on the gateway and the storage nodes produce CPU load for a minute or so - expected, I guess, at least until compaction is performed). Since removing "body" doesn't work at all, I'll need to fill the cluster with whole new data.
Technically there should be no difference between the "body" and "bodytest" buckets except for the fact that "bodytest" contains all objects in "subdirectories" like 00/01/01, while "body" contains that AND around 10K objects directly in the bucket without any "subdirectories" - which makes an ls operation on it impossible.
@vstax
Thanks for trying. It turned out that you hit another issue when deleting the body bucket. Currently, objects created by LeoFS <= 1.3.2.1 are NOT removable with LeoFS >= 1.3.3 through a delete-bucket operation. I will file this issue later, so please give it another try once the issue is fixed.
EDIT: filed the issue on https://github.com/leo-project/leofs/issues/754
@mocchira Nice, thanks. I'll test this fix some time later, as I've moved on to other experiments (but I kept a copy of the data which gave me the problem with removing the bucket).
Regarding the fix from https://github.com/leo-project/leo_watchdog/pull/6 - it doesn't seem to do anything for me. The problem still persists on storage_1 and storage_2, which give the watchdog warning (a warning, not an error - like you described at #747). I have the latest leo_mq and leo_watchdog (0.12.8).
Unfiltered log of storage_1 looks like this:
[W] storage_1@192.168.3.54 2017-06-06 17:05:13.880756 +0300 1496757913 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205360440},{available,39413936},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_1@192.168.3.54 2017-06-06 17:05:23.894801 +0300 1496757923 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205360440},{available,39413936},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_1@192.168.3.54 2017-06-06 17:05:24.708635 +0300 1496757924 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-06-06 17:05:25.708490 +0300 1496757925 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/39/58/a6/3958a6e0b7e1f33eaec7c5634498bb65579d13b8dff8983943b9144359b74206627e398b44a7ab0a29eb00169a9651230600100000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-06-06 17:05:33.913248 +0300 1496757933 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205360440},{available,39413936},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_1@192.168.3.54 2017-06-06 17:05:43.921983 +0300 1496757943 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205360556},{available,39413820},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_1@192.168.3.54 2017-06-06 17:05:53.928851 +0300 1496757953 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,205364092},{available,39410284},{use_percentage,85},{use_percentage_str,"84%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
(and so on)
Same for storage_2:
[W] storage_2@192.168.3.55 2017-06-06 17:04:43.514332 +0300 1496757883 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55 2017-06-06 17:04:53.526332 +0300 1496757893 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55 2017-06-06 17:05:03.536143 +0300 1496757903 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55 2017-06-06 17:05:12.791329 +0300 1496757912 leo_storage_handler_object:replicate_fun/3 1385 [{cause,"Could not get a metadata"}]
[E] storage_2@192.168.3.55 2017-06-06 17:05:12.791688 +0300 1496757912 leo_storage_handler_object:put/4 416 [{from,storage},{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{req_id,0},{cause,"Could not get a metadata"}]
[W] storage_2@192.168.3.55 2017-06-06 17:05:12.802626 +0300 1496757912 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-06-06 17:05:13.547962 +0300 1496757913 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55 2017-06-06 17:05:13.819000 +0300 1496757913 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-06-06 17:05:23.570099 +0300 1496757923 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
[W] storage_2@192.168.3.55 2017-06-06 17:05:33.585550 +0300 1496757933 leo_watchdog_disk:check/4 307 [{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202829220},{available,41945156},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]
On both these nodes leo_async_deletion_queue processing freezes soon after these "timeout" messages. On storage_0, for which the watchdog doesn't trigger, the queue processes fine despite the same timeouts:
[W] storage_0@192.168.3.53 2017-06-06 17:05:12.775529 +0300 1496757912 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-06-06 17:05:13.770076 +0300 1496757913 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/01/35/3f/01353fb26857021c29f1bfc11cd74d4f7ef7adc5a075f4dbaec11f2f79a37d1a8bf33f4afa1c0c22e8e012552e1a3c8e2c03010000000000.xz">>},{cause,timeout}]
This is reproduced each time. If I disable watchdog on storage_1 and storage_2, they process messages fine, just like storage_0 (for which disk space watchdog is enabled but does not trigger).
As a side note: the leofs-adm du <storage-node> command, which normally executes instantly, hangs and eventually times out during a delete bucket operation (the initial part, when the queues are filling). When the queues stop filling and start consuming, it works again. A bug / undocumented feature? I kind of thought it was a pretty lightweight operation, since normally you get the result right away even under load, but now I'm not so sure.
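For what it's worth, a simple way to watch this (just a sketch along the lines of the commands already shown above, with the node name from my setup) is to poll du with a timestamp during both phases and compare how long each call takes:

[root@leo-m0 ~]# while sleep 10; do date +%T; time /usr/local/bin/leofs-adm du storage_1@192.168.3.54 > /dev/null; done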
@vstax Thanks for retrying.
Regarding the fix from leo-project/leo_watchdog#6 - it doesn't seem to do anything for me. The problem still persists on storage_1 and storage_2, which give the watchdog warning (a warning, not an error - like you described at #747). I have the latest leo_mq and leo_watchdog (0.12.8).
Got it. It seems there are still the same kind of problems in other places, so I will get back to vetting again.
As a side note: the leofs-adm du command, which normally executes instantly, hangs and eventually times out during a delete bucket operation (the initial part, when the queues are filling). When the queues stop filling and start consuming, it works again. A bug / undocumented feature? I kind of thought it was a pretty lightweight operation, since normally you get the result right away even under load, but now I'm not so sure.
Yes, it's a kind of known problem (any response from leofs-adm can get delayed when there are lots of tasks generated by background jobs), at least among devs. Those problems can be mitigated by fixing https://github.com/leo-project/leofs/issues/753; however, I will file it as a separate issue for sure. Thanks for reminding me of that.
@mocchira I've tested the fix for #754 and it works perfectly; I was able to delete the original "body" bucket - in about 8-10 create+delete bucket operations, anyway. Just like before, each delete generates 70-90k delete messages per storage node, so with 3 storage nodes, 2 copies and 1M objects it's supposed (currently) to be like that - the number of messages generated before the operation stops is quite stable.
It's a pretty slow operation overall, so I tried to observe it a bit, and one interesting thing I've noticed is that deletes go through the AVS files in strict order: first all objects that are in 0.avs are found / marked as deleted, then it goes on to 1.avs and so on. This confused me a bit - is it by design / for simplicity? Well, this operation is not disk-bound anyway (I can see it being CPU-bound by "Eleveldb" threads, even though these threads don't really reach 100% CPU on average; it kind of feels like there are some internal locks preventing it from going faster, either that or not enough processing threads or something like that).
Though, of course, I understand that there is no real need for the "delete bucket" operation to be high-performance, as long as it's reliable. Just a random observation.
Yes, it's a kind of known problem (any response from leofs-adm can get delayed when there are lots of tasks generated by background jobs), at least among devs.
The reason why I asked about the du operation is that other responses from leofs-adm are not delayed. This is quite unlike the problem with the watchdog/leo_backend_db, which makes the node use high CPU, doesn't allow you to get results from the mq-stats operation, and so on.
When the first stage of delete-bucket (filling of the queues) is going on, the node is very responsive overall. I get instant and precise numbers from the mq-stats operation as well. It's only du that seems to be having problems during this time. I also get the (maybe wrong) feeling that it doesn't work at all until this stage is over, so it's not a simple delay. Right after the queue stops filling, du starts to give an instant response, even though the node is busy consuming these messages. So I thought this might be something else.
@vstax
It's a pretty slow operation overall, so I tried to observe it a bit, and one interesting thing I've noticed is that deletes go through the AVS files in strict order: first all objects that are in 0.avs are found / marked as deleted, then it goes on to 1.avs and so on. This confused me a bit - is it by design / for simplicity? Well, this operation is not disk-bound anyway (I can see it being CPU-bound by "Eleveldb" threads, even though these threads don't really reach 100% CPU on average; it kind of feels like there are some internal locks preventing it from going faster, either that or not enough processing threads or something like that).
What roughly happens behind the scenes is:
and since there is a one-to-one relationship between an AVS file and its metadata (managed by eleveldb), deletes happen from 1.avs to n.avs one by one. (Yes, it's expected.)
I think the reason why it's slow might be the sleep that intentionally happens at regular intervals according to the configuration here: https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_storage/priv/leo_storage.conf#L236-L243, in order to reduce the load generated by background jobs.
So could you please retry with the configuration tweaked to shorter intervals?
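For example, something like the following in leo_storage.conf (a sketch only - I'm writing the key names from memory, so please check them against the conf template linked above before changing anything, and restart the node afterwards):

# Consumption batch size per cycle (the "batch of msgs" column in mq-stats)
mq.num_of_batch_process_regular = 1600
# Sleep between consumption cycles in msec (the "interval" column in mq-stats);
# lowering these should make the consumers sleep less between batches
mq.interval_between_batch_procs_regular = 50
mq.interval_between_batch_procs_max = 300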
When the first stage of delete-bucket (filling of the queues) is going on, the node is very responsive overall. I get instant and precise numbers from the mq-stats operation as well. It's only du that seems to be having problems during this time. I also get the (maybe wrong) feeling that it doesn't work at all until this stage is over, so it's not a simple delay. Right after the queue stops filling, du starts to give an instant response, even though the node is busy consuming these messages. So I thought this might be something else.
Obviously something goes wrong! I will vet.
@vstax I found out the reason why du can get stuck and filed it at https://github.com/leo-project/leofs/issues/758.
@mocchira Nice, thank you. Our monitoring tries to execute du and compact-status to monitor detailed node statistics, as this information (ratio of active size, compaction status and compaction start date) is not available over SNMP, so it's good to know that when it doesn't produce any results, nothing is seriously broken and it's a known issue.
Regarding sequential AVS processing: processing the AVS files in the same directory in order is hardly a problem. I was more interested in how it processes files on multiple JBOD drives on a real storage node, where each drive has its own directory for AVS files; if it doesn't utilize all drives in parallel, that might be somewhat of a problem under certain conditions (or might not be a problem, since I don't really see any IO load during any stage of the delete-bucket operation). But as we are not running production LeoFS yet, I'm just wondering about it beforehand.
Regarding the sleep interval: strangely enough, it doesn't seem to be the reason why it's slow. I've reduced the interval by a factor of 10 and the speed is the same. I then reduced the intervals to 1/500 of the original, setting "regular" to 1 msec, and it's still the same. Example of monitoring the queue once per second with the sleep reduced to 1 msec:
[root@leo-m0 ~]# while sleep 1; do /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep leo_async_deletion_; done
leo_async_deletion_queue | running | 58261 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 57861 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 57861 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 57461 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 57061 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 57061 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 56661 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 56661 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 56661 | 1600 | 1 | async deletion of objs
leo_async_deletion_queue | running | 56261 | 1600 | 1 | async deletion of objs
Top output at that time for beam.smp (per-thread):
top - 17:34:48 up 20:45, 1 user, load average: 1,02, 0,57, 0,29
Threads: 131 total, 0 running, 131 sleeping, 0 stopped, 0 zombie
%Cpu0 : 8,5 us, 2,8 sy, 0,0 ni, 48,4 id, 0,0 wa, 0,0 hi, 0,4 si, 39,9 st
%Cpu1 : 2,3 us, 1,0 sy, 0,0 ni, 89,8 id, 0,0 wa, 0,0 hi, 0,0 si, 6,9 st
%Cpu2 : 8,1 us, 2,4 sy, 0,0 ni, 84,1 id, 0,0 wa, 0,0 hi, 0,0 si, 5,4 st
%Cpu3 : 4,0 us, 1,3 sy, 0,0 ni, 86,5 id, 0,0 wa, 0,0 hi, 0,0 si, 8,3 st
%Cpu4 : 0,3 us, 0,7 sy, 0,0 ni, 96,7 id, 0,0 wa, 0,0 hi, 0,0 si, 2,3 st
%Cpu5 : 7,5 us, 2,4 sy, 0,0 ni, 81,6 id, 0,0 wa, 0,0 hi, 0,0 si, 8,5 st
%Cpu6 : 1,3 us, 0,7 sy, 0,0 ni, 92,1 id, 1,0 wa, 0,0 hi, 0,0 si, 4,9 st
%Cpu7 : 0,0 us, 0,0 sy, 0,0 ni,100,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 8010004 total, 531088 free, 410228 used, 7068688 buff/cache
KiB Swap: 4194300 total, 4194296 free, 4 used. 7255088 avail Mem
PID USER PR NI VIRT RES SHR S P %CPU %MEM TIME+ COMMAND
10550 leofs 20 0 5426228 219172 23572 S 0 39,2 2,7 1:23.18 1_scheduler
10638 leofs 20 0 5426228 219172 23572 S 1 27,2 2,7 1:50.93 Eleveldb
10558 leofs 20 0 5426228 219172 23572 S 1 17,6 2,7 0:37.84 aux
10551 leofs 20 0 5426228 219172 23572 S 6 7,2 2,7 0:44.62 2_scheduler
10526 leofs 20 0 5426228 219172 23572 S 6 4,1 2,7 0:05.02 async_10
10631 leofs 20 0 5426228 219172 23572 S 3 1,6 2,7 0:21.74 Eleveldb
10552 leofs 20 0 5426228 219172 23572 S 5 1,6 2,7 0:24.70 3_scheduler
10639 leofs 20 0 5426228 219172 23572 S 3 1,5 2,7 0:08.08 Eleveldb
10621 leofs 20 0 5426228 219172 23572 S 6 0,0 2,7 0:00.16 Eleveldb
10452 leofs 20 0 5426228 219172 23572 S 2 0,0 2,7 0:00.03 beam.smp
10515 leofs 20 0 5426228 219172 23572 S 3 0,0 2,7 0:00.00 sys_sig_dispatc
10516 leofs 20 0 5426228 219172 23572 S 6 0,0 2,7 0:00.00 sys_msg_dispatc
Another example:
PID USER PR NI VIRT RES SHR S P %CPU %MEM TIME+ COMMAND
12638 leofs 20 0 5349500 212376 57516 R 5 19,9 2,7 0:45.31 aux
12631 leofs 20 0 5349500 212376 57516 R 0 19,3 2,7 0:50.50 2_scheduler
12632 leofs 20 0 5349500 212376 57516 S 4 18,9 2,7 0:11.33 3_scheduler
12718 leofs 20 0 5349500 212376 57516 S 3 16,3 2,7 0:44.68 Eleveldb
12630 leofs 20 0 5349500 212376 57516 S 0 14,6 2,7 1:25.27 1_scheduler
12711 leofs 20 0 5349500 212376 57516 S 1 9,3 2,7 0:23.23 Eleveldb
12616 leofs 20 0 5349500 212376 57516 S 5 7,0 2,7 0:05.72 async_20
12704 leofs 20 0 5349500 212376 57516 S 1 5,3 2,7 0:04.01 Eleveldb
12712 leofs 20 0 5349500 212376 57516 S 1 1,3 2,7 0:05.32 Eleveldb
12719 leofs 20 0 5349500 212376 57516 S 5 1,0 2,7 0:13.96 Eleveldb
12633 leofs 20 0 5349500 212376 57516 S 4 0,7 2,7 0:00.56 4_scheduler
It always looks something like that during queue processing - the 1/2/3_scheduler and aux threads consume most of the CPU. I've also gathered a leo_doctor log here: https://pastebin.com/bBT5mLLA
But, well, like I said, it probably doesn't matter that much for now so I don't think you should worry about it (I'm writing about it in detail just in case you might spot some anomaly caused by some problem related to this ticket).
@vstax
Thanks for your detailed report. The result gathered by leo_doctor revealed that delete-bucket can slow down due to imbalanced items stored in the async_deletion_queue, which causes the queue consumers to get stuck more than necessary. Once the first phase of a delete-bucket is done, the items stored in the async_deletion_queue look like the below:
| **ALL** items belonging to metadata_0 | **ALL** items belonging to metadata_1 | ... | **ALL** items belonging to metadata_n|
^ head
ALL items belonging to metadata_N are converged into ONE big chunk. The second phase, in which the queue consumers pop items and delete the corresponding objects, then works like the below:
consumer_0 consumer_1 consumer_2 consumer_3
| | | |
-----------------------------------------------------------------
| <--- congestion could happen here
metadata_N, object_storage_N
To solve this issue, we may have to iterate over the metadata in parallel and produce items distributed evenly across each metadata (see the sketch below for the rough idea). I will file this as another issue later on. Thanks again.
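Just to illustrate that idea (purely a sketch, not LeoFS code): if the enqueue phase produced one list of items per metadata_N, interleaving those lists round-robin before insertion would keep the consumers working on different metadata/object_storage pairs instead of converging on a single one:

-module(interleave_sketch).
-export([interleave/1]).

%% interleave([[a1,a2], [b1], [c1,c2,c3]]) -> [a1,b1,c1,a2,c2,c3]
interleave(Lists) ->
    interleave(Lists, []).

interleave([], Acc) ->
    lists:reverse(Acc);
interleave(Lists, Acc) ->
    %% Take one head from every non-empty list per round (round-robin)
    %% and keep the tails for the next round.
    {Heads, Tails} =
        lists:foldr(fun([], HT) -> HT;
                       ([X | Xs], {H, T}) -> {[X | H], [Xs | T]}
                    end, {[], []}, Lists),
    interleave(Tails, lists:reverse(Heads) ++ Acc).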
@vstax
This is reproduced each time. If I disable watchdog on storage_1 and storage_2, they process messages fine, just like storage_0 (for which disk space watchdog is enabled but does not trigger). Got it. It seems there are still same kind of problems at other places so I will get back to vet again.
It turned out that the overload problem is already gone and there is another beast that probably causes your problem, filed here: https://github.com/leo-project/leofs/issues/776.
Let me summarize the remaining issues around here.
Please let me know if I'm missing something.
@mocchira Thank you, yes, you are quite right and it's indeed #776 that happens if the disk watchdog is enabled and the system has >80% used disk space (and it can't be tweaked through config right now because that hardcoded value creates this problem as well, I think).
I might be nitpicking, but there is still one thing that bothers me: you describe this problem as the batch size being reduced to 0 so that queue processing stops. First question: isn't it bad in general to stop processing completely? What if some other trigger in a different watchdog - say, high CPU usage (maybe caused by something else) - does the same? Might a limit on how much the watchdog can reduce the batch size be a good idea, so that it can never reduce it to 0? Instead, if it's at some defined minimum value and the watchdog triggers, it could output a big fat warning in the log files that something seems to be really wrong. I'm just wondering if this (a safe limit) would be more productive from an operating perspective than stopping all processing.
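To make the suggestion concrete, what I have in mind is just a clamp on whatever reduction step the watchdog applies - a hypothetical sketch with a made-up module name and floor value, not how leo_mq currently works:

-module(batch_floor_sketch).
-export([decrease_batch/2]).

%% Hypothetical floor; in a real setup this would come from configuration.
-define(MIN_BATCH_SIZE, 100).

%% Apply the watchdog-driven reduction but never drop below the floor,
%% so the queue keeps draining slowly instead of stopping completely
%% (and a warning could be logged whenever the floor is hit).
decrease_batch(CurrentBatch, Step) when is_integer(CurrentBatch), is_integer(Step), Step >= 0 ->
    max(?MIN_BATCH_SIZE, CurrentBatch - Step).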
Second question: there were errors in logs, e.g. from https://github.com/leo-project/leofs/issues/725#issuecomment-304376426
[E] storage_1@192.168.3.54 2017-05-26 22:50:53.757233 +0300 1495828253 null:null 0 gen_fsm leo_async_deletion_queue_consumer_2_1 in state idling terminated with reason: {timeout,{gen_server,call,[leo_async_deletion_queue_message_0,status,30000]}}
[E] storage_1@192.168.3.54 2017-05-26 22:50:54.262824 +0300 1495828254 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_async_deletion_queue_consumer_2_1",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_async_deletion_queue_message_0",44,"status",44,"30000"],93]],125]],125]," in ",[["gen_fsm",58,"terminate",47,"7"],[32,108,105,110,101,32,"626"]]]]]
and
[E] storage_1@192.168.3.54 2017-05-26 22:51:35.367975 +0300 1495828295 leo_watchdog_sub:handle_info/2 165 {badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]},{leo_mq_api,'-decrease/3-lc$^0/1-0-',1,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_mq_api,decrease,3,[{file,"src/leo_mq_api.erl"},{line,242}]},{leo_storage_watchdog_sub,handle_notify,3,[{file,"src/leo_storage_watchdog_sub.erl"},{line,89}]},{leo_watchdog_sub,'-handle_info/2-fun-0-',3,[{file,"src/leo_watchdog_sub.erl"},{line,158}]},{lists,foreach,2,[{file,"lists.erl"},{line,1337}]},{leo_watchdog_sub,handle_info,2,[{file,"src/leo_watchdog_sub.erl"},{line,156}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,615}]}]}
Both of these messages can repeat a lot (the second one, with '-decrease/3-lc$^0/1-0-', endlessly, actually). Isn't that an indication of some problem by itself? I mean, the queue being at 0 and processing being stopped is one thing, but some parts of the system seem to try to work under these conditions and experience problems, instead of detecting that their actions are impossible.
@mocchira regarding the issues referenced here, there is (obviously) a main issue that you described at https://github.com/leo-project/leofs/issues/725#issuecomment-302606104, but after it's done there is a need to check that all the related sub-issues are gone, i.e.
(you've mentioned these problems before, I'm just writing them here for the checklist that you are creating)
I know of yet another issue, though I haven't reported it separately because I haven't done experiments yet. It's the one mentioned in #763:
[I] storage_1@192.168.3.54 2017-06-09 21:56:35.824217 +0300 1497034595 leo_compact_fsm_worker:running/2 392 {leo_compact_worker_7,{timeout,{gen_server,call,[leo_metadata_7,{put_value_to_new_db,<<"bodycopy/6b/e7/57/6be75707dc8329a3d120035c2ef2b28dbdc0a9bca7128d741df9cf7719f1743be6d587aa166409683b37358400c5f8efe814050000000000.xz">>,<<131,104,22,100,0,10,109,101,116,97,100,97,116,97,95,51,109,0,0,0,133,98,111,100,121,99,111,112,121,47,54,98,47,101,55,47,53,55,47,54,98,101,55,53,55,48,55,100,99,56,51,50,57,97,51,100,49,50,48,48,51,53,99,50,101,102,50,98,50,56,100,98,100,99,48,97,57,98,99,97,55,49,50,56,100,55,52,49,100,102,57,99,102,55,55,49,57,102,49,55,52,51,98,101,54,100,53,56,55,97,97,49,54,54,52,48,57,54,56,51,98,51,55,51,53,56,52,48,48,99,53,102,56,101,102,101,56,49,52,48,53,48,48,48,48,48,48,48,48,48,48,46,120,122,110,16,0,11,33,243,74,102,221,228,126,48,67,103,248,244,106,57,13,97,133,98,0,1,147,80,109,0,0,0,0,97,0,97,0,97,0,97,0,110,5,0,169,143,253,173,1,110,7,0,96,173,189,176,103,81,5,110,5,0,32,18,173,210,14,110,16,0,67,255,81,173,159,204,109,0,182,231,93,11,157,210,224,227,97,0,100,0,9,117,110,100,101,102,105,110,101,100,97,0,97,0,97,0,97,0,97,0,97,0>>},30000]}}}
I have reason to believe that when "delete bucket" is processing objects from some AVS file (at least in the current, non-parallel version, until #764 is implemented), compaction for that AVS file, if it is going on at that moment, will fail. At the very least I know that it happened for me on both nodes that were deleting objects from 7.avs - compaction for that file failed on both of them, with an info (!) message, not an error message like the one above. I don't actually even have evidence that this message is a symptom of the compaction failing; it's more like:
delete bucket + compaction going through the same file =>
this timeout happens =>
compaction fails
but I don't have evidence yet that the reason and cause are actually like that. I think the deletion is in the stage when objects listed in the queue are actually deleted from the bucket, though.
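If it helps to narrow this down, one way to catch it in the act (a sketch using the same adm commands as elsewhere in this thread, with the node name from my setup) would be to poll compact-status on the node that is currently walking through that AVS file while the delete is running, and note when its state changes:

[root@leo-m0 ~]# while sleep 30; do date +%T; /usr/local/bin/leofs-adm compact-status storage_1@192.168.3.54; done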
@vstax
I might be nitpicking, but there is still one thing that bothers me: you describe this problem as the batch size being reduced to 0 so that queue processing stops. First question: isn't it bad in general to stop processing completely? What if some other trigger in a different watchdog - say, high CPU usage (maybe caused by something else) - does the same? Might a limit on how much the watchdog can reduce the batch size be a good idea, so that it can never reduce it to 0? Instead, if it's at some defined minimum value and the watchdog triggers, it could output a big fat warning in the log files that something seems to be really wrong. I'm just wondering if this (a safe limit) would be more productive from an operating perspective than stopping all processing.
Good point. We had configurable minimum settings in the past; however, those were removed for some reason (I can't remember why off the top of my head). It might be time to take another look at this idea.
Both of these messages can repeat a lot (the second one, with '-decrease/3-lc$^0/1-0-', endlessly, actually). Isn't that an indication of some problem by itself? I mean, the queue being at 0 and processing being stopped is one thing, but some parts of the system seem to try to work under these conditions and experience problems, instead of detecting that their actions are impossible.
Those errors happened due to https://github.com/leo-project/leofs/issues/764 (the leo_backend_db corresponding to the congested leo_mq_server couldn't respond to requests sent through leo_mq_api), so fixing #764 should make those errors less likely to happen. Regarding decrease/3 being called endlessly, I can't answer precisely as I'm not the original author; however, it seems some parts depend on the current behavior (decrease/3, increase/3 being called endlessly). (Please correct me if there are any wrong explanations, @yosukehara.)
(you've mentioned these problems before, I'm just writing them here for the checklist that you are creating)
Thanks! that's really helpful to us.
I have reason to believe that when "delete bucket" is processing objects from some AVS file (at least in the current, non-parallel version, until #764 is implemented), compaction for that AVS file, if it is going on at that moment, will fail. At the very least I know that it happened for me on both nodes that were deleting objects from 7.avs - compaction for that file failed on both of them, with an info (!) message, not an error message like the one above. I don't actually even have evidence that this message is a symptom of the compaction failing; it's more like:
@mocchira I wanted to try how this works in the latest develop version (with leo_manager version 1.3.5), but there seem to be complications. After restarting the cluster with the latest version I get this:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
[ERROR] Could not get records
Log files on both managers are full of these two errors that appear every 10 seconds:
[E] manager_0@192.168.3.50 2017-07-11 18:52:26.527367 +0300 1499788346 leo_manager_del_bucket_handler:handle_info/2, dequeue 219 [{cause,"Mnesia is not available"}]
[E] manager_0@192.168.3.50 2017-07-11 18:52:26.527879 +0300 1499788346 leo_manager_del_bucket_handler:handle_info/2, dequeue 229 [{cause,"Mnesia is not available"}]
[E] manager_0@192.168.3.50 2017-07-11 18:52:36.529233 +0300 1499788356 leo_manager_del_bucket_handler:handle_info/2, dequeue 219 [{cause,"Mnesia is not available"}]
[E] manager_0@192.168.3.50 2017-07-11 18:52:36.529698 +0300 1499788356 leo_manager_del_bucket_handler:handle_info/2, dequeue 229 [{cause,"Mnesia is not available"}]
I've tried restarting the managers in a different sequence and doing "start force-load" on the master, but it doesn't seem to change anything. The cluster seems to work fine otherwise. I can see lots of new queues in the "mq-stats" output for the storage nodes as well. There are no other interesting messages even with debug logs enabled.
@vstax Thanks for trying. It turned out that the mnesia tables for the new delete-bucket implementation were not created in the case of version upgrades. We will push the fix later.
EDIT: The fix for your problem (https://github.com/leo-project/leofs/pull/785) has been merged into develop, so please give it a try.
@mocchira Thank you, this fix helped.
Regarding the delete bucket operation in general: I don't think it quite works on my system. The same test system: 3 nodes, N=2, D=1, 2+ million objects in the cluster (1M in the "bodytest" bucket, 1M in "bodycopy" and a small number in a few other buckets). It means that the delete bucket operation should remove roughly 650,000 objects on each node. The nodes are configured properly (num_of_mq_procs=4, debug logs disabled, disk watchdog disabled).
I execute "leofs-adm delete-bucket bodytest
[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node | state | timestamp
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55 | enqueuing | 2017-07-12 22:01:19 +0300
storage_1@192.168.3.54 | enqueuing | 2017-07-12 22:01:19 +0300
storage_0@192.168.3.53 | enqueuing | 2017-07-12 22:01:19 +0300
info log on manager_0:
[I] manager_0@192.168.3.50 2017-07-12 20:26:56.909647 +0300 1499880416 leo_manager_del_bucket_handler:handle_call/3 - enqueue 128 [{"bucket_name",<<"bodytest">>},{"node",'storage_0@192.168.3.53'}]
[I] manager_0@192.168.3.50 2017-07-12 20:26:56.910019 +0300 1499880416 leo_manager_del_bucket_handler:handle_call/3 - enqueue 128 [{"bucket_name",<<"bodytest">>},{"node",'storage_1@192.168.3.54'}]
[I] manager_0@192.168.3.50 2017-07-12 20:26:56.910233 +0300 1499880416 leo_manager_del_bucket_handler:handle_call/3 - enqueue 128 [{"bucket_name",<<"bodytest">>},{"node",'storage_2@192.168.3.55'}]
[I] manager_0@192.168.3.50 2017-07-12 20:26:58.848183 +0300 1499880418 leo_manager_del_bucket_handler:notify_fun/3 264 [{"node",'storage_0@192.168.3.53'},{"bucket_name",<<"bodytest">>}]
[I] manager_0@192.168.3.50 2017-07-12 20:26:58.853656 +0300 1499880418 leo_manager_del_bucket_handler:notify_fun/3 264 [{"node",'storage_1@192.168.3.54'},{"bucket_name",<<"bodytest">>}]
[I] manager_0@192.168.3.50 2017-07-12 20:26:58.860125 +0300 1499880418 leo_manager_del_bucket_handler:notify_fun/3 264 [{"node",'storage_2@192.168.3.55'},{"bucket_name",<<"bodytest">>}]
All storage nodes get 120-150% CPU load, rarely peaking to 200-230%, with very little disk load; soon I can see messages appearing in the leo_delete_dir_queue_1 queue on each node (only in that queue). There are some of the usual timeout errors for delete operations in the log files of the storage nodes.
At some point - a few minutes after the start of the operation - the number in leo_delete_dir_queue_1 stops growing. It's fixed at some number for each node (these are the current numbers for storage_0, storage_1 and storage_2):
leo_delete_dir_queue_1 | idling | 84294 | 1600 | 500 | deletion bucket #1
leo_delete_dir_queue_1 | idling | 93829 | 1600 | 500 | deletion bucket #1
leo_delete_dir_queue_1 | idling | 92810 | 1600 | 500 | deletion bucket #1
The load on the nodes is about the same, and the errors are about the same. Then the leo_async_deletion_queue queue starts to grow slowly: 1 message, then 2 messages, at some point 10 messages or so. It usually grows at a rate of a message every few minutes. At some point it started to grow faster, gaining roughly +10 messages every minute or two.
Early part of the error log on storage_1 (all the later parts, and the logs on the other nodes, look about the same):
[E] storage_1@192.168.3.54 2017-07-12 20:27:30.346997 +0300 1499880450 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:27:38.896010 +0300 1499880458 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:27:39.894537 +0300 1499880459 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[E] storage_1@192.168.3.54 2017-07-12 20:28:00.459117 +0300 1499880480 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:09.901206 +0300 1499880489 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/18/0e/37/180e37e1a6f6351bcaf29e1aaa8c5caa0c2c8a41952867f2bf0e0fbfcde5de3f65f42bf33a6d352dd637c28ec640c5ba00a2790000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:10.902269 +0300 1499880490 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/18/0e/37/180e37e1a6f6351bcaf29e1aaa8c5caa0c2c8a41952867f2bf0e0fbfcde5de3f65f42bf33a6d352dd637c28ec640c5ba00a2790000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:19.854854 +0300 1499880499 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:19.857589 +0300 1499880499 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_0,{get,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},30000]}}}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:20.853112 +0300 1499880500 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[E] storage_1@192.168.3.54 2017-07-12 20:28:30.617920 +0300 1499880510 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:40.913576 +0300 1499880520 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/18/72/63/18726304f172a426fb2262362d53d4e252711bc0adf43fd9ca7a1ee5baeb6631f3828a17041908227b31420fe5ecba670600100000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:41.908985 +0300 1499880521 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/18/72/63/18726304f172a426fb2262362d53d4e252711bc0adf43fd9ca7a1ee5baeb6631f3828a17041908227b31420fe5ecba670600100000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:51.712962 +0300 1499880531 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-12 20:28:52.712510 +0300 1499880532 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{cause,timeout}]
Here, the message that contains "leo_storage_replicator:replicate_fun/2" is unique and happened only once on a single node. The rest of the messages (from leo_storage_replicator:replicate/5, leo_storage_replicator:loop/6 and leo_storage_handler_del_directory:insert_messages/3) repeat all the time on all nodes.
Info log:
[I] storage_1@192.168.3.54 2017-07-12 20:26:58.857286 +0300 1499880418 leo_storage_handler_del_directory:run/5 141 [{"msg: enqueued",<<"bodytest">>}]
[I] storage_1@192.168.3.54 2017-07-12 20:27:30.347326 +0300 1499880450 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
[I] storage_1@192.168.3.54 2017-07-12 20:27:38.894033 +0300 1499880458 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,head},{key,<<"bodytest/00/c5/04/00c5046ad7cde46b63c917692ae3cd66012c240c0c8de3183ed665bde9fd89c1edbb9792c190fcd553788154a88e1f993e10000000000000.xz">>},{processing_time,30001}]
[I] storage_1@192.168.3.54 2017-07-12 20:28:00.459587 +0300 1499880480 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{method,fetch},{key,<<"bodytest">>},{processing_time,30001}]
There are no other types of messages in the info log except these from "leo_object_storage_event:handle_event/2".
The problem: nothing else happens. Executing "du" on a storage node under this load is painful, but it eventually completes:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
active number of objects: 1502466
total number of objects: 1509955
active size of objects: 213408666322
total size of objects: 213421617475
ratio of active size: 99.99%
last compaction start: ____-__-__ __:__:__
last compaction end: ____-__-__ __:__:__
These are the same numbers as before the "delete bucket" operation, or almost the same. In other words, no objects seem to have been deleted. With the old implementation, the "ratio of active size" started to drop as soon as the delete queue started processing (1-2 minutes after the start of the "delete bucket" operation); here, two hours have passed but the storage nodes show the same object counts. It's the same on all nodes.
The status of all queues is "idling". Somehow leo_async_deletion_queue managed to accumulate over 230 messages during this time:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54|grep delet
leo_async_deletion_queue | running | 232 | 1600 | 500 | async deletion of objs
leo_delete_dir_queue_1 | idling | 93828 | 1600 | 500 | deletion bucket #1
leo_delete_dir_queue_2 | idling | 0 | 1600 | 500 | deletion bucket #2
leo_delete_dir_queue_3 | idling | 0 | 1600 | 500 | deletion bucket #3
leo_delete_dir_queue_4 | idling | 0 | 1600 | 500 | deletion bucket #4
leo_delete_dir_queue_5 | idling | 0 | 1600 | 500 | deletion bucket #5
leo_delete_dir_queue_6 | idling | 0 | 1600 | 500 | deletion bucket #6
leo_delete_dir_queue_7 | idling | 0 | 1600 | 500 | deletion bucket #7
leo_delete_dir_queue_8 | idling | 0 | 1600 | 500 | deletion bucket #8
leo_req_delete_dir_queue | idling | 0 | 1600 | 500 | request removing directories
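(As a side note, the total backlog across all eight delete-dir queues can be checked by summing the third pipe-separated column of the mq-stats output - a rough one-liner, assuming the column layout shown above:)
/usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54 | grep leo_delete_dir_queue_ | awk -F'|' '{sum += $3} END {print sum}'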
Here are leo_doctor logs for storage_1: https://pastebin.com/mcm0AphX
@vstax I could find the culprit thanks to your further testing. https://github.com/leo-project/leo_object_storage/pull/10 should fix your problem, so please give it another try after the PR gets merged.
@mocchira
Thank you, this made a difference. After restarting the manager & storage nodes with the latest version, things started to move. (Btw, shutting down these nodes with ~90K messages in the leo_delete_dir_queue_1 queue took over a minute for each node, and starting up after that took 5 minutes or so. I don't think I've seen such shutdown & startup times even when I had problems with the watchdog before and was shutting down nodes that had similar amounts in "frozen" leo_async_deletion_queue queues. There is "alarm_handler: {set,{system_memory_high_watermark,[]}}" logged in erlang.log during these 5 minutes of startup.)
Anyhow, after startup an extra 70-80K messages appeared in leo_delete_dir_queue_1 and a few messages in leo_async_deletion_queue, e.g.:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
leo_async_deletion_queue | running | 9 | 1600 | 500 | async deletion of objs
leo_comp_meta_with_dc_queue | idling | 0 | 1600 | 500 | compare metadata w/remote-node
leo_delete_dir_queue_1 | idling | 146698 | 1600 | 500 | deletion bucket #1
(There is something strange, though: leo_async_deletion_queue keeps switching between running and idling here, but the number of messages in it stayed at 9 for the whole 10 minutes.)
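(A simple way to tell whether that queue is actually being consumed or just flapping is to sample mq-stats over time, e.g. with a loop like this:)
while true; do date; /usr/local/bin/leofs-adm mq-stats storage_0@192.168.3.53 | grep leo_async_deletion_queue; sleep 30; done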
I was able to execute "du" and the ratio of active size was dropping. However, ten minutes later I got here:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_1@192.168.3.54
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
leo_async_deletion_queue | idling | 0 | 1600 | 500 | async deletion of objs
leo_comp_meta_with_dc_queue | idling | 0 | 1600 | 500 | compare metadata w/remote-node
leo_delete_dir_queue_1 | idling | 0 | 1600 | 500 | deletion bucket #1
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
active number of objects: 1161526
total number of objects: 1509955
active size of objects: 165918882826
total size of objects: 213421617475
ratio of active size: 77.74%
last compaction start: ____-__-__ __:__:__
last compaction end: ____-__-__ __:__:__
[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node | node | state
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55 | enqueuing | 2017-07-13 18:22:59 +0300
storage_1@192.168.3.54 | enqueuing | 2017-07-13 18:22:59 +0300
storage_0@192.168.3.53 | enqueuing | 2017-07-13 18:22:59 +0300
All queues on all nodes are at 0. No more objects are being removed - however, the final ratio of active size is supposed to be in the 50-52% range, so not all objects were removed from the bucket.
And... nothing else happens. There is nothing in the error / info logs of the manager nodes.
Error log from storage node:
[W] storage_1@192.168.3.54 2017-07-13 16:22:03.473647 +0300 1499952123 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/61/57/98/61579844256bbe76f6f05e7655b486b9cb5df369f15a28a980389f71eacd8ec9afa42c36451a504ee3e09322352ffce1388c170100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-13 16:22:03.677581 +0300 1499952123 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/2d/2b/d8/2d2bd8b1f701626d70f4f253384d748463a48f2007e61da5c104630267c049908977838a1028e268132585cbd268f3ca03c4020100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-13 16:22:04.472227 +0300 1499952124 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/61/57/98/61579844256bbe76f6f05e7655b486b9cb5df369f15a28a980389f71eacd8ec9afa42c36451a504ee3e09322352ffce1388c170100000000.xz">>},{cause,timeout}]
[skipped]
[W] storage_1@192.168.3.54 2017-07-13 16:33:49.576822 +0300 1499952829 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/a0/e7/ea/a0e7eafe1849e039b8e210c8dad8a8ea1272b2464926cc2091e711e47ec9af8e72ef4d462d2dd36a5e424f0fb19ff5bf1c62050000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-13 16:34:01.228851 +0300 1499952841 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/0b/cc/6a/0bcc6aa1cd433eecf3ce9cd7fde3c0055a52aa53359ebfebd01816f69b80e9fcf43b03bf4ddb452ed6d6082b6ade0e7108d2000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-13 16:34:02.226010 +0300 1499952842 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/0b/cc/6a/0bcc6aa1cd433eecf3ce9cd7fde3c0055a52aa53359ebfebd01816f69b80e9fcf43b03bf4ddb452ed6d6082b6ade0e7108d2000000000000.xz">>},{cause,timeout}]
All the skipped messages look just like these two types. It's the same on all other nodes. There is nothing in the info logs after the startup messages. Here, 16:21 is when the nodes finished starting up and 16:34 is around the time the processing stopped.
Looking at top, I can see CPU usage going from near-zero to 20-30% or so for 1-2 seconds from time to time, then back at 0 for a few seconds. Just in case, leo_doctor report: https://pastebin.com/yk5SXS1N
I waited for 2 hours after that with no changes at all. Then I restarted one storage node (storage_1). I can see these short spikes in CPU usage again, but they are different: shorter (0.5-1 sec) and with higher usage, around 60-100% (yes, I know "top" can show quite unreliable values when it updates too fast, but I have another working LeoFS cluster to compare against, so I can see that this usage is higher than normal). Here is the leo_doctor report for this state: https://pastebin.com/iiiW8yWV
I waited 20 more minutes and nothing changed; I then restarted both manager nodes, and after that nothing changed either.
@vstax Thanks for the further testing.
Btw, shutting down these nodes with ~90K messages in the leo_delete_dir_queue_1 queue took over a minute for each node, and starting up after that took 5 minutes or so. I don't think I've seen such shutdown & startup times even when I had problems with the watchdog before and was shutting down nodes that had similar amounts in "frozen" leo_async_deletion_queue queues.
Since the previous version without my patch caused leo_delete_dir_queue_1 to keep generating lots of items for as long as leo_storage was running (to be precise, fetching all deleted objects and inserting them into leo_delete_dir_queue_1 happened many times behind the scenes), lots of tombstones were generated in leveldb, so the leveldb compaction process got triggered and caused shutdown/startup to take so much time, I guess.
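(If you want to confirm this on your side, comparing the on-disk size of the MQ data directory before and after a restart should show the leveldb bloat - the path below is just an assumed example, use whatever queue/work directory is configured in your leo_storage.conf:)
du -sh /usr/local/leofs/current/leo_storage/work/queue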
Also, since the other weird things you faced might have been caused by the previous bad behavior, could you please give it another try from a clean state if possible?
@mocchira Sure, I will (I'll roll back just the storage nodes to an older snapshot first; if that doesn't work, the whole cluster). But isn't the manager supposed to retry deletion of objects from the bucket now, including across storage node restarts and even a storage node losing its queue? Is there some simple way to diagnose why that doesn't happen (or happens, but the nodes refuse to accept the job)? I was pretty sure that restarting either the storage or manager nodes - or, in the worst case, both, like I did - should at the very least make it try to continue the delete.
Also, a somewhat random question about a (quite rare, but possible) case: a new node is introduced during deletion of a large bucket - the node is added to the cluster, then rebalance is launched. Will the objects that weren't yet deleted on the other nodes get pushed to this node as part of the rebalance operation and never be deleted afterwards, or will the "delete bucket" job be pushed to this node as well, so that even if it temporarily receives some of the objects, they will be removed in the end anyway?
EDIT: Rolling back the storage nodes (then installing the latest version and launching them) with the current version of the manager node (which shows the bucket deletion as "enqueuing") did not help; I tried restarting everything and removing all queues on the storage nodes, including the "delete bucket" queue, but still no changes. I think there is either some bug here (as I understand the intended implementation, the manager is supposed to re-queue the bucket deletion request since it wasn't completed and the storage nodes aren't currently deleting anything), or maybe I'm misunderstanding the logic? If a storage node isn't supposed to continue the deletion like that, shouldn't there be some knob on the manager node, like a "delete-deleted-bucket" command or something :) Because currently in this situation I can't re-create the bucket (it's forbidden) in order to delete it again, so it's not obvious how to get out of this state. I think - if the aim is a really reliable "delete bucket" operation - more experiments and testing are needed here, but first things first.
Rolling back everything and repeating the delete-bucket command: it works (well, mostly). Here is the error log from storage_0 - I filtered out all the numerous "Replicate failure" and "cause,timeout" messages:
[E] storage_0@192.168.3.53 2017-07-14 21:19:55.575122 +0300 1500056395 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53 2017-07-14 21:20:49.777239 +0300 1500056449 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53 2017-07-14 21:22:03.488293 +0300 1500056523 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53 2017-07-14 21:23:46.42101 +0300 1500056626 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53 2017-07-14 21:24:17.690158 +0300 1500056657 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_0@192.168.3.53 2017-07-14 21:24:17.698517 +0300 1500056657 leo_storage_handler_object:replicate_fun/3 1399 [{cause,"Could not get a metadata"}]
[E] storage_0@192.168.3.53 2017-07-14 21:24:17.702573 +0300 1500056657 leo_storage_handler_object:put/4 416 [{from,storage},{method,delete},{key,<<"bodytest/4b/46/38/4b463858a715ee23a484a29121097c6a9caa3b65af7f1c901de4850123545fb5fa3845145ead1229a748607308f2867628b8000000000000.xz">>},{req_id,0},{cause,"Could not get a metadata"}]
[E] storage_0@192.168.3.53 2017-07-14 21:26:20.829152 +0300 1500056780 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_0@192.168.3.53 2017-07-14 21:26:28.113795 +0300 1500056788 leo_storage_replicator:replicate_fun/2243 [{key,<<"bodytest/05/2b/30/052b30c599682a3b8b328114610e406a5458b05102cfdda2656f4095589ae588ad3773646d0c78d843b30a441e2f58409815010000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_5,{get,<<"bodytest/05/2b/30/052b30c599682a3b8b328114610e406a5458b05102cfdda2656f4095589ae588ad3773646d0c78d843b30a441e2f58409815010000000000.xz">>},30000]}}}]
[E] storage_0@192.168.3.53 2017-07-14 21:28:51.495146 +0300 1500056931 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53 2017-07-14 21:31:40.922056 +0300 1500057100 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_0@192.168.3.53 2017-07-14 21:34:25.719290 +0300 1500057265 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,40,178,86,218,64,6,172,208,7,135,201,36,182,30,23,89,109,0,0,0,133,98,111,100,121,116,101,115,116,47,99,53,47,56,54,47,54,100,47,99,53,56,54,54,100,102,51,49,102,56,53,97,54,53,50,99,54,54,99,49,101,101,49,98,49,55,99,54,48,98,54,55,57,54,56,52,100,53,49,51,57,51,54,102,48,102,102,51,101,101,52,50,98,101,97,54,52,100,98,52,57,98,53,53,102,53,49,98,100,99,56,54,49,101,53,55,98,101,99,55,54,98,48,101,52,50,100,53,53,101,54,99,101,100,52,48,48,51,50,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
The "leo_storage_handler_object:put/4" line is put of 0-byte object before deletion? So it happens during new delete-bucket operation as well, is this by design?
Similar log from storage_1:
[E] storage_1@192.168.3.54 2017-07-14 21:19:23.319297 +0300 1500056363 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54 2017-07-14 21:20:06.136898 +0300 1500056406 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54 2017-07-14 21:21:13.620240 +0300 1500056473 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54 2017-07-14 21:22:42.229278 +0300 1500056562 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54 2017-07-14 21:24:35.12173 +0300 1500056675 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54 2017-07-14 21:25:06.536219 +0300 1500056706 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54 2017-07-14 21:27:22.907614 +0300 1500056842 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_1@192.168.3.54 2017-07-14 21:27:27.993717 +0300 1500056847 leo_storage_replicator:replicate_fun/2243 [{key,<<"bodytest/6b/10/b9/6b10b9777d3084b24f98defbd4260844227241b532cfba26a3ea0e4172bef61ff6dcd44a6f77434f434d926565e0b7c9d97e580000000000.xz\n2">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_5,{get,<<"bodytest/6b/10/b9/6b10b9777d3084b24f98defbd4260844227241b532cfba26a3ea0e4172bef61ff6dcd44a6f77434f434d926565e0b7c9d97e580000000000.xz\n2">>},30000]}}}]
[W] storage_1@192.168.3.54 2017-07-14 21:27:38.231337 +0300 1500056858 leo_storage_handler_object:replicate_fun/3 1399 [{cause,"Could not get a metadata"}]
[E] storage_1@192.168.3.54 2017-07-14 21:27:38.231656 +0300 1500056858 leo_storage_handler_object:put/4 416 [{from,storage},{method,delete},{key,<<"bodytest/59/e9/af/59e9afbd96a2a8e107c7fcabf679a6b353c3c1c2d365567161cefef081bb390a684367eb4ecb1b8e6cd7a7e77a21f14f087d380100000000.xz\n1">>},{req_id,0},{cause,"Could not get a metadata"}]
[E] storage_1@192.168.3.54 2017-07-14 21:30:07.153217 +0300 1500057007 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_1@192.168.3.54 2017-07-14 21:33:12.536375 +0300 1500057192 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_1@192.168.3.54 2017-07-14 21:33:12.718884 +0300 1500057192 leo_storage_replicator:replicate_fun/2243 [{key,<<"bodytest/00/06/94/0006943d1713c0a231bcefe412a8dd9287bfde37f5177f02d13f35b8cd6507b4b074d7e8143b7e9b7ecc8e9ded4043dba013010000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_7,{get,<<"bodytest/00/06/94/0006943d1713c0a231bcefe412a8dd9287bfde37f5177f02d13f35b8cd6507b4b074d7e8143b7e9b7ecc8e9ded4043dba013010000000000.xz">>},30000]}}}]
[W] storage_1@192.168.3.54 2017-07-14 21:35:16.998265 +0300 1500057316 leo_storage_handler_directory:find_by_parent_dir/4 78 [{errors,[]},{bad_nodes,['storage_2@192.168.3.55','storage_1@192.168.3.54']},{cause,"Could not get metadatas"}]
[W] storage_1@192.168.3.54 2017-07-14 21:35:59.718192 +0300 1500057359 leo_storage_handler_directory:find_by_parent_dir/4 78 [{errors,[]},{bad_nodes,['storage_2@192.168.3.55']},{cause,"Could not get metadatas"}]
[E] storage_1@192.168.3.54 2017-07-14 21:37:59.412479 +0300 1500057479 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_1@192.168.3.54 2017-07-14 21:37:59.414564 +0300 1500057479 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,57,245,188,171,29,125,219,95,10,67,244,173,201,222,171,75,109,0,0,0,133,98,111,100,121,116,101,115,116,47,52,99,47,49,48,47,53,49,47,52,99,49,48,53,49,97,101,97,54,53,55,98,100,51,102,100,99,51,54,98,52,54,56,50,51,102,101,54,52,101,102,48,56,99,101,97,101,50,98,52,49,54,51,56,48,98,55,51,97,52,55,54,52,97,49,99,98,56,49,102,57,100,48,101,101,98,102,100,50,57,55,53,52,49,98,99,55,100,54,52,48,51,99,98,54,52,56,48,50,50,48,101,53,98,97,48,48,53,52,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_1@192.168.3.54 2017-07-14 21:37:59.420628 +0300 1500057479 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_1@192.168.3.54 2017-07-14 21:37:59.421313 +0300 1500057479 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.538.0> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_1@192.168.3.54 2017-07-14 21:40:38.343404 +0300 1500057638 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_1@192.168.3.54 2017-07-14 21:40:38.344038 +0300 1500057638 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_1@192.168.3.54 2017-07-14 21:40:38.344458 +0300 1500057638 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.14616.18> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
This one looks worse. Is it bad?
For storage_2:
[E] storage_2@192.168.3.55 2017-07-14 21:19:25.76735 +0300 1500056365 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:20:08.735703 +0300 1500056408 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:21:16.562799 +0300 1500056476 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:22:46.88858 +0300 1500056566 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:24:41.374639 +0300 1500056681 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:25:11.410660 +0300 1500056711 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:25:41.491085 +0300 1500056741 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:27:54.329783 +0300 1500056874 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:30:43.399117 +0300 1500057043 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[E] storage_2@192.168.3.55 2017-07-14 21:33:48.193631 +0300 1500057228 leo_storage_handler_del_directory:insert_messages/3 325 [{cause,error}]
[W] storage_2@192.168.3.55 2017-07-14 21:34:46.951687 +0300 1500057286 leo_storage_handler_directory:find_by_parent_dir/4 78 [{errors,[]},{bad_nodes,['storage_2@192.168.3.55','storage_1@192.168.3.54']},{cause,"Could not get metadatas"}]
[W] storage_2@192.168.3.55 2017-07-14 21:35:29.724027 +0300 1500057329 leo_storage_handler_directory:find_by_parent_dir/4 78 [{errors,[]},{bad_nodes,['storage_2@192.168.3.55','storage_1@192.168.3.54']},{cause,"Could not get metadatas"}]
[E] storage_2@192.168.3.55 2017-07-14 21:39:56.487206 +0300 1500057596 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_2@192.168.3.55 2017-07-14 21:39:56.493595 +0300 1500057596 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_2@192.168.3.55 2017-07-14 21:39:56.494257 +0300 1500057596 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.545.0> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated
"Replicate failure" errors look like this:
error.20170714.21.2:[E] storage_0@192.168.3.53 2017-07-14 21:21:23.29522 +0300 1500056483 leo_storage_handler_object:delete/3 569 [{from,gateway},{method,del},{key,<<"bodytest/c0/cc/45/c0cc45da65c101e4ab7e910895dc21c00cbdfbb7e0e4dc95ddab44809a28ee37185a026604138ee5a7c0cade881826d1c8035c0000000000.xz\n2">>},{req_id,0},{cause,"Replicate failure"}]
error.20170714.21.2:[E] storage_0@192.168.3.53 2017-07-14 21:22:53.353052 +0300 1500056573 leo_storage_handler_object:delete/3 569 [{from,gateway},{method,del},{key,<<"bodytest/57/55/49/575549c0cc19ff922b960336f0961070f272a57b8529e56b03b5efe1767805838d3b3a3c78f88d03f58af214c3244ece5ba62b0100000000.xz\n4">>},{req_id,0},{cause,"Replicate failure"}]
error.20170714.21.2:[E] storage_0@192.168.3.53 2017-07-14 21:22:55.907665 +0300 1500056575 leo_storage_handler_object:delete/3 569 [{from,gateway},{method,del},{key,<<"bodytest/fc/9c/02/fc9c02b27fee332a60f0e83b1d694a519af941484a0c177187843589c37986893b5a2bef48f99b38f2756285d09251cccff3040100000000.xz\n4">>},{req_id,0},{cause,"Replicate failure"}]
Is that ("from,gateway") message supposed to be like that? I did the bucket deletion directly from the manager, so the gateway wasn't involved, I think?
Now about problems:
[I] storage_1@192.168.3.54 2017-07-14 22:07:50.544929 +0300 1500059270 leo_storage_handler_del_directory:run/5 558 [{"msg: dequeued and removed (bucket)",<<"bodytest">>}]
however, it appeared only about 20 minutes after the bucket deletion seemingly finished. Also, most of that time the state was "monitoring" (not sure if there is a correlation, or whether it was like that the whole time, though).
[I] storage_2@192.168.3.55 2017-07-14 22:07:53.719258 +0300 1500059273 leo_storage_handler_del_directory:run/5 558 [{"msg: dequeued and removed (bucket)",<<"bodytest">>}]
however, the state never changed from "enqueuing" (which is shown with the current time)! The deletion process completed a long time ago, but the manager doesn't seem to get that information:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node | node | state
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55 | finished | 2017-07-14 22:07:53 +0300
storage_1@192.168.3.54 | finished | 2017-07-14 22:07:50 +0300
storage_0@192.168.3.53 | enqueuing | 2017-07-14 22:30:02 +0300
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|head -5
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
leo_async_deletion_queue | idling | 0 | 1600 | 500 | async deletion of objs
leo_comp_meta_with_dc_queue | idling | 0 | 1600 | 500 | compare metadata w/remote-node
leo_delete_dir_queue_1 | idling | 0 | 1600 | 500 | deletion bucket #1
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_2@192.168.3.55
active number of objects: 742982
total number of objects: 1493689
active size of objects: 105167655058
total size of objects: 210052283999
ratio of active size: 50.07%
last compaction start: ____-__-__ __:__:__
last compaction end: ____-__-__ __:__:__
This is the most serious problem here, I think.
Because of this, I'm unable to check whether 100% of the objects were removed or whether the errors and timeouts caused some to remain (the easiest way to check would be to re-create the bucket; "diagnose-start" will show me all these removed objects too, so I don't want to rely on it yet).
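(For a rough spot check, one could run "whereis" over a sample of known object names and see how many replicas are still not marked as deleted - sample_keys.txt below is a hypothetical file containing key names under the bucket:)
while read key; do /usr/local/bin/leofs-adm whereis "bodytest/${key}"; done < sample_keys.txt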
@vstax Thanks for trying.
But isn't the manager supposed to retry deletion of objects from the bucket now, including across storage node restarts and even a storage node losing its queue? Is there some simple way to diagnose why that doesn't happen (or happens, but the nodes refuse to accept the job)? I was pretty sure that restarting either the storage or manager nodes - or, in the worst case, both, like I did - should at the very least make it try to continue the delete.
Yes, it's supposed to retry. However, the first 1.3.5-rc3 (without my patch) had a serious problem that might cause the internal state to become inconsistent, so I'd recommend trying from a clean state to be safe. Please let me know if you face the same problem when using 1.3.5-rc3 with my patch and data created with >= that version.
Also, a somewhat random question about a (quite rare, but possible) case: a new node is introduced during deletion of a large bucket - the node is added to the cluster, then rebalance is launched. Will the objects that weren't yet deleted on the other nodes get pushed to this node as part of the rebalance operation and never be deleted afterwards, or will the "delete bucket" job be pushed to this node as well, so that even if it temporarily receives some of the objects, they will be removed in the end anyway?
Good question. When a rebalance is launched while delete-bucket is ongoing, any objects that belonged to a deleted bucket are not transferred to the new node, so there is no need to delete anything on the new node.
Rolling back the storage nodes (then installing the latest version and launching them) with the current version of the manager node (which shows the bucket deletion as "enqueuing") did not help; I tried restarting everything and removing all queues on the storage nodes, including the "delete bucket" queue, but still no changes. I think there is either some bug here (as I understand the intended implementation, the manager is supposed to re-queue the bucket deletion request since it wasn't completed and the storage nodes aren't currently deleting anything), or maybe I'm misunderstanding the logic? If a storage node isn't supposed to continue the deletion like that, shouldn't there be some knob on the manager node, like a "delete-deleted-bucket" command or something :) Because currently in this situation I can't re-create the bucket (it's forbidden) in order to delete it again, so it's not obvious how to get out of this state. I think - if the aim is a really reliable "delete bucket" operation - more experiments and testing are needed here, but first things first.
Got it. Since there are records in mnesia on the manager(s) managing the delete-bucket stats (pending, enqueuing, monitoring, finished), the usual rollback strategy doesn't work if there are any differences between the manager(s) and the storage node(s). So, as you said, providing a command like the "delete-deleted-bucket" you suggested might be needed to be safe; we'll consider adding such a command under an appropriate name.
Rolling back everything and repeating delete-bucket command: it works (well, mostly).
Could you elaborate on what "repeating" means?
The "leo_storage_handler_object:put/4" line is put of 0-byte object before deletion? So it happens during new delete-bucket operation as well, is this by design?
put/4 can be called with the delete flag set to true when replicating a delete request on a remote node, so yes, it is by design.
This one looks worse. Is it bad?
Yes, it seems to be bad. What really matters is that leo_storage_handler_del_directory:insert_messages failed many times. (This function should not fail under normal circumstances, so I will vet how/when it can happen.)
Is that ("from,gateway") message supposed to be like that? I did the bucket deletion directly from the manager, so the gateway wasn't involved, I think?
Yes, that kind of error should not appear in the log; I will vet.
For two nodes - storage_0 and storage_1 - the bucket deletion stats went "enqueuing->monitoring->finished" fine; however, it took a really long time to switch from "monitoring" to "finished". E.g. on storage_0 the deletion queue was empty at around 21:40, but the state switched from "monitoring" to "finished" only about 25 minutes later. Also, the message "dequeued and removed" never appeared in the log file. Not saying that this is a serious problem, but mentioning it just in case.
Taking time to switch from "monitoring" to "finished" is expected. Let me explain what each state means.
Also, the message "dequeued and removed" never appeared in log file. Not saying that this is a serious problem, but mentioning this just in case
It's problematic.
however, the state never changed from "enqueuing" (which is shown with the current time)! The deletion process completed a long time ago, but the manager doesn't seem to get that information:
The same root problem could cause both this behavior and the "dequeued and removed" message above never appearing. I will vet in depth.
@vstax Found the other problem that causes leo_storage_handler_del_directory:insert_messages to fail. Please give it another try from a clean state once https://github.com/leo-project/leo_backend_db/pull/11 gets merged.
Note: the other fixes I mentioned in the above comment are WIP (I will send a PR tomorrow).
@vstax The other issues described in the above comment will be fixed once https://github.com/leo-project/leofs/pull/786 gets merged.
Got it. Since there are records in mnesia on the manager(s) managing the delete-bucket stats (pending, enqueuing, monitoring, finished), the usual rollback strategy doesn't work if there are any differences between the manager(s) and the storage node(s). So, as you said, providing a command like the "delete-deleted-bucket" you suggested might be needed to be safe; we'll consider adding such a command under an appropriate name.
reset-delete-bucket-stats added.
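Usage should be along the lines of the following, run against the bucket whose delete-bucket state you want to clear (exact argument syntax per leofs-adm's help):
/usr/local/bin/leofs-adm reset-delete-bucket-stats bodytest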
Is that ("from,gateway") message supposed to be like that? I did the bucket deletion directly from the manager, so the gateway wasn't involved, I think?
Yes, that kind of error should not appear in the log; I will vet.
{from, storage} or {from, leo_mq} is now output instead of {from, gateway}.
The same root problem could cause both this behavior and the "dequeued and removed" message above never appearing. I will vet in depth.
Fixed: the finish notification to the managers is now guaranteed to happen (a strict retry mechanism is implemented).
Please check these improvements along with the fix on leo_backend_db.
@mocchira Thank you for your support.
Rolling back everything and repeating the delete-bucket command: it works (well, mostly).
Could you elaborate on what "repeating" means?
Just doing "delete-bucket" here; "repeating" as in repeating the experiment because I rolled back managers as well to the state when "delete-bucket" was never executed.
I've tried deleting the same bucket with these changes, and I don't think the fix for the "finish notification" works; the state remains "enqueuing" even 1 hour after the delete operation has finished. This happens on all storage nodes:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node | node | state
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55 | enqueuing | 2017-07-19 20:27:27 +0300
storage_1@192.168.3.54 | enqueuing | 2017-07-19 20:27:26 +0300
storage_0@192.168.3.53 | enqueuing | 2017-07-19 20:27:28 +0300
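(This can be watched by simply re-running delete-bucket-stats periodically, e.g.:)
watch -n 60 /usr/local/bin/leofs-adm delete-bucket-stats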
Error log on storage_0:
[W] storage_0@192.168.3.53 2017-07-19 19:10:49.513796 +0300 1500480649 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/00/e4/ce/00e4ce45fd1c5de4122221d44289f4ace93dc0d046fead4f4d3549b7b756af04621a4ab684a1c7db7b8d5f017555484d90d7000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:10:50.491836 +0300 1500480650 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/00/e4/ce/00e4ce45fd1c5de4122221d44289f4ace93dc0d046fead4f4d3549b7b756af04621a4ab684a1c7db7b8d5f017555484d90d7000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:11:29.514890 +0300 1500480689 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/01/42/6d/01426d00bc6a92b42e0399efbced9661579b45cc7001cd89db32d95c9930a2dd0b8f95122da45bf9be35de95dd816c5df2f6000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:11:30.479497 +0300 1500480690 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/01/42/6d/01426d00bc6a92b42e0399efbced9661579b45cc7001cd89db32d95c9930a2dd0b8f95122da45bf9be35de95dd816c5df2f6000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:11:52.677341 +0300 1500480712 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz\n4">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:11:52.686279 +0300 1500480712 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:11:53.675244 +0300 1500480713 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:11:53.677522 +0300 1500480713 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz\n4">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:11:53.677857 +0300 1500480713 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:01.484527 +0300 1500480721 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/02/a2/84/02a28429564d4d1430fdfb6603516bfdae6ad7af6b5c53115dca766df28cb4e1c950bf84d35890e1bda40b396833b50dbc76010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:07.455874 +0300 1500480727 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/02/a2/84/02a28429564d4d1430fdfb6603516bfdae6ad7af6b5c53115dca766df28cb4e1c950bf84d35890e1bda40b396833b50dbc76010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:08.404983 +0300 1500480728 leo_storage_handler_object:replicate_fun/3 1406 [{cause,"Could not get a metadata"}]
[E] storage_0@192.168.3.53 2017-07-19 19:12:08.405934 +0300 1500480728 leo_storage_handler_object:put/4 423 [{from,storage},{method,delete},{key,<<"bodytest/0f/1b/2f/0f1b2fbb7a99d5d05ab57604781b3b9aef44f62a7666b293c97bc1814c23025174f50dfdbdb7ae67118dbb3054c4dd71e87ca30000000000.xz\n2">>},{req_id,0},{cause,"Could not get a metadata"}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:08.656845 +0300 1500480728 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/01/a7/e7/01a7e77059f793397d743aa7d6564c1bd8455087dfa98d81d37c3c68f32a33157665a8c7a2a83006b51ae7774bc387d40cec000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:09.633137 +0300 1500480729 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/01/a7/e7/01a7e77059f793397d743aa7d6564c1bd8455087dfa98d81d37c3c68f32a33157665a8c7a2a83006b51ae7774bc387d40cec000000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:40.635781 +0300 1500480760 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/01/79/02/0179021507a19bc291d357e2af7c8b384ed202399013a013cc4fe4ffdcba45c0ce4330af290e19ff07c382a74d0d403628fe010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:52.98035 +0300 1500480772 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/01/79/02/0179021507a19bc291d357e2af7c8b384ed202399013a013cc4fe4ffdcba45c0ce4330af290e19ff07c382a74d0d403628fe010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:53.43253 +0300 1500480773 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_7,{get,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},30000]}}}]
[W] storage_0@192.168.3.53 2017-07-19 19:12:54.43895 +0300 1500480774 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:13:23.49232 +0300 1500480803 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:13:24.113435 +0300 1500480804 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:13:25.50324 +0300 1500480805 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/07/53/51/0753515f78df7f271a5e61c20bcd36a1a8d600cd0c592dfb875de2d4f1aedb207b80a43cf724051b6552bb6e539e9afc0027020000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:13:34.913529 +0300 1500480814 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/07/53/51/0753515f78df7f271a5e61c20bcd36a1a8d600cd0c592dfb875de2d4f1aedb207b80a43cf724051b6552bb6e539e9afc0027020000000000.xz">>},{cause,timeout}]
[W] storage_0@192.168.3.53 2017-07-19 19:13:34.914408 +0300 1500480814 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/00/56/99/005699470708547cd51606f1e613bc03011c0c9ee072e7af9a3f2d8645a555fd05c5580f1263b1404df12e3809cd2d60e422010000000000.xz">>},{cause,timeout}]
on storage_1:
[W] storage_1@192.168.3.54 2017-07-19 19:09:50.480513 +0300 1500480590 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:09:50.486326 +0300 1500480590 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:09:51.476814 +0300 1500480591 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:09:51.477865 +0300 1500480591 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:09:51.478555 +0300 1500480591 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-19 19:10:31.563548 +0300 1500480631 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/06/be/2c/06be2cb88c9fb144797158d9b4dceaa6a7985bb629e4fbb6eda7ef96916aee78fc2463cc1b1f8cd8f78790919fae115314ac050000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:10:47.178852 +0300 1500480647 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/06/be/2c/06be2cb88c9fb144797158d9b4dceaa6a7985bb629e4fbb6eda7ef96916aee78fc2463cc1b1f8cd8f78790919fae115314ac050000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:07.928271 +0300 1500480667 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:12.860699 +0300 1500480672 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:12.865019 +0300 1500480672 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:13.859764 +0300 1500480673 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:13.860610 +0300 1500480673 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:13.861225 +0300 1500480673 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:28.692642 +0300 1500480688 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/0f/c3/90/0fc390382e880d2abde1858f795349a3a0ec549cc61e59009b84433b7ec4d98a771000d7e4acf3cb827f4c90942f21e5989b040000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:47.549966 +0300 1500480707 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/05/e1/67/05e1671b6b5a22a1d3d19f6b635298859fb4ba66a08b31ff251c9004acc8096ea20b01c55bb8fe68fac3f9a1b41cd5dec580000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:48.547618 +0300 1500480708 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/05/e1/67/05e1671b6b5a22a1d3d19f6b635298859fb4ba66a08b31ff251c9004acc8096ea20b01c55bb8fe68fac3f9a1b41cd5dec580000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:51.430967 +0300 1500480711 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz\n1">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:51.431480 +0300 1500480711 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:52.418610 +0300 1500480712 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:52.430601 +0300 1500480712 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz\n1">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:52.430914 +0300 1500480712 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:55.542012 +0300 1500480715 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:55.543962 +0300 1500480715 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:56.542039 +0300 1500480716 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:56.542573 +0300 1500480716 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:11:56.542915 +0300 1500480716 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:34.366773 +0300 1500480754 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/02/c4/1b/02c41b78d3d44275fd5d817387b5c21bb1f5ffb907af90c20042428c1671a36357253eafe70aaaa0bb64876d52c7b1e3580b040000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:34.737725 +0300 1500480754 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:34.742791 +0300 1500480754 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:35.738158 +0300 1500480755 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:35.738436 +0300 1500480755 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:35.738732 +0300 1500480755 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:44.985651 +0300 1500480764 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz\n4">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:44.987551 +0300 1500480764 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:45.985933 +0300 1500480765 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:45.986515 +0300 1500480765 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz\n4">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:45.986925 +0300 1500480765 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-19 19:12:52.102327 +0300 1500480772 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/02/c4/1b/02c41b78d3d44275fd5d817387b5c21bb1f5ffb907af90c20042428c1671a36357253eafe70aaaa0bb64876d52c7b1e3580b040000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:13:19.736093 +0300 1500480799 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:13:27.36886 +0300 1500480807 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:13:27.38654 +0300 1500480807 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:13:28.36973 +0300 1500480808 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz\n2">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:13:28.37329 +0300 1500480808 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz">>},{cause,timeout}]
[W] storage_1@192.168.3.54 2017-07-19 19:13:28.37599 +0300 1500480808 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-19 19:13:34.918759 +0300 1500480814 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/00/e6/66/00e666b4e8c62d42ccabb1b05c8111df33bf4c8a042881df5fcd33bde3d709f41c219f7cdd7ed58ec9947657201c5afa0010000000000000.xz">>},{cause,timeout}]
[E] storage_1@192.168.3.54 2017-07-19 19:13:42.151582 +0300 1500480822 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,17,166,100,223,94,65,242,145,215,154,235,43,100,217,227,139,109,0,0,0,133,98,111,100,121,116,101,115,116,47,53,53,47,57,102,47,97,48,47,53,53,57,102,97,48,51,51,98,52,49,54,99,57,97,48,57,56,57,50,52,53,102,102,99,52,48,48,57,52,57,54,53,49,50,54,99,55,51,51,102,102,49,50,57,98,57,97,50,97,99,51,50,52,48,52,98,100,50,55,98,49,48,52,53,49,102,48,57,54,55,54,48,49,49,55,50,99,53,55,48,48,97,57,48,50,98,54,102,52,53,53,50,102,100,51,56,56,52,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
on storage_2:
[W] storage_2@192.168.3.55 2017-07-19 19:10:28.575244 +0300 1500480628 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz\n2">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:28.579190 +0300 1500480628 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:29.526649 +0300 1500480629 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:29.528093 +0300 1500480629 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz\n3">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:29.571610 +0300 1500480629 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:29.574350 +0300 1500480629 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz\n2">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:29.574791 +0300 1500480629 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:30.514094 +0300 1500480630 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:30.527017 +0300 1500480630 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz\n3">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:30.527362 +0300 1500480630 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:32.575664 +0300 1500480632 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/1a/47/31/1a47310303359a0d6341c0486f945c79977aee816b95ee2abb54eb94aaa130a3695993e4c22be6d7954b49fd01eb5523004a020000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:10:47.179175 +0300 1500480647 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/1a/47/31/1a47310303359a0d6341c0486f945c79977aee816b95ee2abb54eb94aaa130a3695993e4c22be6d7954b49fd01eb5523004a020000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:11:06.923457 +0300 1500480666 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/01/03/fe/0103fef5e9c71fd48065caa8f376e6c4920b8396c87019151db19479c313e13fc7e50b8856c59d674221b5d2aeade507b04c010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:11:28.693260 +0300 1500480688 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/01/03/fe/0103fef5e9c71fd48065caa8f376e6c4920b8396c87019151db19479c313e13fc7e50b8856c59d674221b5d2aeade507b04c010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:11:48.595501 +0300 1500480708 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/05/2b/30/052b30c599682a3b8b328114610e406a5458b05102cfdda2656f4095589ae588ad3773646d0c78d843b30a441e2f58409815010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:12:07.456616 +0300 1500480727 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/05/2b/30/052b30c599682a3b8b328114610e406a5458b05102cfdda2656f4095589ae588ad3773646d0c78d843b30a441e2f58409815010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:12:34.88932 +0300 1500480754 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/01/79/02/0179021507a19bc291d357e2af7c8b384ed202399013a013cc4fe4ffdcba45c0ce4330af290e19ff07c382a74d0d403628fe010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:12:34.595173 +0300 1500480754 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:12:34.605214 +0300 1500480754 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz\n1">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:12:35.591132 +0300 1500480755 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:12:35.605678 +0300 1500480755 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz\n1">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:12:35.606239 +0300 1500480755 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55 2017-07-19 19:12:52.98514 +0300 1500480772 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/01/79/02/0179021507a19bc291d357e2af7c8b384ed202399013a013cc4fe4ffdcba45c0ce4330af290e19ff07c382a74d0d403628fe010000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:20.645636 +0300 1500480800 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/0a/58/4d/0a584d8973d607dec93b1e71b4c3586c3fb15746de8a08dfc70a5bc4993c094629cf1077d7a1ecb6406f4f2ba8ca514d0600100000000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:26.602468 +0300 1500480806 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz\n3">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:26.603372 +0300 1500480806 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:26.716346 +0300 1500480806 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz\n1">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:26.717134 +0300 1500480806 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:27.596297 +0300 1500480807 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:27.601163 +0300 1500480807 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz\n3">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:27.601912 +0300 1500480807 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:27.715564 +0300 1500480807 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:27.716051 +0300 1500480807 leo_storage_replicator:replicate/5 123 [{method,delete},{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz\n1">>},{cause,timeout}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:27.716423 +0300 1500480807 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/fa/52/75/fa5275f6be4d2615b754f06e7cf228db7e97dc05a123a953bfe6c2a1d0137c7c81be17f9838f8d41d0ec99e1c44106e600580a0100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55 2017-07-19 19:13:34.914364 +0300 1500480814 leo_storage_replicator:loop/6 216 [{method,delete},{key,<<"bodytest/0a/58/4d/0a584d8973d607dec93b1e71b4c3586c3fb15746de8a08dfc70a5bc4993c094629cf1077d7a1ecb6406f4f2ba8ca514d0600100000000000.xz">>},{cause,timeout}]
[E] storage_2@192.168.3.55 2017-07-19 19:13:37.717090 +0300 1500480817 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,16,177,15,204,176,30,198,228,202,120,84,124,144,181,60,18,109,0,0,0,133,98,111,100,121,116,101,115,116,47,102,97,47,53,50,47,55,53,47,102,97,53,50,55,53,102,54,98,101,52,100,50,54,49,53,98,55,53,52,102,48,54,101,55,99,102,50,50,56,100,98,55,101,57,55,100,99,48,53,97,49,50,51,97,57,53,51,98,102,101,54,99,50,97,49,100,48,49,51,55,99,55,99,56,49,98,101,49,55,102,57,56,51,56,102,56,100,52,49,100,48,101,99,57,57,101,49,99,52,52,49,48,54,101,54,48,48,53,56,48,97,48,49,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
(This leo_delete_dir_queue_1_1 line is present only on storage_1 and storage_2, but all nodes failed to switch from "enqueuing" to "monitoring".)
There is nothing in the info logs except for the original "msg: enqueued" at the start and lots of {cause,"slow operation"},{method,head}. The "dequeued and removed" message is not present on any node.
I've deleted the second bucket after that and the result was the same: no "dequeued and removed" message on any node, and the state at the manager is stuck at "enqueuing".
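For reference, the check boils down to something like this (the log file paths below are assumptions, adjust them to your installation):
/usr/local/bin/leofs-adm delete-bucket-stats
grep 'msg: enqueued' /usr/local/leofs/leo_storage/log/app/info.*        # expected once per node
grep 'dequeued and removed' /usr/local/leofs/leo_storage/log/app/info.* # missing on every node in this run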
There is another problem which I'm trying to confirm (maybe you could look at it from the code side as well?); it's quite minor, but still. This is the output of the "du" command right after the deletion of the second bucket finished:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
active number of objects: 0
total number of objects: 1354898
active size of objects: 0
total size of objects: 191667623380
ratio of active size: 0.0%
last compaction start: ____-__-__ __:__:__
last compaction end: ____-__-__ __:__:__
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
active number of objects: 2035
total number of objects: 1509955
active size of objects: 235607379
total size of objects: 213421617475
ratio of active size: 0.11%
last compaction start: ____-__-__ __:__:__
last compaction end: ____-__-__ __:__:__
This is after compaction:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_0@192.168.3.53
active number of objects: 0
total number of objects: 0
active size of objects: 0
total size of objects: 0
ratio of active size: 0%
last compaction start: 2017-07-19 21:15:41 +0300
last compaction end: 2017-07-19 21:24:38 +0300
[root@leo-m0 ~]# /usr/local/bin/leofs-adm du storage_1@192.168.3.54
active number of objects: 0
total number of objects: 0
active size of objects: 0
total size of objects: 0
ratio of active size: 0%
last compaction start: 2017-07-19 21:17:59 +0300
last compaction end: 2017-07-19 21:27:16 +0300
It's as if the counters for storage_1 got broken in the process of deletion. I'm trying to verify right now whether they were correct from the start.
On a positive side: the performance is way higher compared to the old implementation, e.g. it takes less than 10 minutes to enqueue 600,000 deletes and 20 minutes or so to actually delete that data; deletes are mostly consumed at a steady 400-500 messages per second (which doesn't seem to depend on the "mq interval" parameter). Also, 100% of objects were removed from both buckets.
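(As a rough sanity check on those numbers: 600,000 messages at a steady 400-500 messages per second works out to 1,200-1,500 seconds, i.e. about 20-25 minutes, which matches the ~20 minutes observed for the actual deletion.)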
EDIT: as an experiment, on the same cluster with no objects at all (deletes finished, compaction performed, /mnt/avs occupies less than 1 MB, and all queues are empty as well) I've executed
[root@leo-m0 ~]# /usr/local/bin/leofs-adm reset-delete-bucket-stats bodytest
[root@leo-m0 ~]# /usr/local/bin/leofs-adm reset-delete-bucket-stats body
then on my system
$ s3cmd mb s3://bodytest
$ s3cmd rb s3://bodytest
The manager now shows
[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node | state | timestamp
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55 | enqueuing | 2017-07-19 21:49:34 +0300
storage_1@192.168.3.54 | enqueuing | 2017-07-19 21:49:34 +0300
storage_0@192.168.3.53 | enqueuing | 2017-07-19 21:49:34 +0300
and that's it. Nothing in the logs on the storage nodes at all, not even the "msg: enqueued" line! Also, I see minor spikes in CPU usage on the nodes from time to time; here is the leo_doctor log: https://pastebin.com/1KNpix5n
@vstax Thanks for testing.
Just doing "delete-bucket" here; "repeating" as in repeating the experiment because I rolled back managers as well to the state when "delete-bucket" was never executed.
Got it.
I've tried deleting the same bucket with these changes, and I don't think that fix for "finish notification" works; the state remains "enqueuing" even 1 hour after the delete operation has finished. This happens for all storage nodes: and that's it. Nothing in the logs on the storage nodes at all, not even the "msg: enqueued" line! Also, I see minor spikes in CPU usage on the nodes from time to time; here is the leo_doctor log: https://pastebin.com/1KNpix5n
Since there is a non-backward-compatible change around delete-stats handling on leo_storage, please also wipe out del_dir_queue on every leo_storage and try again with a clean state. That should work for you; if not, please let me know.
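A rough sketch of that wipe-out per storage node follows; the paths below are assumptions based on a default-style install (only the del_dir part is fixed), so adjust them to your layout:
LEO_STORAGE_HOME=/usr/local/leofs/current/leo_storage   # assumption: adjust to your installation
"$LEO_STORAGE_HOME"/bin/leo_storage stop
rm -rf "$LEO_STORAGE_HOME"/work/queue/del_dir           # queue/del_dir relative to the work dir
"$LEO_STORAGE_HOME"/bin/leo_storage start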
On a positive side: the performance is way higher compared to the old implementation, e.g. it takes less than 10 minutes to enqueue 600,000 deletes and 20 minutes or so to actually delete that data; deletes are mostly consumed at a steady 400-500 messages per second (which doesn't seem to depend on the "mq interval" parameter). Also, 100% of objects were removed from both buckets.
Great to hear that :)
@mocchira
Thank you for the suggestion. However, it looks like I'm still having the same problem. Deleting queue/del_dir definitely does help: for example, in the last experiment - when I was trying to create and remove a bucket on nodes without data - I started to get "enqueued" and "dequeued" messages and the state at the manager changed reliably. However, in the original experiment, where I delete the bigger bucket, the problem remains. I've removed queue/del_dir before starting the nodes and deleted the bucket; here are the details.
Logs for storage_0, info and error (here and for the other nodes I removed all "slow operation" lines from the info logs and the "{method,delete} .. {cause,timeout}" lines from the error logs, as there were too many of them):
[I] storage_0@192.168.3.53 2017-07-20 19:26:43.86057 +0300 1500568003 leo_storage_handler_del_directory:run/5 141 [{"msg: enqueued",<<"bodytest">>}]
[I] storage_0@192.168.3.53 2017-07-20 20:05:34.652878 +0300 1500570334 leo_storage_handler_del_directory:run/5 575 [{"msg: dequeued and removed (bucket)",<<"bodytest">>}]
[W] storage_0@192.168.3.53 2017-07-20 19:30:18.378932 +0300 1500568218 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/00/da/65/00da6530e7432014585caaeae4af62590edfa53204fa5a0d9bc93fc61012eef837b7a62455faa9482aba94bce7b5012a805a5d0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_0@192.168.3.53 2017-07-20 19:30:27.509573 +0300 1500568227 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/70/01/39/700139f980d397cff8b770b72b53ae91474a7abcb59df0b2d246bbde6178b3b7a173e2cc6b16cd13922f61a5a12609404003630000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[E] storage_0@192.168.3.53 2017-07-20 19:33:21.198370 +0300 1500568401 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:33:21.218434 +0300 1500568401 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:33:21.219006 +0300 1500568401 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.526.0> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:33:21.247327 +0300 1500568401 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_3,{remove,<<131,104,2,110,16,0,10,212,194,135,40,132,128,5,96,246,93,208,19,74,201,203,109,0,0,0,133,98,111,100,121,116,101,115,116,47,50,57,47,97,55,47,57,98,47,50,57,97,55,57,98,101,50,52,55,102,49,52,50,97,50,55,49,51,98,98,56,101,101,57,54,57,98,57,55,52,98,56,55,98,101,53,53,48,100,54,102,55,101,100,52,101,57,101,49,50,97,100,56,54,100,56,50,102,54,98,51,50,48,51,50,54,50,102,100,48,54,98,49,100,97,99,49,52,51,100,98,97,52,57,99,57,51,52,52,100,52,57,57,50,57,102,56,50,101,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:33:42.271389 +0300 1500568422 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:33:42.272357 +0300 1500568422 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:33:42.272940 +0300 1500568422 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.24218.8> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:34:03.565223 +0300 1500568443 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:34:03.565692 +0300 1500568443 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:34:03.566404 +0300 1500568443 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.7324.9> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:34:03.661557 +0300 1500568443 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_2,{remove,<<131,104,2,110,16,0,16,156,221,202,70,24,61,7,88,75,118,163,245,145,8,86,109,0,0,0,133,98,111,100,121,116,101,115,116,47,49,57,47,49,49,47,102,50,47,49,57,49,49,102,50,99,53,53,100,101,102,54,97,54,50,99,53,97,51,102,97,49,98,98,56,100,98,102,97,101,54,56,54,56,99,101,99,53,100,97,53,54,57,98,99,49,99,99,51,53,97,97,50,102,49,50,49,50,98,53,56,48,97,101,56,49,54,98,52,101,56,51,97,100,100,51,100,56,99,101,55,99,56,56,57,57,97,48,48,49,101,51,56,97,53,52,52,57,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:34:16.569535 +0300 1500568456 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:34:16.570040 +0300 1500568456 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:34:16.570461 +0300 1500568456 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.22497.9> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:34:16.576021 +0300 1500568456 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,21,138,133,115,192,150,177,167,10,16,74,198,96,170,11,205,109,0,0,0,133,98,111,100,121,116,101,115,116,47,100,100,47,98,55,47,97,100,47,100,100,98,55,97,100,49,98,97,48,49,48,48,53,53,50,49,48,49,48,100,98,51,101,51,48,97,100,99,57,55,50,102,49,98,55,51,55,54,102,49,97,51,52,56,100,55,48,55,51,97,98,51,101,55,99,52,54,50,57,102,49,51,101,49,48,97,97,102,50,98,51,55,101,54,56,102,48,49,100,49,52,53,99,56,99,48,49,99,97,49,98,102,100,99,98,48,48,102,48,48,51,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:34:47.99576 +0300 1500568487 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:34:47.100200 +0300 1500568487 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:34:47.100717 +0300 1500568487 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.31476.9> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:35:00.102687 +0300 1500568500 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:35:00.103131 +0300 1500568500 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:35:00.103849 +0300 1500568500 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.21760.10> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:35:00.104430 +0300 1500568500 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,24,7,173,240,79,26,10,158,192,167,84,162,9,108,241,135,109,0,0,0,133,98,111,100,121,116,101,115,116,47,100,48,47,50,51,47,48,99,47,100,48,50,51,48,99,97,49,54,53,56,50,99,100,50,57,102,101,55,99,52,97,50,98,54,97,55,56,101,49,57,54,102,102,49,55,56,99,98,55,52,49,99,50,56,99,53,99,48,99,50,52,98,102,99,52,98,52,98,100,100,100,50,100,102,53,100,52,50,52,99,54,98,99,50,99,99,100,52,99,99,102,52,50,52,57,52,55,99,48,52,54,48,97,52,51,53,56,55,52,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:35:41.305520 +0300 1500568541 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:35:41.306059 +0300 1500568541 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:35:41.306489 +0300 1500568541 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.32632.10> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:35:41.381384 +0300 1500568541 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_4,{remove,<<131,104,2,110,16,0,19,5,174,44,231,161,153,57,126,104,105,200,4,187,126,79,109,0,0,0,133,98,111,100,121,116,101,115,116,47,101,101,47,52,49,47,100,97,47,101,101,52,49,100,97,101,97,99,53,57,50,54,55,101,51,49,52,98,53,54,48,49,51,55,98,49,55,56,57,55,100,48,56,49,54,53,54,52,54,53,53,52,56,97,48,51,52,100,55,56,52,57,100,49,50,55,50,56,98,99,57,99,55,55,97,97,97,101,101,48,49,52,52,101,98,54,51,51,100,102,54,50,101,53,98,49,51,101,49,97,99,50,102,97,98,99,48,50,102,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:36:20.311567 +0300 1500568580 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:36:20.312132 +0300 1500568580 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:36:20.312568 +0300 1500568580 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.378.12> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:36:41.793761 +0300 1500568601 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:36:41.795603 +0300 1500568601 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:36:41.795910 +0300 1500568601 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.29776.12> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:36:54.805977 +0300 1500568614 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:36:54.806443 +0300 1500568614 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:36:54.807702 +0300 1500568614 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.11474.13> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:36:54.808481 +0300 1500568614 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,29,18,190,9,50,26,84,237,156,114,100,101,24,63,18,136,109,0,0,0,133,98,111,100,121,116,101,115,116,47,50,51,47,99,98,47,56,50,47,50,51,99,98,56,50,102,57,100,50,53,52,101,102,48,98,98,48,53,53,53,54,50,53,57,101,97,53,55,97,51,50,55,102,55,51,97,102,54,54,49,52,101,57,53,57,50,98,101,99,57,54,100,52,56,57,50,53,53,52,99,49,102,52,53,52,99,97,48,53,57,100,48,100,53,48,55,98,56,51,100,54,102,101,102,101,100,53,52,48,50,53,55,49,97,99,48,99,57,97,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:37:23.723342 +0300 1500568643 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:37:23.723937 +0300 1500568643 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:37:23.724411 +0300 1500568643 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.19939.13> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:37:43.835582 +0300 1500568663 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:37:43.836148 +0300 1500568663 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:37:43.837542 +0300 1500568663 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.9223.14> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:38:04.918906 +0300 1500568684 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:38:04.920371 +0300 1500568684 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:38:04.921132 +0300 1500568684 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.22232.14> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:38:26.438963 +0300 1500568706 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:38:26.441240 +0300 1500568706 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:38:26.441819 +0300 1500568706 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.5686.15> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:39:31.659653 +0300 1500568771 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:39:31.660134 +0300 1500568771 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:39:31.661283 +0300 1500568771 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.20661.15> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:40:01.773351 +0300 1500568801 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:40:01.773860 +0300 1500568801 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:40:01.774190 +0300 1500568801 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.4586.17> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:40:22.745567 +0300 1500568822 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:40:22.746069 +0300 1500568822 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:40:22.746673 +0300 1500568822 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.28351.17> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:40:42.852335 +0300 1500568842 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:40:42.852770 +0300 1500568842 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:40:42.853120 +0300 1500568842 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.11759.18> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:40:55.854837 +0300 1500568855 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:40:55.855802 +0300 1500568855 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:40:55.856360 +0300 1500568855 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.28083.18> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:40:55.858391 +0300 1500568855 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,42,18,96,104,151,33,21,195,99,36,241,255,203,123,187,233,109,0,0,0,133,98,111,100,121,116,101,115,116,47,54,100,47,54,53,47,51,56,47,54,100,54,53,51,56,102,54,98,49,48,49,97,102,97,57,102,51,54,54,98,98,49,99,97,101,51,49,99,102,52,102,53,101,99,51,102,53,54,55,57,101,49,50,100,97,50,101,55,49,102,49,52,48,51,51,101,102,54,51,48,98,50,52,56,50,52,51,51,53,56,53,55,98,52,51,102,51,98,53,97,98,102,99,102,101,101,48,54,98,56,48,49,48,53,48,100,99,53,97,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:41:08.860290 +0300 1500568868 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:41:08.860750 +0300 1500568868 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:41:08.861189 +0300 1500568868 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.3751.19> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:41:08.864488 +0300 1500568868 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,42,93,84,246,102,162,36,191,248,134,50,209,160,246,12,89,109,0,0,0,133,98,111,100,121,116,101,115,116,47,57,56,47,51,56,47,98,101,47,57,56,51,56,98,101,51,97,101,100,98,51,48,55,97,56,53,48,100,53,97,100,57,52,99,101,102,57,102,101,50,55,101,97,56,51,56,50,54,52,51,101,54,100,101,99,53,54,49,57,54,55,56,57,49,50,50,100,101,48,57,99,55,100,48,52,100,51,53,102,54,54,48,56,53,55,54,49,56,99,55,99,53,99,56,100,55,53,102,102,50,55,50,49,51,99,48,48,53,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:41:21.862490 +0300 1500568881 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:41:21.865461 +0300 1500568881 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:41:21.866002 +0300 1500568881 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.11986.19> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:41:21.868401 +0300 1500568881 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,42,253,124,3,146,58,153,107,170,73,196,200,107,124,70,12,109,0,0,0,133,98,111,100,121,116,101,115,116,47,53,97,47,53,98,47,56,49,47,53,97,53,98,56,49,56,50,99,98,102,53,101,53,98,50,49,99,49,57,102,49,49,57,55,50,52,99,50,48,55,98,97,56,54,97,48,52,101,55,97,97,55,102,49,97,98,55,102,52,56,100,99,101,53,97,48,50,98,48,99,55,52,50,51,48,50,53,99,49,57,54,98,97,53,97,101,48,57,97,99,50,49,54,56,53,100,101,55,97,49,51,101,52,54,53,48,48,56,50,100,97,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:41:43.318645 +0300 1500568903 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:41:43.319131 +0300 1500568903 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:41:43.319606 +0300 1500568903 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.20867.19> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:42:13.168277 +0300 1500568933 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:42:13.168686 +0300 1500568933 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:42:13.169027 +0300 1500568933 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.2797.20> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:42:44.42270 +0300 1500568964 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:42:44.42725 +0300 1500568964 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_3",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:42:44.43206 +0300 1500568964 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.22965.20> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_3,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:42:44.116878 +0300 1500568964 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_3,{remove,<<131,104,2,110,16,0,41,233,152,250,25,234,185,3,159,249,195,75,6,193,234,149,109,0,0,0,133,98,111,100,121,116,101,115,116,47,53,99,47,55,99,47,48,101,47,53,99,55,99,48,101,102,56,52,98,97,51,55,57,53,56,101,57,102,52,55,101,54,56,54,54,55,51,48,51,54,50,100,53,52,99,57,55,48,98,99,49,100,55,97,56,52,102,55,99,99,55,55,102,57,53,56,98,99,53,49,98,100,48,51,52,53,54,52,49,53,49,49,100,100,52,99,97,101,48,102,55,49,55,53,56,49,97,52,51,50,57,48,54,51,97,49,56,99,48,48,48,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:43:43.107190 +0300 1500569023 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:43:43.107555 +0300 1500569023 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:43:43.107844 +0300 1500569023 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.12801.21> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:44:55.487350 +0300 1500569095 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:44:55.488374 +0300 1500569095 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:44:55.490281 +0300 1500569095 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.27006.22> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:45:44.440394 +0300 1500569144 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:45:44.446359 +0300 1500569144 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_1",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:45:44.450574 +0300 1500569144 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.16498.24> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_1,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:45:44.454278 +0300 1500569144 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,57,247,169,22,234,112,135,7,148,188,119,50,58,181,122,117,109,0,0,0,133,98,111,100,121,116,101,115,116,47,53,53,47,54,54,47,101,102,47,53,53,54,54,101,102,97,53,53,56,52,100,97,48,97,51,50,53,57,102,101,57,98,48,99,57,100,57,102,97,98,57,52,57,55,98,99,49,57,99,102,49,53,50,98,100,56,50,51,53,101,102,49,55,53,50,54,56,51,53,99,49,97,102,101,57,55,55,56,100,50,101,102,97,57,54,48,101,98,50,56,97,97,97,54,49,56,102,55,50,54,101,97,51,52,99,57,52,52,56,48,49,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
[E] storage_0@192.168.3.53 2017-07-20 19:49:44.541355 +0300 1500569384 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:49:44.542521 +0300 1500569384 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_2",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:49:44.543099 +0300 1500569384 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.23966.25> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_2,count,10000]}} in context child_terminated
[E] storage_0@192.168.3.53 2017-07-20 19:52:08.383527 +0300 1500569528 null:null 0 gen_server leo_storage_handler_del_directory terminated with reason: {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in gen_server:call/3 line 212
[E] storage_0@192.168.3.53 2017-07-20 19:52:08.384080 +0300 1500569528 null:null 0 ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_storage_handler_del_directory",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[[123,["timeout",44,[123,["gen_server",44,"call",44,[91,["leo_delete_dir_queue_1_4",44,"count",44,"10000"],93]],125]],125]," in ",[["gen_server",58,"terminate",47,"7"],[32,108,105,110,101,32,"812"]]]]]
[E] storage_0@192.168.3.53 2017-07-20 19:52:08.384527 +0300 1500569528 null:null 0 Supervisor leo_storage_sup had child undefined started with leo_storage_handler_del_directory:start_link() at <0.15232.31> exit with reason {timeout,{gen_server,call,[leo_delete_dir_queue_1_4,count,10000]}} in context child_terminated
Logs for storage_1:
[I] storage_1@192.168.3.54 2017-07-20 19:26:43.56260 +0300 1500568003 leo_storage_handler_del_directory:run/5 141 [{"msg: enqueued",<<"bodytest">>}]
[I] storage_1@192.168.3.54 2017-07-20 19:29:40.188688 +0300 1500568180 null:null 0 ["alarm_handler",58,32,"{set,{system_memory_high_watermark,[]}}"]
[W] storage_1@192.168.3.54 2017-07-20 19:28:18.878858 +0300 1500568098 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/52/1d/82/521d82ec1c121d384f0ae426745289d7d48359eab6c2cd99c176419fd897a17875d3f2f6fef5584dea4f130504d90fee4866770000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-20 19:28:57.779793 +0300 1500568137 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/e0/79/34/e079343a8d0f00c9374bbebe2c35eaa257511cc452e016f237ba54cf3458a817b3f57cad72a96859fb0e11c80b2e1ade03f1040100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-20 19:29:36.167953 +0300 1500568176 leo_storage_handler_object:replicate_fun/3 1406 [{cause,"Could not get a metadata"}]
[E] storage_1@192.168.3.54 2017-07-20 19:29:36.169588 +0300 1500568176 leo_storage_handler_object:put/4 423 [{from,storage},{method,delete},{key,<<"bodytest/f7/68/1f/f7681f320d1c516981326302016b6fcabf63e441cf19cc373d482bb7a4cb531d7d76e7b7fbfa1079b96749123d43c85808daa40000000000.xz\n1">>},{req_id,0},{cause,"Could not get a metadata"}]
[W] storage_1@192.168.3.54 2017-07-20 19:29:37.760044 +0300 1500568177 leo_storage_handler_object:replicate_fun/3 1406 [{cause,"Could not get a metadata"}]
[E] storage_1@192.168.3.54 2017-07-20 19:29:37.760725 +0300 1500568177 leo_storage_handler_object:put/4 423 [{from,storage},{method,delete},{key,<<"bodytest/de/68/50/de6850f34b4e62d52496f47186e20a4932c0c570c961c24262543031d73275707da1710e2475f1b0c1f9c3cbbb273ca8d3dc020100000000.xz\n4">>},{req_id,0},{cause,"Could not get a metadata"}]
[W] storage_1@192.168.3.54 2017-07-20 19:29:40.642703 +0300 1500568180 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/e5/a3/84/e5a384f216d906661267a25f00c7e91f69d61e0b628d527940226028734d07dd7fc12da6dcb17936a2ac13ff1a15087248d5df0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-20 19:30:21.26429 +0300 1500568221 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/55/dd/da/55dddad03e531499b6844655a7621ad3c1fa4fd1653de1187d2055a5d4edc719250d77b7311852571820300783c73ee2f7fb030100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-20 19:30:23.334339 +0300 1500568223 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/cb/f1/83/cbf183675706473d4b747c12281bf50fc541a238ca442b2e14fad0eb71476657aa0cfa79aed45ab98f195745aabf8885b843580000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-20 19:31:12.979853 +0300 1500568272 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/9d/94/18/9d9418650983a9e30bd7f13a673ca730b2f0d4768c9219216295d63dcee018c0b987bc4953cf8ae704008d8453ba54eba8fc700000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-20 19:31:22.553110 +0300 1500568282 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/c1/cb/49/c1cb49681fc5b5faecef846d33ac51381a3072bf18f7ddb96cb4e460124d6533f1fdf7d85bc6a85fb5236068b552184bbfc7020100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_1@192.168.3.54 2017-07-20 19:32:04.918653 +0300 1500568324 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/8f/8d/7c/8f8d7c273b8eadb4aeab29726ff21a5f39e5f75ef05ce9185afd769cc9f22bb8f0be0780dc30f595fc7e350dd946ffa880e73f0100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[E] storage_1@192.168.3.54 2017-07-20 19:32:20.672358 +0300 1500568340 leo_mq_consumer:consume/4 526 [{module,leo_storage_mq},{id,leo_delete_dir_queue_1},{cause,{timeout,{gen_server,call,[leo_delete_dir_queue_1_1,{remove,<<131,104,2,110,16,0,18,38,11,47,88,189,190,69,27,89,90,85,231,238,224,105,109,0,0,0,133,98,111,100,121,116,101,115,116,47,100,51,47,98,50,47,50,53,47,100,51,98,50,50,53,100,50,100,52,101,57,100,51,48,102,55,100,50,51,98,98,53,99,49,100,51,102,98,102,48,49,50,102,50,98,51,56,56,100,49,57,50,101,49,97,101,56,98,51,98,48,55,48,100,101,97,52,98,56,97,48,53,57,102,98,53,100,53,48,56,53,50,51,50,100,49,56,51,57,97,98,98,51,97,49,54,56,99,101,51,100,50,51,53,97,100,99,50,98,48,50,48,48,48,48,48,48,48,48,48,48,46,120,122>>},10000]}}}]
Logs for storage_2:
[I] storage_2@192.168.3.55 2017-07-20 19:26:43.65430 +0300 1500568003 leo_storage_handler_del_directory:run/5 141 [{"msg: enqueued",<<"bodytest">>}]
[W] storage_2@192.168.3.55 2017-07-20 19:28:56.735586 +0300 1500568136 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz\n2">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_3,{get,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz\n2">>},30000]}}}]
[W] storage_2@192.168.3.55 2017-07-20 19:28:57.727733 +0300 1500568137 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/aa/45/fb/aa45fb9aac3eb1d0ef733e0b79474933eeeff7d9365aed9caf2b6eba5231522a164e4e323cf507c681f0058d7eb8cb4540e3620000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55 2017-07-20 19:31:17.784960 +0300 1500568277 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/40/b6/57/40b657e019055bc1cbf9b2d27051d6c7b6cfe98cafcaac76bf346c7c782188e5e6b91df2fb4ee39a2cbbea020dfd9c3238226d0000000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
[W] storage_2@192.168.3.55 2017-07-20 19:31:56.121550 +0300 1500568316 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/ef/98/4b/ef984b782734a9509889ccc4e318767bdcdeb8509e47612d87f175836c9be96d63e1a9624a05662379b356b486203430b99e640000000000.xz\n2">>},{node,local},{req_id,0},{cause,{timeout,{gen_server,call,[leo_metadata_7,{get,<<"bodytest/ef/98/4b/ef984b782734a9509889ccc4e318767bdcdeb8509e47612d87f175836c9be96d63e1a9624a05662379b356b486203430b99e640000000000.xz\n2">>},30000]}}}]
[W] storage_2@192.168.3.55 2017-07-20 19:32:03.179147 +0300 1500568323 leo_storage_replicator:replicate_fun/2 243 [{key,<<"bodytest/ce/2b/8a/ce2b8ade1c0bbc0e92c163b287b5269b31357a57d6f23c8245ef36898fc5cbc395fca3acbecf2d1f14defd8dbdc2a495c4f1040100000000.xz">>},{node,local},{req_id,0},{cause,"Replicate failure"}]
The problem: for storage_1 and storage_2 the state never changed from "enqueuing". For example, here are the stats from the point when bucket deletion was basically done on storage_1 and storage_2 (their queues were empty, "du" showed a ratio of active size around 50%) but there were still ~150K messages in the storage_0 queue:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node | state | timestamp
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55 | enqueuing | 2017-07-20 19:58:13 +0300
storage_1@192.168.3.54 | enqueuing | 2017-07-20 19:58:11 +0300
storage_0@192.168.3.53 | monitoring | 2017-07-20 19:58:07 +0300
This is current state (all queues are empty):
[root@leo-m0 ~]# /usr/local/bin/leofs-adm delete-bucket-stats
- Bucket: bodytest
node | state | timestamp
-----------------------------+------------------+-----------------------------
storage_2@192.168.3.55 | enqueuing | 2017-07-20 20:34:54 +0300
storage_1@192.168.3.54 | enqueuing | 2017-07-20 20:34:55 +0300
storage_0@192.168.3.53 | finished | 2017-07-20 20:05:34 +0300
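For reference, the states above can be watched with a trivial shell loop like the one below - just a sketch that reuses the same commands as above; the node list and the 30-second interval are arbitrary:
while true; do
    date
    # async-deletion queue length on each storage node
    for n in storage_0@192.168.3.53 storage_1@192.168.3.54 storage_2@192.168.3.55; do
        /usr/local/bin/leofs-adm mq-stats "$n" | grep leo_async_deletion_queue
    done
    # per-node delete-bucket state
    /usr/local/bin/leofs-adm delete-bucket-stats
    sleep 30
done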
EDIT: I deleted another bucket and it's the same situation: storage_0 is "finished" while storage_1 and 2 are stuck in "enqueuing". I packed the "del_dir" contents on storage_1 after that and uploaded them to https://www.dropbox.com/s/5qckg363tuuhcow/del_dir.tar.gz?dl=0 - will that be helpful? (Do you use some tool to dump the contents of these queues? I tried https://github.com/tgulacsi/leveldb-tools but it complains about a corrupted "zero header" on the MANIFEST file.)
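Lacking a proper dump tool, the crudest thing I can think of is just running strings over the LevelDB files and grepping for the bucket prefix - a rough sketch only; the path below is a placeholder for wherever the del_dir queue directory actually lives on the node:
# count distinct "bodytest/..." keys still visible in the queue's LevelDB files
# (this also picks up already-consumed entries that haven't been compacted away,
#  so it's only a rough indicator)
cd /path/to/leo_storage/work/queue/del_dir
find . -name '*.sst' -o -name '*.ldb' -o -name '*.log' \
  | xargs strings -n 16 2>/dev/null \
  | grep -o 'bodytest/[0-9a-f/]*\.xz' | sort -u | wc -l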
I've also confirmed that my assumption from the previous comment (https://github.com/leo-project/leofs/issues/725#issuecomment-316479835) is correct and the counters on storage_1 really did go wrong during that bucket deletion experiment; they were fine on storage_0, however. I rolled back to the original state and ran compaction on storage_1 to verify the counters, and the values were the same. That means there was no error before the experiment, and the error (2035 remaining objects according to the counters, while in reality there were 0 left) appeared during the course of bucket deletion.
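For the record, that verification was nothing more than comparing the du counters before and after a full compaction, roughly along these lines (the values matched):
/usr/local/bin/leofs-adm du storage_1@192.168.3.54
/usr/local/bin/leofs-adm compact-start storage_1@192.168.3.54 all
/usr/local/bin/leofs-adm compact-status storage_1@192.168.3.54    # repeat until the compaction finishes
/usr/local/bin/leofs-adm du storage_1@192.168.3.54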
Another note: it looks like delete-bucket-stats automatically wipes the stats some time after the state is "finished" for all nodes (not sure when exactly, 1 or 2 hours maybe?). Which is fine, but it should probably be mentioned in the documentation for that command so that no one gets confused.
I got a test cluster (1.3.4, 3 storage nodes, N=2, W=1). There are two buckets, "body" and "bodytest", each containing the same objects - about 1M per bucket (there are some other buckets as well, but they hardly contain anything). In other words, there are slightly over 2M objects in the cluster in total. At the start of this test the data is fully consistent. There is some minor load on the cluster against the "body" bucket - some PUT & GET operations, but very few of them. No one touches the "bodytest" bucket.
I want to remove "bodytest" with all its objects. I execute s3cmd rb s3://bodytest. I see load on gateway and storage nodes; after some time, s3cmd fails because of a timeout (I expect this to happen - there is no way storage can find all 1M objects and mark them as deleted fast enough). I see the leo_async_deletion_queue queues growing on the storage nodes:
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 97845 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 102780 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 104911 | 0 | 3000 | async deletion of objs
[root@leo-m0 ~]# /usr/local/bin/leofs-adm mq-stats storage_2@192.168.3.55|grep leo_async_deletion_queue
leo_async_deletion_queue | idling | 108396 | 0 | 3000 | async deletion of objs
The same goes for storage_0 and storage_1. There is hardly any disk load; each storage node consumes 120-130% CPU as per top. Then some errors appear in the error log on gateway_0:
If these errors just mean that the gateway returned a "timeout" to the client that requested the "delete bucket" operation, plus some other timeouts due to the load on the system, then that is within expectations; as long as all data from that bucket eventually gets marked as "deleted" asynchronously, all is fine.
That's not what happens, however. At some point - a few minutes after the "delete bucket" operation - the delete queues stop growing or shrinking. It's as if they are stuck. Here is their current state, 1.5 hours after the experiment; they got to this state within 5-10 minutes after the start of the experiment and never changed since (I show only one queue here, the others are empty):
There is nothing in the logs of the manager nodes. There is nothing in the erlang.log files on the storage nodes (no mention of restarts or anything). Logs on the storage nodes, info log for storage_0:
Error log on storage_0:
Info log on storage_1:
Error log on storage_1:
Info log on storage_2:
Error log on storage_2:
To summarize the problems:
and on storage_0:
What happened here is that "minor load" that I mentioned. Basically, at 17:17:13 an application tried to do a PUT operation for the object body/08/08/e2/0808e2f9815aa7d4b9c92b01db4fa208344d063d83c151b5349095f4004b60e1684056dab62f9bc2b51ac0e6b3721ca6a807010000000000.xz. That's a very small object, 27 KB in size. Moments after the successful PUT, a few (5, I believe) other applications did a GET for that object. However, they were all using the same gateway with caching enabled, so they should have gotten the object from the memory cache (at worst the gateway would have checked the ETag against the storage node). 17:17:13 was in the middle of the "delete bucket" operation, so I suppose a large "processing time" for the PUT was expected. But why the "read_repairer" errors and "primary_inconsistency"?? Storage_0 is the "primary" node for this object: