NewDund opened this issue 5 years ago
@NewDund The enotconn error we can see in crash.log happens when a TCP connection is not established, so I guess some network trouble occurred around that time and some connections were forcefully closed by network switches placed between LeoGateway and LeoStorage. I'd recommend you contact your infra team (or, if you run LeoFS on a public cloud, contact the provider) to confirm whether something went wrong around that time.
If you need any information, please contact me.
Just in case, please give us the error log files on LeoStorage if they exist.
Any network-related system metrics would also be helpful if your system gathers that kind of information regularly.
@mocchira This is part of the error log I captured from the storage node; please see whether it's helpful.
[W] storage_01@10.0.1.106 2018-12-19 18:08:22.866362 +0800 1545214102 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:23.62127 +0800 1545214103 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:24.547035 +0800 1545214104 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,115042698958933531525465566678727684218},{key,<<"xz-img-01/51,0,96,15055,1500,2000,87fc2486.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:24.567591 +0800 1545214104 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,85522489140579400833319482861472971630},{key,<<"xz-img-01/51,0,6,14972,3038,2000,d559c8f4.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:29.645343 +0800 1545214109 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:30.740829 +0800 1545214110 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,301825275647700918725340957232212760431},{key,<<"xz-img-01/51,0,3,15066,2666,2000,74b05a3c.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:31.836745 +0800 1545214111 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:31.840704 +0800 1545214111 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:31.935687 +0800 1545214111 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:32.722483 +0800 1545214112 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:34.658964 +0800 1545214114 leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:34.661988 +0800 1545214114 leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:34.662048 +0800 1545214114 leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:35.173043 +0800 1545214115 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,303555783980173236865626863861593103855},{key,<<"xz-img-01/51,0,40,15027,1024,682,54cd19b0.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:35.193046 +0800 1545214115 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,303555783980173236865626863861593103855},{key,<<"xz-img-01/51,0,40,15027,1024,682,54cd19b0.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:35.362758 +0800 1545214115 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:35.699131 +0800 1545214115 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,132627485593750080073349362187704947564},{key,<<"xz-img-01/51,0,95,14931,1800,1200,dcbe253e.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:35.945801 +0800 1545214115 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:35.950816 +0800 1545214115 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:35.964784 +0800 1545214115 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:38.419244 +0800 1545214118 leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,274799762605490884010058104327502159603},{key,<<"xz-img-01/51,0,79,15227,3000,2000,aa740cb9.jpg">>},{clock,1545214072066214},{cause,timeout}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:38.433349 +0800 1545214118 leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,274799762605490884010058104327502159603},{key,<<"xz-img-01/51,0,79,15227,3000,2000,aa740cb9.jpg">>},{clock,1545214072066214},{cause,timeout}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:38.758215 +0800 1545214118 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,139114591507114234479358787829134935318},{key,<<"xz-img-01/51,0,11,14976,132,132,5030263d.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:38.799226 +0800 1545214118 leo_storage_read_repairer:compare/4 174 [{node,'storage_05@10.0.1.110'},{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{clock,1545214103039150},{cause,secondary_inconsistency}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:39.25278 +0800 1545214119 leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,46555256417576372219622714676382478038},{key,<<"xz-img-01/51,0,58,3146,1334,1334,34132f53.jpg">>},{clock,1543114132553194},{cause,timeout}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:39.494405 +0800 1545214119 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:44.970543 +0800 1545214124 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:44.971314 +0800 1545214124 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:45.526425 +0800 1545214125 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:45.540390 +0800 1545214125 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:45.553651 +0800 1545214125 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:45.566184 +0800 1545214125 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:45.574858 +0800 1545214125 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,218849011693645702606343196622539469801},{key,<<"xz-img-01/51,0,86,15030,1500,2000,bf93d6ac.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:48.393474 +0800 1545214128 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:48.844323 +0800 1545214128 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:48.844409 +0800 1545214128 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:48.844456 +0800 1545214128 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:48.845286 +0800 1545214128 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:48.875203 +0800 1545214128 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:49.205864 +0800 1545214129 leo_storage_read_repairer:compare/4 174 [{node,'storage_05@10.0.1.110'},{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{clock,1545214109131558},{cause,primary_inconsistency}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:49.506328 +0800 1545214129 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,218849011693645702606343196622539469801},{key,<<"xz-img-01/51,0,86,15030,1500,2000,bf93d6ac.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:49.636818 +0800 1545214129 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106 2018-12-19 18:08:50.353193 +0800 1545214130 leo_sync_remote_cluster:defer_stack_1/1 86 [{addr_id,132627485593750080073349362187704947564},{key,<<"xz-img-01/51,0,95,14931,1800,1200,dcbe253e.jpg">>},{cause,sending_data_to_remote}]
@mocchira
Here is my traffic limit set with tc:
/usr/sbin/tc qdisc add dev eno16777984 root tbf rate 150Mbit latency 60ms burst 15000
This limit is a safe amount for my intranet traffic, but I don't know how much bandwidth LeoFS needs. Can you give me some advice?
@NewDund I can see many error lines containing the word sending_data_to_remote, so let me confirm: under normal circumstances, does this error appear at such a high frequency? sending_data_to_remote means "LeoStorage is trying to do MDCR with the remote cluster, so it cannot take on further replication tasks right now; retry after a certain period of time". MDCR therefore seems to have become a bottleneck, and that might be putting LeoStorage under high load.
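As a rough way to gauge that frequency, a one-liner along these lines could tally the warnings per minute from the LeoStorage error log (the log path below is an assumption and may differ between installs):
grep -h sending_data_to_remote /usr/local/leofs/current/leo_storage/log/app/error.* | awk '{print $3, substr($4, 1, 5)}' | sort | uniq -c
# prints a count per date and hour:minute, e.g. "  12 2018-12-19 18:08"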
One possible reason MDCR has become a bottleneck is your 150Mbit bandwidth cap, so please check how much bandwidth was actually in use when the incident happened. If the link was saturated, you may have to consider raising the cap to a larger value such as 200 - 300Mbit.
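If the link does turn out to be saturated, the cap could be inspected and raised in place with tc, roughly as sketched below; the interface name is the one from the tc rule quoted earlier in this thread, and 300mbit is only an example value in the suggested range (with the burst scaled up roughly in proportion):
/usr/sbin/tc -s qdisc show dev eno16777984   # show the current qdisc plus its drop/overlimit counters
/usr/sbin/tc qdisc change dev eno16777984 root tbf rate 300mbit latency 60ms burst 30000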
@mocchira You mean that both my local and remote cluster bandwidth restrictions should be increased, right?
@mocchira I checked as you suggested and found that there may have been network fluctuation between our local cluster and the remote cluster at that time. (Of course, I'm not entirely sure.)
The main point I want to make is that the local cluster and the remote cluster should, in theory, be independent of each other: nothing that happens in one cluster should affect the other. In other words, even if a sudden failure of the communication between the two clusters causes object replication or cluster recovery to fail, it should not affect normal access requests on the other cluster.
I think this should be the focus of optimization; it needs a fix.
@NewDund
I checked as you suggested and found that there may have been network fluctuation between our local cluster and the remote cluster at that time. (Of course, I'm not entirely sure.)
Got it. Then the cause was probably something network-related.
The main point I want to make is that the local cluster and the remote cluster should, in theory, be independent of each other: nothing that happens in one cluster should affect the other. In other words, even if a sudden failure of the communication between the two clusters causes object replication or cluster recovery to fail, it should not affect normal access requests on the other cluster.
Right. The retry mechanism we currently use for asynchronous tasks, including MDCR, isn't sufficiently optimized (there is room for improvement), so we are going to improve it in a future release.
Well, anyway, I hope to see the related issues mentioned in the notes for future version updates.
Yes, we will. First, I will file the MDCR-related issue to deal with the situation in which the network bandwidth between the two clusters is unstable.
@NewDund I will also share a recommended configuration that might make this kind of incident less likely to happen than with the default settings. Please wait a while.
@NewDund The setting mdc_replication.size_of_stacked_objs in leo_storage.conf, explained at https://leo-project.net/leofs/docs/admin/settings/leo_storage/, controls how much data is transferred to the remote cluster at once. The default is around 32MB; that is probably too large for your environment (capped at 150Mbit), so I'd recommend lowering it to something like 16MB (16777216). This might prevent the same kind of incident from happening again.
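A minimal sketch of how that could be applied on a storage node follows; the config path is an assumption that may differ per install, and the node needs a restart for the new value to take effect:
grep -n "size_of_stacked_objs" /usr/local/leofs/current/leo_storage/etc/leo_storage.conf   # locate the current setting
# then set the line to 16MB as suggested above:
#   mdc_replication.size_of_stacked_objs = 16777216
# and restart the storage node so the new value is picked up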
Okay, thank you, but we're going to stop MDCR for now and use programs we wrote ourselves to synchronize the two clusters manually.
Then, if an official release offers a better solution, we will test it again.
@NewDund
Okay, thank you, but we're going to stop MDCR for now and use programs we wrote ourselves to synchronize the two clusters manually.
OK. That's a little sad for us, but your service stability should come before anything else.
Then, if an official release offers a better solution, we will test it again.
Thanks! I'm sure we will be able to improve MDCR stability thanks to your feedback, so once we are confident it meets your requirements, we'll get back to you.
@mocchira My cluster has just run into a thorny problem. My LB is returning a lot of 503 and 504 errors, even though my LB and gateway nodes are configured with very large timeouts.
Please take a look at my logs.
The following is the output of leofs-adm mq-stats for each storage node:
==========
storage_01
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
leo_async_deletion_queue | idling | 0 | 100 | 1000 | requests of removing objects asynchronously
leo_comp_meta_with_dc_queue | idling | 0 | 100 | 1000 | requests of comparing metadata w/remote-node
leo_delete_dir_queue_1 | idling | 0 | 100 | 1000 | requests of removing buckets #1
leo_delete_dir_queue_2 | idling | 0 | 100 | 1000 | requests of removing buckets #2
leo_delete_dir_queue_3 | idling | 0 | 100 | 1000 | requests of removing buckets #3
leo_delete_dir_queue_4 | idling | 0 | 100 | 1000 | requests of removing buckets #4
leo_delete_dir_queue_5 | idling | 0 | 100 | 1000 | requests of removing buckets #5
leo_delete_dir_queue_6 | idling | 0 | 100 | 1000 | requests of removing buckets #6
leo_delete_dir_queue_7 | idling | 0 | 100 | 1000 | requests of removing buckets #7
leo_delete_dir_queue_8 | idling | 0 | 100 | 1000 | requests of removing buckets #8
leo_per_object_queue | running | 365 | 100 | 1000 | requests of fixing inconsistency of objects
leo_rebalance_queue | idling | 0 | 100 | 1000 | requests of relocating objects
leo_recovery_node_queue | idling | 0 | 100 | 1000 | requests of recovering objects of the node (incl. recover-consistency)
leo_req_delete_dir_queue | idling | 0 | 100 | 1000 | requests of removing directories
leo_sync_by_vnode_id_queue | idling | 0 | 100 | 1000 | requests of synchronizing objects by vnode-id
leo_sync_obj_with_dc_queue | idling | 24 | 100 | 1000 | requests of synchronizing objects w/remote-node
==========
storage_02
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
leo_async_deletion_queue | idling | 0 | 100 | 1000 | requests of removing objects asynchronously
leo_comp_meta_with_dc_queue | idling | 0 | 100 | 1000 | requests of comparing metadata w/remote-node
leo_delete_dir_queue_1 | idling | 0 | 100 | 1000 | requests of removing buckets #1
leo_delete_dir_queue_2 | idling | 0 | 100 | 1000 | requests of removing buckets #2
leo_delete_dir_queue_3 | idling | 0 | 100 | 1000 | requests of removing buckets #3
leo_delete_dir_queue_4 | idling | 0 | 100 | 1000 | requests of removing buckets #4
leo_delete_dir_queue_5 | idling | 0 | 100 | 1000 | requests of removing buckets #5
leo_delete_dir_queue_6 | idling | 0 | 100 | 1000 | requests of removing buckets #6
leo_delete_dir_queue_7 | idling | 0 | 100 | 1000 | requests of removing buckets #7
leo_delete_dir_queue_8 | idling | 0 | 100 | 1000 | requests of removing buckets #8
leo_per_object_queue | running | 245 | 100 | 1000 | requests of fixing inconsistency of objects
leo_rebalance_queue | idling | 0 | 100 | 1000 | requests of relocating objects
leo_recovery_node_queue | idling | 0 | 100 | 1000 | requests of recovering objects of the node (incl. recover-consistency)
leo_req_delete_dir_queue | idling | 0 | 100 | 1000 | requests of removing directories
leo_sync_by_vnode_id_queue | idling | 0 | 100 | 1000 | requests of synchronizing objects by vnode-id
leo_sync_obj_with_dc_queue | idling | 10 | 100 | 1000 | requests of synchronizing objects w/remote-node
==========
storage_03
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
leo_async_deletion_queue | idling | 0 | 100 | 1000 | requests of removing objects asynchronously
leo_comp_meta_with_dc_queue | idling | 0 | 100 | 1000 | requests of comparing metadata w/remote-node
leo_delete_dir_queue_1 | idling | 0 | 100 | 1000 | requests of removing buckets #1
leo_delete_dir_queue_2 | idling | 0 | 100 | 1000 | requests of removing buckets #2
leo_delete_dir_queue_3 | idling | 0 | 100 | 1000 | requests of removing buckets #3
leo_delete_dir_queue_4 | idling | 0 | 100 | 1000 | requests of removing buckets #4
leo_delete_dir_queue_5 | idling | 0 | 100 | 1000 | requests of removing buckets #5
leo_delete_dir_queue_6 | idling | 0 | 100 | 1000 | requests of removing buckets #6
leo_delete_dir_queue_7 | idling | 0 | 100 | 1000 | requests of removing buckets #7
leo_delete_dir_queue_8 | idling | 0 | 100 | 1000 | requests of removing buckets #8
leo_per_object_queue | running | 336 | 100 | 1000 | requests of fixing inconsistency of objects
leo_rebalance_queue | idling | 0 | 100 | 1000 | requests of relocating objects
leo_recovery_node_queue | idling | 0 | 100 | 1000 | requests of recovering objects of the node (incl. recover-consistency)
leo_req_delete_dir_queue | idling | 0 | 100 | 1000 | requests of removing directories
leo_sync_by_vnode_id_queue | idling | 0 | 100 | 1000 | requests of synchronizing objects by vnode-id
leo_sync_obj_with_dc_queue | idling | 11 | 100 | 1000 | requests of synchronizing objects w/remote-node
==========
storage_04
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
leo_async_deletion_queue | idling | 0 | 100 | 1000 | requests of removing objects asynchronously
leo_comp_meta_with_dc_queue | idling | 0 | 100 | 1000 | requests of comparing metadata w/remote-node
leo_delete_dir_queue_1 | idling | 0 | 100 | 1000 | requests of removing buckets #1
leo_delete_dir_queue_2 | idling | 0 | 100 | 1000 | requests of removing buckets #2
leo_delete_dir_queue_3 | idling | 0 | 100 | 1000 | requests of removing buckets #3
leo_delete_dir_queue_4 | idling | 0 | 100 | 1000 | requests of removing buckets #4
leo_delete_dir_queue_5 | idling | 0 | 100 | 1000 | requests of removing buckets #5
leo_delete_dir_queue_6 | idling | 0 | 100 | 1000 | requests of removing buckets #6
leo_delete_dir_queue_7 | idling | 0 | 100 | 1000 | requests of removing buckets #7
leo_delete_dir_queue_8 | idling | 0 | 100 | 1000 | requests of removing buckets #8
leo_per_object_queue | running | 858 | 100 | 1000 | requests of fixing inconsistency of objects
leo_rebalance_queue | idling | 0 | 100 | 1000 | requests of relocating objects
leo_recovery_node_queue | idling | 0 | 100 | 1000 | requests of recovering objects of the node (incl. recover-consistency)
leo_req_delete_dir_queue | idling | 0 | 100 | 1000 | requests of removing directories
leo_sync_by_vnode_id_queue | idling | 0 | 100 | 1000 | requests of synchronizing objects by vnode-id
leo_sync_obj_with_dc_queue | idling | 0 | 100 | 1000 | requests of synchronizing objects w/remote-node
==========
storage_05
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
leo_async_deletion_queue | idling | 0 | 100 | 1000 | requests of removing objects asynchronously
leo_comp_meta_with_dc_queue | idling | 0 | 100 | 1000 | requests of comparing metadata w/remote-node
leo_delete_dir_queue_1 | idling | 0 | 100 | 1000 | requests of removing buckets #1
leo_delete_dir_queue_2 | idling | 0 | 100 | 1000 | requests of removing buckets #2
leo_delete_dir_queue_3 | idling | 0 | 100 | 1000 | requests of removing buckets #3
leo_delete_dir_queue_4 | idling | 0 | 100 | 1000 | requests of removing buckets #4
leo_delete_dir_queue_5 | idling | 0 | 100 | 1000 | requests of removing buckets #5
leo_delete_dir_queue_6 | idling | 0 | 100 | 1000 | requests of removing buckets #6
leo_delete_dir_queue_7 | idling | 0 | 100 | 1000 | requests of removing buckets #7
leo_delete_dir_queue_8 | idling | 0 | 100 | 1000 | requests of removing buckets #8
leo_per_object_queue | running | 178 | 100 | 1000 | requests of fixing inconsistency of objects
leo_rebalance_queue | idling | 0 | 100 | 1000 | requests of relocating objects
leo_recovery_node_queue | idling | 0 | 100 | 1000 | requests of recovering objects of the node (incl. recover-consistency)
leo_req_delete_dir_queue | idling | 0 | 100 | 1000 | requests of removing directories
leo_sync_by_vnode_id_queue | idling | 0 | 100 | 1000 | requests of synchronizing objects by vnode-id
leo_sync_obj_with_dc_queue | idling | 8 | 100 | 1000 | requests of synchronizing objects w/remote-node
==========
storage_06
id | state | number of msgs | batch of msgs | interval | description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
leo_async_deletion_queue | idling | 0 | 100 | 1000 | requests of removing objects asynchronously
leo_comp_meta_with_dc_queue | idling | 0 | 100 | 1000 | requests of comparing metadata w/remote-node
leo_delete_dir_queue_1 | idling | 0 | 100 | 1000 | requests of removing buckets #1
leo_delete_dir_queue_2 | idling | 0 | 100 | 1000 | requests of removing buckets #2
leo_delete_dir_queue_3 | idling | 0 | 100 | 1000 | requests of removing buckets #3
leo_delete_dir_queue_4 | idling | 0 | 100 | 1000 | requests of removing buckets #4
leo_delete_dir_queue_5 | idling | 0 | 100 | 1000 | requests of removing buckets #5
leo_delete_dir_queue_6 | idling | 0 | 100 | 1000 | requests of removing buckets #6
leo_delete_dir_queue_7 | idling | 0 | 100 | 1000 | requests of removing buckets #7
leo_delete_dir_queue_8 | idling | 0 | 100 | 1000 | requests of removing buckets #8
leo_per_object_queue | running | 228 | 100 | 1000 | requests of fixing inconsistency of objects
leo_rebalance_queue | idling | 0 | 100 | 1000 | requests of relocating objects
leo_recovery_node_queue | idling | 0 | 100 | 1000 | requests of recovering objects of the node (incl. recover-consistency)
leo_req_delete_dir_queue | idling | 0 | 100 | 1000 | requests of removing directories
leo_sync_by_vnode_id_queue | idling | 0 | 100 | 1000 | requests of synchronizing objects by vnode-id
leo_sync_obj_with_dc_queue | idling | 0 | 100 | 1000 | requests of synchronizing objects w/remote-node
And these are the error logs from my gateway node:
[W] gateway_01@10.0.1.103 2019-01-09 23:13:36.664776 +0800 1547046816 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{309428713511234830042129322128762777919,<<"xz-img-01/51,0,12,9422,2697,2000,3cc29bb8.jpg">>},-1,-1,false,13855437},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:36.881613 +0800 1547046816 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13458583},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:36.973674 +0800 1547046816 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13458675},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:37.312840 +0800 1547046817 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:37.396117 +0800 1547046817 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_50004_2,{get,{123201639718953384550134771689223183842,<<"xz-img-01/51,0,30,18212,2666,2000,695775a4.jpg">>},-1,-1,false,14269474},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:37.668072 +0800 1547046817 leo_gateway_rpc_handler:handle_error/5 298 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,put},{cause,"Replicate failure"}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:37.921751 +0800 1547046817 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13459623},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:38.378746 +0800 1547046818 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:38.380149 +0800 1547046818 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_50004_2,{get,{123201639718953384550134771689223183842,<<"xz-img-01/51,0,30,18212,2666,2000,695775a4.jpg">>},-1,-1,false,14270458},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:38.571475 +0800 1547046818 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13460273},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:41.169871 +0800 1547046821 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{309428713511234830042129322128762777919,<<"xz-img-01/51,0,12,9422,2697,2000,3cc29bb8.jpg">>},-1,-1,false,13859942},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:42.192968 +0800 1547046822 leo_gateway_rpc_handler:handle_error/5 298 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,put},{cause,"Replicate failure"}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:42.378423 +0800 1547046822 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13464080},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:42.479124 +0800 1547046822 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13464180},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:43.77648 +0800 1547046823 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13464779},30000]}}}}]
[W] gateway_01@10.0.1.103 2019-01-09 23:13:44.67854 +0800 1547046824 leo_gateway_rpc_handler:handle_error/5 303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13465769},30000]}}}}]
Finally, the error log of my storage node:
[W] storage_01@10.0.1.106 2019-01-09 23:26:04.334265 +0800 1547047564 leo_storage_read_repairer:loop/6 122 [{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:04.334730 +0800 1547047564 leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:04.334769 +0800 1547047564 leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:04.596170 +0800 1547047564 leo_storage_read_repairer:loop/6 122 [{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:04.596268 +0800 1547047564 leo_storage_read_repairer:compare/4 174 [{node,'storage_05@10.0.1.110'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:04.596306 +0800 1547047564 leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[E] storage_01@10.0.1.106 2019-01-09 23:26:04.676252 +0800 1547047564 gen_server:call 0 gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:call/2 line 204
[E] storage_01@10.0.1.106 2019-01-09 23:26:04.676440 +0800 1547047564 gen_server:call 0 CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:terminate/7 line 812
[E] storage_01@10.0.1.106 2019-01-09 23:26:04.676641 +0800 1547047564 gen_server:call 0 Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.22727.13> exit with reason no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in context child_terminated
[W] storage_01@10.0.1.106 2019-01-09 23:26:07.231076 +0800 1547047567 leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"xz-img-01/51,0,46,3074,2666,2000,f0394644.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:08.218021 +0800 1547047568 leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"xz-img-01/51,0,46,3074,2666,2000,f0394644.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:08.654064 +0800 1547047568 leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:08.832096 +0800 1547047568 leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:08.832555 +0800 1547047568 leo_storage_read_repairer:loop/6 122 [{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:08.832622 +0800 1547047568 leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:09.96855 +0800 1547047569 leo_storage_read_repairer:compare/4 174 [{node,'storage_05@10.0.1.110'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_1_0,{head_with_check_avs,{327593916421263477271857669055578618711,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},check_header,14339443},30000]}}}}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:09.97249 +0800 1547047569 leo_storage_read_repairer:loop/6 122 [{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:09.97303 +0800 1547047569 leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:09.525111 +0800 1547047569 leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[E] storage_01@10.0.1.106 2019-01-09 23:26:09.677330 +0800 1547047569 gen_server:call 0 gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:call/2 line 204
[E] storage_01@10.0.1.106 2019-01-09 23:26:09.678023 +0800 1547047569 gen_server:call 0 CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:terminate/7 line 812
[E] storage_01@10.0.1.106 2019-01-09 23:26:09.678264 +0800 1547047569 gen_server:call 0 Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.22827.13> exit with reason no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in context child_terminated
[E] storage_01@10.0.1.106 2019-01-09 23:26:10.36222 +0800 1547047570 null:null 0 {module,"leo_mq_consumer"},{function,"consume/4"},{line,535},{body,[{module,leo_storage_mq},{id,leo_sync_obj_with_dc_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_10000_0,{get,{229460670587452068026601146147356987503,<<"xz-img-01/51,0,62,24332,132,132,6922bea2.jpg">>},-1,-1,false,15022113},30000]}}}]}
[W] storage_01@10.0.1.106 2019-01-09 23:26:10.200066 +0800 1547047570 leo_storage_handler_object:read_and_repair_3/3 1434 [{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106 2019-01-09 23:26:11.965959 +0800 1547047571 leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,28903003486663116147292874696914478445},{key,<<"xz-img-01/51,0,24,14729,1656,1242,e953b005.jpg">>},{clock,1545127149393637},{cause,timeout}]
[E] storage_01@10.0.1.106 2019-01-09 23:26:12.154656 +0800 1547047572 null:null 0 {module,"leo_pod_manager"},{function,"start_child/2"},{line,324},{body,{connection_error,{connection_error,econnrefused}}}
[E] storage_01@10.0.1.106 2019-01-09 23:26:12.155043 +0800 1547047572 null:null 0 Supervisor leo_rpc_client_storage_14_at_13077_sup had child leo_rpc_client_storage_14_at_13077 started with leo_pod_manager:start_link(leo_rpc_client_storage_14_at_13077, 8, 8, leo_rpc_client_conn, [storage_14,"10.16.1.74",13077,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[E] storage_01@10.0.1.106 2019-01-09 23:26:12.155201 +0800 1547047572 null:null 0 CRASH REPORT Process <0.23087.13> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[W] storage_01@10.0.1.106 2019-01-09 23:26:12.155398 +0800 1547047572 null:null 0 {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{body,{{shutdown,{failed_to_start_child,leo_rpc_client_storage_14_at_13077,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_storage_14_at_13077,{leo_pod_sup,start_link,[leo_rpc_client_storage_14_at_13077,8,8,leo_rpc_client_conn,[storage_14,"10.16.1.74",13077,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E] storage_01@10.0.1.106 2019-01-09 23:26:12.194692 +0800 1547047572 null:null 0 {module,"leo_pod_manager"},{function,"start_child/2"},{line,324},{body,{connection_error,{connection_error,econnrefused}}}
[E] storage_01@10.0.1.106 2019-01-09 23:26:12.194994 +0800 1547047572 null:null 0 Supervisor leo_rpc_client_storage_14_at_13077_sup had child leo_rpc_client_storage_14_at_13077 started with leo_pod_manager:start_link(leo_rpc_client_storage_14_at_13077, 8, 8, leo_rpc_client_conn, [storage_14,"10.16.1.74",13077,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[E] storage_01@10.0.1.106 2019-01-09 23:26:12.195140 +0800 1547047572 null:null 0 CRASH REPORT Process <0.23091.13> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[W] storage_01@10.0.1.106 2019-01-09 23:26:12.195310 +0800 1547047572 null:null 0 {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{body,{{shutdown,{failed_to_start_child,leo_rpc_client_storage_14_at_13077,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_storage_14_at_13077,{leo_pod_sup,start_link,[leo_rpc_client_storage_14_at_13077,8,8,leo_rpc_client_conn,[storage_14,"10.16.1.74",13077,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
@mocchira This is my production environment. Please reply as soon as possible. Thank you very much.
@mocchira Moreover, I found that 500 errors occasionally appear, which I had not seen before. Can you help me analyze why the 500s show up?
@mocchira Also, I found a very strange problem.
For one of my PUT requests, haproxy returned 200 to me, but my leo_gateway never received the PUT request. Is that possible?
If so, how can I judge whether a PUT actually succeeded?
@NewDund Sorry for the long delay. I can spare time to look into this tonight.
@mocchira My cluster had been functioning normally, but today there was a weird situation. Between 17:58 and 18:18, my storage nodes were under very heavy load and my requests returned a lot of 504s. When I ran 'leofs-adm mq-stats storage-node', I found that 'leo_sync_obj_with_dc_queue' had become very high, but I didn't do anything today, so I don't know why. Here are my crash.log and error logs.
crash.log
error
If you need any information, please contact me. Because this is a production environment, I hope you can reply as soon as possible. Thank you.
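As a minimal sketch for watching that queue during an incident like this, a loop such as the following could track leo_sync_obj_with_dc_queue over time (the node name is taken from the logs earlier in this thread; substitute your own):
watch -n 10 "leofs-adm mq-stats storage_01@10.0.1.106 | grep leo_sync_obj_with_dc_queue"   # refresh the queue depth every 10 seconds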