leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

Storage node overloads and returns 504, Very urgent !!! #1175

Open NewDund opened 5 years ago

NewDund commented 5 years ago

@mocchira My cluster has been functioning normally, but today a weird situation occurred. Between 17:58 and 18:18 my storage nodes were very heavily loaded and many of my requests returned 504. When I ran 'leofs-adm mq-stats storage-node', I found that 'leo_sync_obj_with_dc_queue' had become very high, but I didn't do anything today, so I don't know why. Here are my crash.log and error logs.

crash.log

2018-12-19 18:03:05 =ERROR REPORT====
** Generic server <0.18428.2146> terminating
** Last message in was {get,2,{http_req,#Port<0.36308029>,ranch_tcp,keepalive,<0.18061.2146>,<<"GET">>,'HTTP/1.1',{{10,0,1,104},58701},<<"10.0.1.104">>,undefined,80,<<"/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>,undefined,<<>>,[],[],[{<<"host">>,<<"10.0.1.104">>},{<<"user-agent">>,<<"aws-sdk-php/1.6.2 PHP/5.6.27 curl/7.49.1 openssl/1.0.2h">>},{<<"accept">>,<<"*/*">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"referer">>,<<"http://10.0.1.104/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>},{<<"content-type">>,<<"application/x-www-form-urlencoded">>},{<<"date">>,<<"Wed, 19 Dec 2018 10:02:56 GMT">>},{<<"authorization">>,<<"AWS ea925e5b9fbc24e48e46:NwMk+eIcPXQlHOFAebcTiBu6mew=">>},{<<"content-length">>,<<"0">>}],[],undefined,[],waiting,<<>>,undefined,false,waiting,[],<<>>,undefined},{metadata_3,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,197420687858415348216969267030332246516,46,6917037,<<>>,0,5242880,2,0,87005854909,1545211399859741,63712430599,150382645841396063523320865937486611814,355900593,undefined,0,0,0,0,0,0}}
** When Server state == {state,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,{transport_record,ranch_tcp,#Port<0.36308029>,5242880},false,undefined}
** Reason for termination ==
** {enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:05 =CRASH REPORT====
  crasher:
    initial call: leo_large_object_get_handler:init/1
    pid: <0.18428.2146>
    registered_name: []
    exception exit: {{enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},[{gen_server,terminate,7,[{file,"gen_server.erl"},{line,812}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.18061.2146>]
    messages: []
    links: [<0.18061.2146>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1079
  neighbours:
    neighbour: [{pid,<0.18061.2146>},{registered_name,[]},{initial_call,{cowboy_protocol,init,4}},{current_function,{gen,do_call,4}},{ancestors,[]},{messages,[{cowboy_req,resp_sent}]},{links,[<0.13843.1>,<0.18428.2146>,#Port<0.36308029>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,2586},{stack_size,52},{reductions,3412}]
2018-12-19 18:03:05 =ERROR REPORT====
Ranch listener leo_gateway_s3_api had connection process started with cowboy_protocol:start_link/4 at <0.18061.2146> exit with reason: {enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:05 =ERROR REPORT====
** Generic server <0.18437.2146> terminating
** Last message in was {get,2,{http_req,#Port<0.36308099>,ranch_tcp,keepalive,<0.18239.2146>,<<"GET">>,'HTTP/1.1',{{10,0,1,104},58748},<<"10.0.1.104">>,undefined,80,<<"/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>,undefined,<<>>,[],[],[{<<"host">>,<<"10.0.1.104">>},{<<"user-agent">>,<<"aws-sdk-php/1.6.2 PHP/5.6.27 curl/7.49.1 openssl/1.0.2h">>},{<<"accept">>,<<"*/*">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"referer">>,<<"http://10.0.1.104/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>},{<<"content-type">>,<<"application/x-www-form-urlencoded">>},{<<"date">>,<<"Wed, 19 Dec 2018 10:02:58 GMT">>},{<<"authorization">>,<<"AWS ea925e5b9fbc24e48e46:qUPBHb8o9VjV+ZxzCAH2YLfQvr0=">>},{<<"content-length">>,<<"0">>}],[],undefined,[],waiting,<<>>,undefined,false,waiting,[],<<>>,undefined},{metadata_3,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,197420687858415348216969267030332246516,46,6917037,<<>>,0,5242880,2,0,87005854909,1545211399859741,63712430599,150382645841396063523320865937486611814,355900593,undefined,0,0,0,0,0,0}}
** When Server state == {state,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,{transport_record,ranch_tcp,#Port<0.36308099>,5242880},false,undefined}
** Reason for termination ==
** {enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:05 =CRASH REPORT====
  crasher:
    initial call: leo_large_object_get_handler:init/1
    pid: <0.18437.2146>
    registered_name: []
    exception exit: {{enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},[{gen_server,terminate,7,[{file,"gen_server.erl"},{line,812}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.18239.2146>]
    messages: []
    links: [<0.18239.2146>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1079
  neighbours:
    neighbour: [{pid,<0.18239.2146>},{registered_name,[]},{initial_call,{cowboy_protocol,init,4}},{current_function,{gen,do_call,4}},{ancestors,[]},{messages,[{cowboy_req,resp_sent}]},{links,[<0.13843.1>,<0.18437.2146>,#Port<0.36308099>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,2586},{stack_size,52},{reductions,3412}]
2018-12-19 18:03:05 =ERROR REPORT====
Ranch listener leo_gateway_s3_api had connection process started with cowboy_protocol:start_link/4 at <0.18239.2146> exit with reason: {enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:27 =ERROR REPORT====
** Generic server <0.18451.2146> terminating
** Last message in was {get,2,{http_req,#Port<0.36308192>,ranch_tcp,keepalive,<0.18370.2146>,<<"GET">>,'HTTP/1.1',{{10,0,1,104},58805},<<"10.0.1.104">>,undefined,80,<<"/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>,undefined,<<>>,[],[],[{<<"host">>,<<"10.0.1.104">>},{<<"user-agent">>,<<"aws-sdk-php/1.6.2 PHP/5.6.27 curl/7.49.1 openssl/1.0.2h">>},{<<"accept">>,<<"*/*">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"referer">>,<<"http://10.0.1.104/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>},{<<"content-type">>,<<"application/x-www-form-urlencoded">>},{<<"date">>,<<"Wed, 19 Dec 2018 10:03:02 GMT">>},{<<"authorization">>,<<"AWS ea925e5b9fbc24e48e46:xacTyFtJ5aMFCiTyV0RzW+yCL+c=">>},{<<"content-length">>,<<"0">>}],[],undefined,[],waiting,<<>>,undefined,false,waiting,[],<<>>,undefined},{metadata_3,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,197420687858415348216969267030332246516,46,6917037,<<>>,0,5242880,2,0,87005854909,1545211399859741,63712430599,150382645841396063523320865937486611814,355900593,undefined,0,0,0,0,0,0}}
** When Server state == {state,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,{transport_record,ranch_tcp,#Port<0.36308192>,5242880},false,undefined}
** Reason for termination ==
** {closed,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:27 =CRASH REPORT====
  crasher:
    initial call: leo_large_object_get_handler:init/1
    pid: <0.18451.2146>
    registered_name: []
    exception exit: {{closed,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},[{gen_server,terminate,7,[{file,"gen_server.erl"},{line,812}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.18370.2146>]
    messages: []
    links: [<0.18370.2146>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1307
  neighbours:
    neighbour: [{pid,<0.18370.2146>},{registered_name,[]},{initial_call,{cowboy_protocol,init,4}},{current_function,{gen,do_call,4}},{ancestors,[]},{messages,[{cowboy_req,resp_sent}]},{links,[<0.13843.1>,<0.18451.2146>,#Port<0.36308192>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,2586},{stack_size,52},{reductions,3412}]
2018-12-19 18:03:27 =ERROR REPORT====
Ranch listener leo_gateway_s3_api had connection process started with cowboy_protocol:start_link/4 at <0.18370.2146> exit with reason: {closed,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}

error log

[W] gateway_01@10.0.1.103   2018-12-19 18:10:06.832107 +0800    1545214206  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:07.718168 +0800    1545214207  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:08.496338 +0800    1545214208  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:10.385112 +0800    1545214210  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:11.262133 +0800    1545214211  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:11.331273 +0800    1545214211  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:11.403194 +0800    1545214211  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:11.602119 +0800    1545214211  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:13.252119 +0800    1545214213  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:14.570133 +0800    1545214214  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:15.556191 +0800    1545214215  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:17.700117 +0800    1545214217  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:20.974151 +0800    1545214220  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_04@10.0.1.109'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:25.51340 +0800 1545214225  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:28.598102 +0800    1545214228  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:30.52759 +0800 1545214230  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:31.908043 +0800    1545214231  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:32.98143 +0800 1545214232  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:33.599150 +0800    1545214233  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:34.440171 +0800    1545214234  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_04@10.0.1.109'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:35.54134 +0800 1545214235  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:35.506103 +0800    1545214235  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:36.406227 +0800    1545214236  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:38.80066 +0800 1545214238  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:38.497026 +0800    1545214238  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:38.600112 +0800    1545214238  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:39.239158 +0800    1545214239  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_04@10.0.1.109'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:40.809148 +0800    1545214240  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:40.827144 +0800    1545214240  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:41.165104 +0800    1545214241  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:41.332036 +0800    1545214241  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:41.407166 +0800    1545214241  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:42.171189 +0800    1545214242  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:44.150284 +0800    1545214244  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:45.628375 +0800    1545214245  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:46.16232 +0800 1545214246  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:46.408126 +0800    1545214246  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:47.697231 +0800    1545214247  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:47.701054 +0800    1545214247  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:48.628117 +0800    1545214248  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:49.530327 +0800    1545214249  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:52.173096 +0800    1545214252  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:54.254144 +0800    1545214254  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:55.508230 +0800    1545214255  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:10:58.399195 +0800    1545214258  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:11:05.507265 +0800    1545214265  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:11:35.508243 +0800    1545214295  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:11:59.501286 +0800    1545214319  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:12:07.716355 +0800    1545214327  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:12:12.825171 +0800    1545214332  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:12:19.533378 +0800    1545214339  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:12:29.502258 +0800    1545214349  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:12:37.717278 +0800    1545214357  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:12:42.826205 +0800    1545214362  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:12:49.534201 +0800    1545214369  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:25:49.532172 +0800    1545215149  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:25:58.744361 +0800    1545215158  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:03.263182 +0800    1545215163  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:09.941075 +0800    1545215169  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:14.36295 +0800 1545215174  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:19.533228 +0800    1545215179  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:28.745348 +0800    1545215188  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:33.264141 +0800    1545215193  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:39.942148 +0800    1545215199  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:44.37257 +0800 1545215204  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:49.534314 +0800    1545215209  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:26:58.747222 +0800    1545215218  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:27:03.265327 +0800    1545215223  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:27:09.943436 +0800    1545215229  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:27:14.38267 +0800 1545215234  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:31:22.92283 +0800 1545215482  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W] gateway_01@10.0.1.103   2018-12-19 18:31:52.594265 +0800    1545215512  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_04@10.0.1.109'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]

If you need any information, please contact me. Because this is a production environment, I hope for a reply as soon as possible. Thank you.

mocchira commented 5 years ago

@NewDund The enotconn error we can see in crash.log happens when the TCP connection is not established, so I guess some network trouble happened around that time and some connections were forcefully closed by network switches placed between LeoGateway and LeoStorage. I'd recommend contacting your infra team (or, if you run LeoFS on a public cloud, the provider) to confirm whether something went wrong around that time.

If you need any information, please contact me.

Just in case, please give us the error log files from LeoStorage if they exist.

mocchira commented 5 years ago

Also, any network-related system metrics would be helpful, if your system gathers such information regularly.
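
If you don't collect such metrics yet, a minimal sketch of commands that could capture them around an incident window might look like the following (assuming the sysstat package is installed; the exact options may differ on your distribution):

sar -n DEV 1 60                  # per-interface throughput, 1-second samples for a minute
sar -n EDEV 1 60                 # per-interface error/drop counters
netstat -s | grep -i retrans     # cumulative TCP retransmission counters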

NewDund commented 5 years ago

@mocchira Here is an excerpt from the error log of one of my storage nodes; please see whether it helps.

[W] storage_01@10.0.1.106   2018-12-19 18:08:22.866362 +0800    1545214102  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:23.62127 +0800 1545214103  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:24.547035 +0800    1545214104  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,115042698958933531525465566678727684218},{key,<<"xz-img-01/51,0,96,15055,1500,2000,87fc2486.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:24.567591 +0800    1545214104  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,85522489140579400833319482861472971630},{key,<<"xz-img-01/51,0,6,14972,3038,2000,d559c8f4.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:29.645343 +0800    1545214109  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:30.740829 +0800    1545214110  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,301825275647700918725340957232212760431},{key,<<"xz-img-01/51,0,3,15066,2666,2000,74b05a3c.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:31.836745 +0800    1545214111  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:31.840704 +0800    1545214111  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:31.935687 +0800    1545214111  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:32.722483 +0800    1545214112  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:34.658964 +0800    1545214114  leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:34.661988 +0800    1545214114  leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:34.662048 +0800    1545214114  leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:35.173043 +0800    1545214115  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,303555783980173236865626863861593103855},{key,<<"xz-img-01/51,0,40,15027,1024,682,54cd19b0.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:35.193046 +0800    1545214115  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,303555783980173236865626863861593103855},{key,<<"xz-img-01/51,0,40,15027,1024,682,54cd19b0.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:35.362758 +0800    1545214115  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:35.699131 +0800    1545214115  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,132627485593750080073349362187704947564},{key,<<"xz-img-01/51,0,95,14931,1800,1200,dcbe253e.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:35.945801 +0800    1545214115  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:35.950816 +0800    1545214115  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:35.964784 +0800    1545214115  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:38.419244 +0800    1545214118  leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,274799762605490884010058104327502159603},{key,<<"xz-img-01/51,0,79,15227,3000,2000,aa740cb9.jpg">>},{clock,1545214072066214},{cause,timeout}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:38.433349 +0800    1545214118  leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,274799762605490884010058104327502159603},{key,<<"xz-img-01/51,0,79,15227,3000,2000,aa740cb9.jpg">>},{clock,1545214072066214},{cause,timeout}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:38.758215 +0800    1545214118  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,139114591507114234479358787829134935318},{key,<<"xz-img-01/51,0,11,14976,132,132,5030263d.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:38.799226 +0800    1545214118  leo_storage_read_repairer:compare/4 174 [{node,'storage_05@10.0.1.110'},{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{clock,1545214103039150},{cause,secondary_inconsistency}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:39.25278 +0800 1545214119  leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,46555256417576372219622714676382478038},{key,<<"xz-img-01/51,0,58,3146,1334,1334,34132f53.jpg">>},{clock,1543114132553194},{cause,timeout}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:39.494405 +0800    1545214119  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:44.970543 +0800    1545214124  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:44.971314 +0800    1545214124  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:45.526425 +0800    1545214125  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:45.540390 +0800    1545214125  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:45.553651 +0800    1545214125  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:45.566184 +0800    1545214125  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:45.574858 +0800    1545214125  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,218849011693645702606343196622539469801},{key,<<"xz-img-01/51,0,86,15030,1500,2000,bf93d6ac.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:48.393474 +0800    1545214128  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:48.844323 +0800    1545214128  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:48.844409 +0800    1545214128  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:48.844456 +0800    1545214128  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:48.845286 +0800    1545214128  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:48.875203 +0800    1545214128  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:49.205864 +0800    1545214129  leo_storage_read_repairer:compare/4 174 [{node,'storage_05@10.0.1.110'},{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{clock,1545214109131558},{cause,primary_inconsistency}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:49.506328 +0800    1545214129  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,218849011693645702606343196622539469801},{key,<<"xz-img-01/51,0,86,15030,1500,2000,bf93d6ac.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:49.636818 +0800    1545214129  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W] storage_01@10.0.1.106   2018-12-19 18:08:50.353193 +0800    1545214130  leo_sync_remote_cluster:defer_stack_1/1 86  [{addr_id,132627485593750080073349362187704947564},{key,<<"xz-img-01/51,0,95,14931,1800,1200,dcbe253e.jpg">>},{cause,sending_data_to_remote}]
NewDund commented 5 years ago

@mocchira Here is my traffic-limiting rule set with tc:

/usr/sbin/tc qdisc add dev eno16777984 root tbf rate 150Mbit latency 60ms burst 15000

This limit is a safe amount for my intranet traffic, but I don't know how much bandwidth LeoFS needs. Can you give me some advice?

mocchira commented 5 years ago

@NewDund I can see many error lines containing the word sending_data_to_remote, so let me confirm: under normal circumstances, does this error line appear at such a high frequency? sending_data_to_remote means that "LeoStorage is trying to do MDCR (multi-datacenter replication) with the remote cluster, so it cannot take on another replication task right now and will retry after a certain period of time". MDCR therefore seems to have become a bottleneck, and that might have put LeoStorage under high load.

One possible reason why MDCR becomes a bottleneck is your 150Mbit bandwidth cap, so please check the bandwidth actually used at the time the incident happened. If the network bandwidth was saturated, you may have to consider raising the cap to a larger number, such as 200-300Mbit.
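
For example, if the 150Mbit cap turns out to be the limiting factor, a rough sketch of inspecting and raising your existing tbf rule could be the following (same interface and latency as your current command; burst generally has to grow along with the rate, and the exact numbers are only an illustration):

/usr/sbin/tc -s qdisc show dev eno16777984
/usr/sbin/tc qdisc change dev eno16777984 root tbf rate 300Mbit latency 60ms burst 30000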

NewDund commented 5 years ago

@mocchira You mean that both my local and remote cluster bandwidth restrictions should be increased, right?

NewDund commented 5 years ago

@mocchira I checked, following your suggestion, and found that there might have been a network fluctuation between our local cluster and the remote cluster at that time. (Of course, I'm not entirely sure.)

The main point I want to make is that the local and remote clusters should theoretically be independent of each other: nothing that happens in one cluster should affect the other. That is to say, even if a sudden failure of communication between the two clusters causes object replication or cluster recovery to fail, it should not affect the normal access requests of the other cluster.

I think this should be the focus of optimization. NEED FIX ~

mocchira commented 5 years ago

@NewDund

I checked, following your suggestion, and found that there might have been a network fluctuation between our local cluster and the remote cluster at that time. (Of course, I'm not entirely sure.)

Got it. Then the cause was probably something network-related.

The main point I want to make is that the local and remote clusters should theoretically be independent of each other: nothing that happens in one cluster should affect the other. That is to say, even if a sudden failure of communication between the two clusters causes object replication or cluster recovery to fail, it should not affect the normal access requests of the other cluster.

Right. The retry mechanism we currently use for asynchronous tasks, including MDCR, isn't sufficiently optimized (there is room for improvement), so we are going to improve it in a future release.

NewDund commented 5 years ago

Well, anyway, I hope to see the related issues mentioned in the release notes of future versions.

mocchira commented 5 years ago

Yes, we will. First, I will file an issue related to MDCR to deal with the situation in which the network bandwidth between the two clusters is unstable.

mocchira commented 5 years ago

@NewDund I will also share a recommended configuration which might make this kind of incident less likely to happen than with the default settings. Please wait a while.

mocchira commented 5 years ago

@NewDund The setting mdc_replication.size_of_stacked_objs in leo_storage.conf, explained at https://leo-project.net/leofs/docs/admin/settings/leo_storage/, controls how much data is transferred to the remote cluster at once. The default is around 32MB; that is probably too large for your environment (capped at 150Mbit), so I'd recommend lowering this setting, for example to 16MB (16777216). This might prevent the same kind of incident from happening again.
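
For reference, the change would be a single line in leo_storage.conf (the value is in bytes, so 16777216 = 16MB):

# lower the MDC-replication stacking size from the ~32MB default to 16MB
mdc_replication.size_of_stacked_objs = 16777216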

NewDund commented 5 years ago

Okay, thank you, but we're going to stop MDCR first and then use a self-written program to manually synchronize the two clusters.

Then, if there is a more complete official solution, we will test it again.
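
For reference, the kind of program we have in mind is a rough sketch like the one below, copying objects between the two clusters through their S3 gateways (the endpoint URLs, credentials, and bucket handling are placeholders only; a real tool would also need error handling, deletes, and incremental tracking):

import boto3

# Placeholder clients for the local (source) and remote (destination) LeoFS S3 gateways.
SRC = boto3.client("s3", endpoint_url="http://local-gateway",
                   aws_access_key_id="SRC_KEY", aws_secret_access_key="SRC_SECRET")
DST = boto3.client("s3", endpoint_url="http://remote-gateway",
                   aws_access_key_id="DST_KEY", aws_secret_access_key="DST_SECRET")

def sync_bucket(bucket):
    # Walk every key in the source bucket and copy it to the destination bucket.
    paginator = SRC.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            body = SRC.get_object(Bucket=bucket, Key=key)["Body"].read()
            DST.put_object(Bucket=bucket, Key=key, Body=body)

if __name__ == "__main__":
    sync_bucket("xz-img-01")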

mocchira commented 5 years ago

@NewDund

Okay, thank you, but we're going to stop MDCR first and then use a self-written program to manually synchronize the two clusters.

OK. It's a little sad for us, but your service stability should come before anything else.

Then, if there is a more complete official solution, we will test it again.

Thanks! I'm sure we will be able to improve MDCR stability thanks to your feedback, so once we are confident it will meet your requirements, we'll get back to you.

NewDund commented 5 years ago

@mocchira My cluster has encountered a thorny problem. My LB has returned a lot of 503 and 504 errors, even though my LB and network management nodes have very large timeouts configured.

Let's take a look at my log.

The following is the result returned by 'leofs-adm mq-stats':

==========
storage_01
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 365            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 24             | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_02
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 245            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 10             | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_03
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 336            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 11             | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_04
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 858            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_05
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 178            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 8              | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_06
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 228            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects w/remote-node

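As a quick way to see whether the non-zero `leo_per_object_queue` backlog shown above is draining or still growing, a minimal polling sketch like the following can help. It is not part of LeoFS: the node names are taken from the logs further down, `leofs-adm` is assumed to be on the PATH, and the queue size is assumed to be the third pipe-separated column of the mq-stats output.

```python
#!/usr/bin/env python3
# Minimal sketch: poll `leofs-adm mq-stats` and print how leo_per_object_queue
# changes over time, to see whether the repair backlog is draining or piling up.
import subprocess
import time

NODES = ["storage_05@10.0.1.110", "storage_06@10.0.1.105"]  # adjust to your cluster
QUEUE = "leo_per_object_queue"

def queue_size(node: str) -> int:
    out = subprocess.run(["leofs-adm", "mq-stats", node],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.lstrip().startswith(QUEUE):
            return int(line.split("|")[2].strip())  # 3rd column = "number of msgs"
    return -1  # queue not listed

while True:  # Ctrl-C to stop
    print(time.strftime("%H:%M:%S"),
          {node: queue_size(node) for node in NODES})
    time.sleep(10)
```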
And these are the error logs from my gateway node (gateway_01):

[W] gateway_01@10.0.1.103   2019-01-09 23:13:36.664776 +0800    1547046816  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{309428713511234830042129322128762777919,<<"xz-img-01/51,0,12,9422,2697,2000,3cc29bb8.jpg">>},-1,-1,false,13855437},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:36.881613 +0800    1547046816  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13458583},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:36.973674 +0800    1547046816  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13458675},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:37.312840 +0800    1547046817  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:37.396117 +0800    1547046817  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_50004_2,{get,{123201639718953384550134771689223183842,<<"xz-img-01/51,0,30,18212,2666,2000,695775a4.jpg">>},-1,-1,false,14269474},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:37.668072 +0800    1547046817  leo_gateway_rpc_handler:handle_error/5  298 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,put},{cause,"Replicate failure"}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:37.921751 +0800    1547046817  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13459623},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:38.378746 +0800    1547046818  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:38.380149 +0800    1547046818  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_50004_2,{get,{123201639718953384550134771689223183842,<<"xz-img-01/51,0,30,18212,2666,2000,695775a4.jpg">>},-1,-1,false,14270458},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:38.571475 +0800    1547046818  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13460273},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:41.169871 +0800    1547046821  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{309428713511234830042129322128762777919,<<"xz-img-01/51,0,12,9422,2697,2000,3cc29bb8.jpg">>},-1,-1,false,13859942},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:42.192968 +0800    1547046822  leo_gateway_rpc_handler:handle_error/5  298 [{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,put},{cause,"Replicate failure"}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:42.378423 +0800    1547046822  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13464080},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:42.479124 +0800    1547046822  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13464180},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:43.77648 +0800 1547046823  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13464779},30000]}}}}]
[W] gateway_01@10.0.1.103   2019-01-09 23:13:44.67854 +0800 1547046824  leo_gateway_rpc_handler:handle_error/5  303 [{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13465769},30000]}}}}]
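Almost every warning above is a 30-second gen_server timeout or a replicate failure against one of a few storage nodes. A rough sketch for tallying those warnings per node, to see which node is the bottleneck; the log path is an assumption and should be pointed at the actual leo_gateway error log:

```python
#!/usr/bin/env python3
# Count gateway-side timeout / replicate-failure warnings per storage node.
import re
from collections import Counter

LOG = "/usr/local/leofs/leo_gateway/log/app/error.log"  # assumed location
NODE_RE = re.compile(r"\{node,'([^']+)'\}")

counts = Counter()
with open(LOG) as fh:
    for line in fh:
        if "timeout" in line or "Replicate failure" in line:
            m = NODE_RE.search(line)
            if m:
                counts[m.group(1)] += 1

for node, n in counts.most_common():
    print(f"{node}\t{n}")
```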

Finally, here are the error logs from my storage node:

[W] storage_01@10.0.1.106   2019-01-09 23:26:04.334265 +0800    1547047564  leo_storage_read_repairer:loop/6    122 [{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:04.334730 +0800    1547047564  leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:04.334769 +0800    1547047564  leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:04.596170 +0800    1547047564  leo_storage_read_repairer:loop/6    122 [{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:04.596268 +0800    1547047564  leo_storage_read_repairer:compare/4 174 [{node,'storage_05@10.0.1.110'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:04.596306 +0800    1547047564  leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[E] storage_01@10.0.1.106   2019-01-09 23:26:04.676252 +0800    1547047564  gen_server:call 0   gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:call/2 line 204
[E] storage_01@10.0.1.106   2019-01-09 23:26:04.676440 +0800    1547047564  gen_server:call 0   CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:terminate/7 line 812
[E] storage_01@10.0.1.106   2019-01-09 23:26:04.676641 +0800    1547047564  gen_server:call 0   Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.22727.13> exit with reason no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in context child_terminated
[W] storage_01@10.0.1.106   2019-01-09 23:26:07.231076 +0800    1547047567  leo_storage_handler_object:read_and_repair_3/3  1434    [{key,<<"xz-img-01/51,0,46,3074,2666,2000,f0394644.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:08.218021 +0800    1547047568  leo_storage_handler_object:read_and_repair_3/3  1434    [{key,<<"xz-img-01/51,0,46,3074,2666,2000,f0394644.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:08.654064 +0800    1547047568  leo_storage_handler_object:read_and_repair_3/3  1434    [{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:08.832096 +0800    1547047568  leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:08.832555 +0800    1547047568  leo_storage_read_repairer:loop/6    122 [{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:08.832622 +0800    1547047568  leo_storage_read_repairer:compare/4 174 [{node,'storage_03@10.0.1.108'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:09.96855 +0800 1547047569  leo_storage_read_repairer:compare/4 174 [{node,'storage_05@10.0.1.110'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_1_0,{head_with_check_avs,{327593916421263477271857669055578618711,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},check_header,14339443},30000]}}}}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:09.97249 +0800 1547047569  leo_storage_read_repairer:loop/6    122 [{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:09.97303 +0800 1547047569  leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:09.525111 +0800    1547047569  leo_storage_handler_object:read_and_repair_3/3  1434    [{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[E] storage_01@10.0.1.106   2019-01-09 23:26:09.677330 +0800    1547047569  gen_server:call 0   gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:call/2 line 204
[E] storage_01@10.0.1.106   2019-01-09 23:26:09.678023 +0800    1547047569  gen_server:call 0   CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:terminate/7 line 812
[E] storage_01@10.0.1.106   2019-01-09 23:26:09.678264 +0800    1547047569  gen_server:call 0   Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.22827.13> exit with reason no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in context child_terminated
[E] storage_01@10.0.1.106   2019-01-09 23:26:10.36222 +0800 1547047570  null:null   0   {module,"leo_mq_consumer"},{function,"consume/4"},{line,535},{body,[{module,leo_storage_mq},{id,leo_sync_obj_with_dc_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_10000_0,{get,{229460670587452068026601146147356987503,<<"xz-img-01/51,0,62,24332,132,132,6922bea2.jpg">>},-1,-1,false,15022113},30000]}}}]}
[W] storage_01@10.0.1.106   2019-01-09 23:26:10.200066 +0800    1547047570  leo_storage_handler_object:read_and_repair_3/3  1434    [{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[W] storage_01@10.0.1.106   2019-01-09 23:26:11.965959 +0800    1547047571  leo_storage_read_repairer:compare/4 174 [{node,'storage_02@10.0.1.107'},{addr_id,28903003486663116147292874696914478445},{key,<<"xz-img-01/51,0,24,14729,1656,1242,e953b005.jpg">>},{clock,1545127149393637},{cause,timeout}]
[E] storage_01@10.0.1.106   2019-01-09 23:26:12.154656 +0800    1547047572  null:null   0   {module,"leo_pod_manager"},{function,"start_child/2"},{line,324},{body,{connection_error,{connection_error,econnrefused}}}
[E] storage_01@10.0.1.106   2019-01-09 23:26:12.155043 +0800    1547047572  null:null   0   Supervisor leo_rpc_client_storage_14_at_13077_sup had child leo_rpc_client_storage_14_at_13077 started with leo_pod_manager:start_link(leo_rpc_client_storage_14_at_13077, 8, 8, leo_rpc_client_conn, [storage_14,"10.16.1.74",13077,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[E] storage_01@10.0.1.106   2019-01-09 23:26:12.155201 +0800    1547047572  null:null   0   CRASH REPORT Process <0.23087.13> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[W] storage_01@10.0.1.106   2019-01-09 23:26:12.155398 +0800    1547047572  null:null   0   {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{body,{{shutdown,{failed_to_start_child,leo_rpc_client_storage_14_at_13077,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_storage_14_at_13077,{leo_pod_sup,start_link,[leo_rpc_client_storage_14_at_13077,8,8,leo_rpc_client_conn,[storage_14,"10.16.1.74",13077,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E] storage_01@10.0.1.106   2019-01-09 23:26:12.194692 +0800    1547047572  null:null   0   {module,"leo_pod_manager"},{function,"start_child/2"},{line,324},{body,{connection_error,{connection_error,econnrefused}}}
[E] storage_01@10.0.1.106   2019-01-09 23:26:12.194994 +0800    1547047572  null:null   0   Supervisor leo_rpc_client_storage_14_at_13077_sup had child leo_rpc_client_storage_14_at_13077 started with leo_pod_manager:start_link(leo_rpc_client_storage_14_at_13077, 8, 8, leo_rpc_client_conn, [storage_14,"10.16.1.74",13077,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[E] storage_01@10.0.1.106   2019-01-09 23:26:12.195140 +0800    1547047572  null:null   0   CRASH REPORT Process <0.23091.13> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[W] storage_01@10.0.1.106   2019-01-09 23:26:12.195310 +0800    1547047572  null:null   0   {module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{body,{{shutdown,{failed_to_start_child,leo_rpc_client_storage_14_at_13077,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_storage_14_at_13077,{leo_pod_sup,start_link,[leo_rpc_client_storage_14_at_13077,8,8,leo_rpc_client_conn,[storage_14,"10.16.1.74",13077,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
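The `econnrefused` errors at the end point at `storage_14` on `10.16.1.74:13077`, which appears to be an RPC endpoint of the remote cluster that `leo_sync_obj_with_dc_queue` replicates to. A minimal reachability check from the storage node, with host and port taken from the log lines above:

```python
#!/usr/bin/env python3
# Quick TCP reachability check for the RPC endpoint behind the econnrefused
# errors. Host and port are taken from the leo_rpc_client log lines.
import socket

HOST, PORT = "10.16.1.74", 13077

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"{HOST}:{PORT} is accepting connections")
except OSError as err:
    print(f"{HOST}:{PORT} is unreachable: {err}")
```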
NewDund commented 5 years ago

@mocchira This is my production environment. Please reply as soon as possible. Thank you very much.

NewDund commented 5 years ago

@mocchira Moreover, I have found that 500 errors occasionally appear now, which I had not seen before. Can you help me analyze why these 500s are happening?

NewDund commented 5 years ago

@mocchira Also, I found a very strange problem.

For one of my PUT requests, haproxy returned 200 to the client, but my leo_gateway never received the PUT request. Is this possible?

If it is, how can I tell whether a PUT actually succeeded?
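One way to answer this is to verify the object independently of the PUT response: follow the PUT with a HEAD on the same key and compare the size and ETag. Below is a sketch with boto3 (Python, rather than the aws-sdk-php client used above); the endpoint, bucket, key, and credentials are placeholders, and `signature_version="s3"` assumes the gateway still speaks the v2 signature that appears in the request logs.

```python
#!/usr/bin/env python3
# Sketch of a put-then-verify check against the LeoFS S3 gateway.
import hashlib

import boto3
from botocore.client import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://10.0.1.104",            # gateway / haproxy endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    config=Config(signature_version="s3", s3={"addressing_style": "path"}),
)

bucket, key = "xz-img-01", "test/verify-put.jpg"
body = open("local-file.jpg", "rb").read()

s3.put_object(Bucket=bucket, Key=key, Body=body)

# Verify independently of the PUT response (and of whatever haproxy returned):
# HEAD the key and compare the MD5. The ETag == MD5 check only holds for
# single-part (non-multipart) PUTs.
head = s3.head_object(Bucket=bucket, Key=key)
ok = head["ETag"].strip('"') == hashlib.md5(body).hexdigest()
print("verified" if ok else "mismatch", head["ContentLength"])
```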

mocchira commented 5 years ago

@NewDund Sorry for the long delay. I can spare some time tonight to look into this.