It would be great if you could share the logs of the LeoFS gateway node and the LeoFS storage node(s) so that I can look into this issue.
Thank you for your report.
It is not always reproducible; I will capture the debug log and add it here once it happens again. Thanks.
Same situation here. Here is the output of the status command:
[System config]
System version : 1.1.0
Cluster Id : leofs_1
DC Id : dc_1
Total replicas : 2
# of successes of R : 2
# of successes of W : 2
# of successes of D : 2
# of DC-awareness replicas : 0
ring size : 2^128
Current ring hash : 34a3fe49
Prev ring hash : 17527522
[Multi DC replication settings]
max # of joinable DCs : 2
# of replicas a DC : 1
[Node(s) state]
-------+------------------------------+--------------+----------------+----------------+----------------------------
type | node | state | current ring | prev ring | updated at
-------+------------------------------+--------------+----------------+----------------+----------------------------
S | storage_0@192.168.24.80 | running | 34a3fe49 | 17527522 | 2014-08-14 07:14:17 +0700
S | storage_1@192.168.24.81 | stop | | | 2014-08-17 03:18:25 +0700
S | storage_2@192.168.24.82 | running | 34a3fe49 | 17527522 | 2014-08-14 18:00:07 +0700
S | storage_3@192.168.24.83 | running | 34a3fe49 | 17527522 | 2014-08-15 09:40:05 +0700
G | gateway_0@192.168.24.80 | running | 5ef2c02f | 5ef2c02f | 2014-08-14 07:14:40 +0700
And here is my leo_gateway log from while I create a new folder with DragonDisk:
[W] gateway_0@192.168.24.80 2014-08-17 03:24:29.793998 +0700 1408220669 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:head, cause:nodedown
[W] gateway_0@192.168.24.80 2014-08-17 03:24:32.793786 +0700 1408220672 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:32.795592 +0700 1408220672 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:35.793779 +0700 1408220675 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:35.795311 +0700 1408220675 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:38.795507 +0700 1408220678 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:38.813896 +0700 1408220678 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:41.793983 +0700 1408220681 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:41.794850 +0700 1408220681 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:44.856817 +0700 1408220684 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:44.858454 +0700 1408220684 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:47.842020 +0700 1408220687 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:47.843686 +0700 1408220687 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:50.841871 +0700 1408220690 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:50.842752 +0700 1408220690 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:53.922133 +0700 1408220693 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:53.923060 +0700 1408220693 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:56.922258 +0700 1408220696 leo_gateway_rpc_handler:handle_error/5 255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:56.934069 +0700 1408220696 leo_gateway_s3_api:handle_2/6 551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
There is no error log in leo_storage.
@adhioutlined Thank you for sharing the log. I will start checking this issue today.
@adhioutlined I've looked at your consistency level. With "Total replicas = 2" and "# of successes of W = 2", a PUT almost certainly fails in this setup whenever a stopped or suspended node exists: every write must be acknowledged by both replicas, so any key whose replica set includes the stopped node cannot satisfy W = 2.
Total replicas : 2
# of successes of R : 2
# of successes of W : 2
# of successes of D : 2
I recommend one of the following consistency levels (a configuration sketch follows the two options):
Option 1:
Total replicas : 2
# of successes of R : 1
# of successes of W : 1
# of successes of D : 1

Option 2:
Total replicas : 3
# of successes of R : 1
# of successes of W : 2
# of successes of D : 2
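As a minimal sketch, Option 2 would look roughly like this in leo_manager's configuration (the consistency.* key names are assumed from the LeoFS 1.x leo_manager.conf template and take effect when the system is launched for the first time, so please verify them against your own config file before relying on this):

## leo_manager.conf -- consistency level (sketch; key names assumed, verify before use)
## Total replicas
consistency.num_of_replicas = 3
## # of successes of W
consistency.write = 2
## # of successes of R
consistency.read = 1
## # of successes of D
consistency.delete = 2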
I closed this issue. If you face it again, please reopen it.
I manually shut down one storage node during testing (while the client kept putting objects into one bucket) to emulate a hardware failure that takes a storage node down in production, and got "Internal error" like below:
It is fine and expected for requests to be slower in such a case, but failed requests are not acceptable in practice.
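For illustration only, the kind of PUT loop I mean is sketched below in Python with boto3; the endpoint, bucket name, and credentials are placeholders rather than values from this cluster:

import boto3
from botocore.client import Config

# Placeholder gateway endpoint and credentials -- replace with your own.
s3 = boto3.client(
    "s3",
    endpoint_url="http://192.168.24.80:8080",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    config=Config(s3={"addressing_style": "path"}),
)

bucket = "test-bucket"  # placeholder bucket name
for i in range(10000):
    key = "obj-%06d" % i
    try:
        # Each PUT goes through the gateway; while a storage node is down and
        # the write quorum cannot be met, these calls fail with an internal error.
        s3.put_object(Bucket=bucket, Key=key, Body=b"x" * 1024)
    except Exception as exc:
        print("PUT %s failed: %s" % (key, exc))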
Also, is there a way to monitor the recovery (I mean the process of restoring the correct number of replicas) in this case, i.e. when a storage node is down and out?
Please correct me if I'm wrong.
Regards, Osier