leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

Got "Internal error" when one storage node was down. #217

Closed by osier 10 years ago

osier commented 10 years ago

I manually shut down one storage node during testing (while the client kept putting objects into one bucket) to emulate a hardware failure taking a storage node down in production, and got an "Internal error" like the one below:

create object image_file59339
done
create object image_file59340
done
create object image_file59341
Traceback (most recent call last):
  File "s3.py", line 30, in <module>
    s3_object.set_contents_from_filename("/usr/bin/openssl")
  File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 1306, in set_contents_from_filename
    encrypt_key=encrypt_key)
  File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 1237, in set_contents_from_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 722, in send_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/lib/python2.7/dist-packages/boto/s3/key.py", line 905, in _send_file_internal
    query_args=query_args
  File "/usr/lib/python2.7/dist-packages/boto/s3/connection.py", line 547, in make_request
    retry_handler=retry_handler
  File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 1017, in make_request
    retry_handler=retry_handler)
  File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 973, in _mexe
    raise BotoServerError(response.status, response.reason, body)
boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
<?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>catdog1/image_file</Resource><RequestId></RequestId></Error>
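
For reference, the s3.py in the traceback is essentially a boto upload loop; a minimal sketch is below (the gateway host, port, and credentials are placeholders, not values from the original script):

    # s3.py -- minimal boto (Python 2) upload loop matching the output above.
    # Host, port, and credentials are placeholders; adjust for your gateway.
    from boto.s3.connection import S3Connection, OrdinaryCallingFormat

    conn = S3Connection(
        aws_access_key_id='YOUR_ACCESS_KEY_ID',
        aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
        host='localhost', port=8080, is_secure=False,
        calling_format=OrdinaryCallingFormat())

    bucket = conn.get_bucket('catdog1')
    i = 0
    while True:
        print 'create object image_file%d' % i
        key = bucket.new_key('image_file%d' % i)
        # This call raised BotoServerError: 500 Internal Server Error
        # while one storage node was down.
        key.set_contents_from_filename('/usr/bin/openssl')
        print 'done'
        i += 1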

It's fine and expected for requests to be slower in such a case, but failed requests are not really acceptable in practice.

Also, is there a way to monitor the recovery (I mean the process of restoring the right number of replicas) in this case, i.e. when a storage node is down and out?
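
For context, LeoFS's admin CLI can be used to watch that kind of recovery; a rough sketch follows (the command names assume a LeoFS 1.x leofs-adm, and output formats vary by version):

    # Check which nodes are up and their ring hashes.
    leofs-adm status

    # After restarting the stopped node, trigger re-replication of the
    # data that node should hold.
    leofs-adm recover-node storage_1@192.168.24.81

    # Recovery is driven by per-node message queues; the queues draining
    # to zero indicates replication has caught up.
    leofs-adm mq-stats storage_1@192.168.24.81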

Please correct me if I'm wrong.

Regards, Osier

yosukehara commented 10 years ago

It would be great if you could share the logs of the LeoFS gateway node and the LeoFS storage node(s) so that I can look into this issue.

Thank you for your report.

osier commented 10 years ago

It's not always reproducible. I will capture the debug log and add it here once I reproduce it. Thanks.

adhioutlined commented 10 years ago

Same situation here

status
[System config]
                System version : 1.1.0
                    Cluster Id : leofs_1
                         DC Id : dc_1
                Total replicas : 2
           # of successes of R : 2
           # of successes of W : 2
           # of successes of D : 2
    # of DC-awareness replicas : 0
                     ring size : 2^128
             Current ring hash : 34a3fe49
                Prev ring hash : 17527522
[Multi DC replication settings]
         max # of joinable DCs : 2
            # of replicas a DC : 1

[Node(s) state]
-------+------------------------------+--------------+----------------+----------------+----------------------------
 type  |             node             |    state     |  current ring  |   prev ring    |          updated at
-------+------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@192.168.24.80      | running      | 34a3fe49       | 17527522       | 2014-08-14 07:14:17 +0700
  S    | storage_1@192.168.24.81      | stop         |                |                | 2014-08-17 03:18:25 +0700
  S    | storage_2@192.168.24.82      | running      | 34a3fe49       | 17527522       | 2014-08-14 18:00:07 +0700
  S    | storage_3@192.168.24.83      | running      | 34a3fe49       | 17527522       | 2014-08-15 09:40:05 +0700
  G    | gateway_0@192.168.24.80      | running      | 5ef2c02f       | 5ef2c02f       | 2014-08-14 07:14:40 +0700

And here is my leo_gateway log from when I created a new folder with DragonDisk:

[W] gateway_0@192.168.24.80 2014-08-17 03:24:29.793998 +0700    1408220669  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:head, cause:nodedown
[W] gateway_0@192.168.24.80 2014-08-17 03:24:32.793786 +0700    1408220672  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:32.795592 +0700    1408220672  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:35.793779 +0700    1408220675  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:35.795311 +0700    1408220675  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:38.795507 +0700    1408220678  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:38.813896 +0700    1408220678  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:41.793983 +0700    1408220681  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:41.794850 +0700    1408220681  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:44.856817 +0700    1408220684  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:44.858454 +0700    1408220684  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:47.842020 +0700    1408220687  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:47.843686 +0700    1408220687  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:50.841871 +0700    1408220690  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:50.842752 +0700    1408220690  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:53.922133 +0700    1408220693  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:53.923060 +0700    1408220693  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}
[W] gateway_0@192.168.24.80 2014-08-17 03:24:56.922258 +0700    1408220696  leo_gateway_rpc_handler:handle_error/5  255 node:'storage_1@192.168.24.81', mod:leo_storage_handler_object, method:put, cause:nodedown
[E] gateway_0@192.168.24.80 2014-08-17 03:24:56.934069 +0700    1408220696  leo_gateway_s3_api:handle_2/6   551 path:adhi/yoih/, cause:{{case_clause,{error,"Replicate failure"}},[{leo_gateway_http_commons,put_small_object,3,[{file,"src/leo_gateway_http_commons.erl"},{line,484}]},{leo_gateway_s3_api,handle_2,6,[{file,"src/leo_gateway_s3_api.erl"},{line,548}]},{cowboy_handler,handler_handle,4,[{file,"src/cowboy_handler.erl"},{line,119}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,523}]}]}

There were no error logs in leo_storage.

yosukehara commented 10 years ago

@adhioutlined Thank you for sharing the log. I will start looking into this issue today.

yosukehara commented 10 years ago

@adhioutlined I've checked your consistency level. With "Total replicas = 2" and "# of successes of W = 2", every PUT must succeed on both replicas, so a PUT will almost certainly fail in this setup while a stopped or suspended node exists.

        Total replicas : 2
   # of successes of R : 2
   # of successes of W : 2
   # of successes of D : 2

I recommend one of the following consistency levels instead:

Option 1:

        Total replicas : 2
   # of successes of R : 1
   # of successes of W : 1
   # of successes of D : 1

Option 2:

        Total replicas : 3
   # of successes of R : 1
   # of successes of W : 2
   # of successes of D : 2
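
For reference, these levels map to the consistency.* settings in leo_manager.conf; here is a sketch of the second option (the key names assume a LeoFS 1.x configuration file, so verify against your version before applying):

    # leo_manager.conf -- consistency level for the "Total replicas : 3" option
    consistency.num_of_replicas = 3    # Total replicas
    consistency.read            = 1    # # of successes of R
    consistency.write           = 2    # # of successes of W
    consistency.delete          = 2    # # of successes of D
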
yosukehara commented 10 years ago

I've closed this issue. If you run into it again, please reopen it.