leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

INCONSISTENT HASH | manager node and other nodes. #1195

Open varuntanwar opened 5 years ago

varuntanwar commented 5 years ago

Hi,

We have 3 Nodes running. LEOFS version: 1.3.8-1

root@prd-leofs001:/home/varun.tanwar# leofs-adm status
 [System Confiuration]
-----------------------------------+----------
 Item                              | Value
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.3.8
                        cluster Id | leofs_1
                             DC Id | dc_1
                    Total replicas | 2
          number of successes of R | 1
          number of successes of W | 1
          number of successes of D | 1
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash |
                previous ring-hash |
-----------------------------------+----------

 [State of Node(s)]
-------+--------------------------------------------+--------------+----------------+----------------+----------------------------
 type  |                    node                    |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@prd-leofs001         | running      | e92344d1       | e92344d1       | 2019-09-16 16:05:13 +0550
  S    | storage_1@prd-leofs002         | running      | e92344d1       | e92344d1       | 2019-09-16 15:33:49 +0550
  S    | storage_2@prd-leofs003         | running      | e92344d1       | e92344d1       | 2019-09-16 15:34:27 +0550
  G    | gateway_http@prd-leofs003      | running      | e92344d1       | e92344d1       | 2019-09-16 15:35:27 +0550
  G    | gateway_s3@prd-leofs001        | running      | e92344d1       | e92344d1       | 2019-09-16 15:34:54 +0550
-------+--------------------------------------------+--------------+----------------+----------------+----------------------------

I have tried restarting all services (stop gateway -> stop storage -> stop manager, then start in the reverse order: manager -> storage -> gateway).
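
In other words, something like the following (a sketch only; the script paths assume a package-based install and may differ per host):

# stop in the order gateway -> storage -> manager (on the respective hosts)
/usr/local/leofs/current/leo_gateway/bin/leo_gateway stop
/usr/local/leofs/current/leo_storage/bin/leo_storage stop
/usr/local/leofs/current/leo_manager_0/bin/leo_manager stop

# start in the reverse order: manager -> storage -> gateway
/usr/local/leofs/current/leo_manager_0/bin/leo_manager start
/usr/local/leofs/current/leo_storage/bin/leo_storage start
/usr/local/leofs/current/leo_gateway/bin/leo_gateway start

# then re-check the cluster state
leofs-adm status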

When I run the whereis command, I get the following error:

leofs-adm whereis s3:///
[ERROR] Could not get ring

When I configure s3cmd, I get the following error:

Test access with supplied credentials? [Y/n] y
Please wait, attempting to list all buckets...
ERROR: Test failed: 403 (AccessDenied): Access Denied
ERROR: Are you sure your keys have s3:ListAllMyBuckets permissions?

but when I upload a file via s3cmd, it succeeds.
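
For context, a minimal ~/.s3cfg for talking to a LeoFS gateway looks roughly like the following (a sketch only: the endpoint, port, and keys are placeholders, and signature_v2 = True reflects the usual guidance for s3cmd against LeoFS):

[default]
access_key = YOUR_ACCESS_KEY_ID
secret_key = YOUR_SECRET_ACCESS_KEY
host_base = prd-leofs001:8080
host_bucket = %(bucket)s.prd-leofs001:8080
use_https = False
signature_v2 = True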

yosukehara commented 5 years ago

The RING may be broken, as shown below:

 Manager RING hash
-----------------------------------+----------
                 current ring-hash |
                previous ring-hash |
-----------------------------------+----------

You need to rebuild the RING. Refer to my comments below:
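
Before touching a production cluster, you can at least inspect the RING each node holds; a hedged sketch of the relevant leofs-adm commands (this is not the full rebuild procedure from the referenced comment):

# dump the RING the node currently holds so it can be inspected
leofs-adm dump-ring storage_0@prd-leofs001

# ask the managers to recover the RING of a storage node
leofs-adm recover-ring storage_0@prd-leofs001

# then re-check the manager and node ring hashes
leofs-adm status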

varuntanwar commented 5 years ago

@yosukehara

On the same thread, one user mentioned that after following the steps all users and buckets disappeared.

Should we go ahead? We can't afford any data loss.

varuntanwar commented 5 years ago

I have another question:

When I configure s3cmd, I get the following error:

Test access with supplied credentials? [Y/n] y
Please wait, attempting to list all buckets...
ERROR: Test failed: 403 (AccessDenied): Access Denied
ERROR: Are you sure your keys have s3:ListAllMyBuckets permissions?

but when I upload/download a file via s3cmd, it succeeds.

Is this because of the broken RING?

yosukehara commented 5 years ago

On the same thread, one user mentioned that after following the steps all users and buckets disappeared. Should we go ahead? We can't afford any data loss.

I can only provide this solution. I recommend that you build a test environment and verify the solution there first.

wulfgar93 commented 4 years ago

@yosukehara I've got the same problem. I tried to fix the managers' RING with this solution https://github.com/leo-project/leofs/issues/1185#issuecomment-495033961, but the old RING was not restored; a new RING was generated instead. So I restored the mnesia on the leo_manager nodes. I found another solution https://github.com/leo-project/leofs/issues/1078#issuecomment-406518483 and it helps, but if I restart the leo_manager service the RING becomes broken again.
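
As an aside, it may be worth snapshotting the managers' mnesia tables with leofs-adm before experimenting further (the backup path below is only an example):

# back up the manager's mnesia tables (users, buckets, system/cluster information)
leofs-adm backup-mnesia /var/backups/leofs/mnesia_backup

# restore them later if an attempted RING fix makes things worse
leofs-adm restore-mnesia /var/backups/leofs/mnesia_backup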

wulfgar93 commented 4 years ago

@yosukehara I can add the repeating error message from the leo_manager error log:

[W] manager_0@10.6.0.40 2020-02-29 20:55:44.439025 +0300 1582998944 leo_manager_api:brutal_synchronize_ring_1/2 2143 [{cause,invalid_db}]

yosukehara commented 4 years ago

@wulfgar93 It seems to be an error in your LeoFS settings, judging from the message below:

leo_manager_api:brutal_synchronize_ring_1/2 2143 [{cause,invalid_db}]

This error is raised at leo_redundant_manager/src/leo_cluster_tbl_member.erl#L190-L191. Could you review the settings of your LeoFS with reference to this page?
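
A quick way to do that review is to pull the relevant keys out of each node's conf files and compare them side by side (a sketch; the manager conf path is assumed from the default layout, the storage path is the one shown later in this thread):

# manager side: system/consistency settings the cluster was built with
grep -E '^(system\.|consistency\.)' /etc/leofs/leo_manager_0/leo_manager.conf

# storage side: which managers it points at and its container settings
grep -E '^(managers|obj_containers\.)' /etc/leofs/leo_storage/leo_storage.d/20-leo_storage.conf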

wulfgar93 commented 4 years ago

@yosukehara I did not find any strange settings in the manager's conf file. Could you please take a look at its settings?

sasl.sasl_error_log = /var/log/leofs/leo_manager_0/sasl/sasl-error.log
sasl.errlog_type = error
sasl.error_logger_mf_dir = /var/log/leofs/leo_manager_0/sasl
sasl.error_logger_mf_maxbytes = 10485760
sasl.error_logger_mf_maxfiles = 5

manager.partner = manager_1@10.6.0.41
console.bind_address = 10.10.0.40
console.port.cui  = 10010
console.port.json = 10020
console.acceptors.cui = 3
console.acceptors.json = 16
console.histories.num_of_display = 200

system.dc_id = dc_1
system.cluster_id = leofs_1
consistency.num_of_replicas = 3
consistency.write = 2
consistency.read = 1
consistency.delete = 2
consistency.rack_aware_replicas = 0

mnesia.dir = /var/lib/leofs/work/mnesia/10.6.0.40
mnesia.dump_log_write_threshold = 50000
mnesia.dc_dump_limit = 40

log.log_level = 2
log.erlang = /var/log/leofs/leo_manager_0/erlang
log.app = /var/log/leofs/leo_manager_0/app
log.member_dir = /var/log/leofs/leo_manager_0/ring
log.ring_dir = /var/log/leofs/leo_manager_0/ring

queue_dir = /var/lib/leofs/work/queue
snmp_agent = /etc/leofs/leo_manager_0/snmpa_manager_0/LEO-MANAGER

rpc.server.acceptors = 16
rpc.server.listen_port = 13075
rpc.server.listen_timeout = 5000
rpc.client.connection_pool_size = 16
rpc.client.connection_buffer_size = 16

nodename = manager_0@10.6.0.40
distributed_cookie = <censored>
erlang.kernel_poll = true
erlang.asyc_threads = 32
erlang.max_ports = 64000
erlang.crash_dump = /var/log/leofs/leo_manager_0/erl_crash.dump
erlang.max_ets_tables = 256000
erlang.smp = enable
process_limit = 1048576
snmp_conf = /etc/leofs/leo_manager_0/leo_manager_snmp.config

wulfgar93 commented 3 years ago

@yosukehara

yosukehara commented 3 years ago

@wulfgar93 I'm sorry for the delay in replying. Please share the LeoStorage configuration of your LeoFS (not the LeoManager configuration).

wulfgar93 commented 3 years ago

@yosukehara

# cat /etc/leofs/leo_storage/leo_storage.d/20-leo_storage.conf
sasl.sasl_error_log = /var/log/leofs/leo_storage/sasl/sasl-error.log
sasl.errlog_type = error
sasl.error_logger_mf_dir = /var/log/leofs/leo_storage/sasl
sasl.error_logger_mf_maxbytes = 10485760
sasl.error_logger_mf_maxfiles = 5

managers = [manager_0@10.6.0.40, manager_1@10.6.0.41]

obj_containers.path = [/mnt/storage/leofs/leo_storage]
obj_containers.num_of_containers = [128]

obj_containers.metadata_storage = leveldb

object_storage.is_strict_check = true

watchdog.cpu.threshold_cpu_load_avg = 8.0

mq.backend_db = leveldb
mq.num_of_mq_procs = 8

backend_db.eleveldb.write_buf_size = 100663296
backend_db.eleveldb.max_open_files = 1000
backend_db.eleveldb.sst_block_size = 4096

log.erlang = /var/log/leofs/leo_storage/erlang
log.app = /var/log/leofs/leo_storage/app
log.member_dir = /var/log/leofs/leo_storage/ring
log.ring_dir = /var/log/leofs/leo_storage/ring
queue_dir  = /var/lib/leofs/leo_storage/work/queue
leo_ordning_reda.temp_stacked_dir = "/mnt/storage/leofs/leo_storage/work/ord_reda/"

nodename = stor0snode0@10.6.0.60
distributed_cookie = <censored>

erlang.crash_dump = /var/log/leofs/leo_storage/erl_crash.dump

snmp_conf = /etc/leofs/leo_storage/leo_storage_snmp.config

autonomic_op.compaction.is_enabled = true
autonomic_op.compaction.warn_active_size_ratio = 92
autonomic_op.compaction.threshold_active_size_ratio = 90

yosukehara commented 3 years ago

I've reviewed the LeoManager and LeoStorage configuration files, but I could not find an issue.

So I recommend that you rebuild your LeoFS RING. I will share the procedure soon.

yosukehara commented 3 years ago

I recall that I recommended the same solution last year: https://github.com/leo-project/leofs/issues/1195#issuecomment-532504776