leofs-storage {cause,not_found}

varuntanwar commented 5 years ago

A few days back we had added a few nodes in the leofs cluster, but we removed them later. But it is still looking for the the removed nodes and throwing the following error logs:

[W] storage_101@prd-leofs101 2018-11-03 11:44:48.578937 +0550 1541225688 leo_storage_mq:rebalance_1/1 912 [{node,'storage_107@prd-leofs107'},{addr_id,38174905918673712019893746042006228611},{key,<<"datastore/report/dbe8d1bf-b5db-4b25-bcfg-146f2762ec7g">>},{cause,not_found}]

why it is still looking for the the removed nodes? I have executed the rebalancing command after removing the nodes. Is there a caching mechanism in the backend which I need to clear/refresh? Restarting the leofs cluster is not preferable.

varuntanwar commented 5 years ago

I have 5 nodes in the cluster: storage_106 to storage110 and removed all of them. But error is coming for only 3 nodes storage{106..108}

mocchira commented 5 years ago

@varuntanwar Can you provide us with the information described on https://github.com/leo-project/leofs/wiki/template-of-an-issue-report to look into further your problem?

varuntanwar commented 5 years ago

Purpose : Production Environments :

LeoFS Version : 1.3.8
Erlang Version : Erlang/OTP 18
Kind of virtualisation : Xen
Operating system : Ubuntu 16
Processor Architecture:

processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : QEMU Virtual CPU version 2.5+ stepping : 3 microcode : 0x1 cpu MHz : 1995.312 cache size : 4096 KB physical id : 0 siblings : 1 core id : 0 cpu cores : 1 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni vmx cx16 x2apic hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid bugs : bogomips : 3990.62 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:

processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : QEMU Virtual CPU version 2.5+ stepping : 3 microcode : 0x1 cpu MHz : 1995.312 cache size : 4096 KB physical id : 1 siblings : 1 core id : 0 cpu cores : 1 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni vmx cx16 x2apic hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid bugs : bogomips : 3990.62 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:

processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : QEMU Virtual CPU version 2.5+ stepping : 3 microcode : 0x1 cpu MHz : 1995.312 cache size : 4096 KB physical id : 2 siblings : 1 core id : 0 cpu cores : 1 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni vmx cx16 x2apic hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid bugs : bogomips : 3990.62 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:

processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : QEMU Virtual CPU version 2.5+ stepping : 3 microcode : 0x1 cpu MHz : 1995.312 cache size : 4096 KB physical id : 3 siblings : 1 core id : 0 cpu cores : 1 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni vmx cx16 x2apic hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid bugs : bogomips : 3990.62 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:

Memory:

MemTotal: 8175088 kB MemFree: 356912 kB MemAvailable: 6645160 kB Buffers: 76932 kB Cached: 6245240 kB SwapCached: 0 kB Active: 5309928 kB Inactive: 2150408 kB Active(anon): 1141116 kB Inactive(anon): 252 kB Active(file): 4168812 kB Inactive(file): 2150156 kB Unevictable: 3652 kB Mlocked: 3652 kB SwapTotal: 0 kB SwapFree: 0 kB Dirty: 34480 kB Writeback: 0 kB AnonPages: 1141892 kB Mapped: 164188 kB Shmem: 780 kB Slab: 303780 kB SReclaimable: 275256 kB SUnreclaim: 28524 kB KernelStack: 10064 kB PageTables: 11592 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 4087544 kB Committed_AS: 3378900 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 737280 kB CmaTotal: 0 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 94076 kB DirectMap2M: 8294400 kB

What happened :

We added 5 nodes in the cluster to increase the capacity of the cluster (executed the rebalancing). After adding nodes, most of the files were not accessible (replication factor : 2). So, we removed the nodes one by one and ran the rebalancing again. After that, error rate dropped but still few files are not accessible.

After removing the nodes, we can still see in the storage logs that is trying to write in those nodes which we removed. [W] storage_101@prd-leofs101 2018-11-03 11:44:48.578937 +0550 1541225688 leo_storage_mq:rebalance_1/1 912 [{node,'storage_107@prd-leofs107'},{addr_id,38174905918673712019893746042006228611},{key,<<"datastore/report/dbe8d1bf-b5db-4b25-bcfg-146f2762ec7g">>},{cause,not_found}]

System Information :

[System Confiuration]
-----------------------------------+----------
 Item                              | Value
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.3.8
                        cluster Id | leofs_1
                             DC Id | dc_1
                    Total replicas | 2
          number of successes of R | 1
          number of successes of W | 1
          number of successes of D | 1
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | a86bd425
                previous ring-hash | a86bd425
-----------------------------------+----------

 [State of Node(s)]
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------
 type  |                   node                    |    state     |  current ring  |   prev ring    |          updated at
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_101@prd-leofs101      | running      | a86bd425       | a86bd425       | 2018-11-02 23:46:22 +0550
  S    | storage_102@prd-leofs102      | running      | a86bd425       | a86bd425       | 2018-08-06 13:48:16 +0550
  S    | storage_103@prd-leofs103      | running      | a86bd425       | a86bd425       | 2018-08-06 13:48:16 +0550
  S    | storage_104@prd-leofs104     | running      | a86bd425       | a86bd425       | 2018-11-02 22:57:35 +0550
  S    | storage_105@prd-leofs105      | running      | a86bd425       | a86bd425       | 2018-09-06 19:29:07 +0550
  G    | gateway_102@prd-leofs102      | running      | a86bd425       | a86bd425       | 2018-08-06 13:48:45 +0550
  G    | gateway_103@prd-leofs103     | running      | a86bd425       | a86bd425       | 2018-09-06 19:39:03 +0550
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------

mocchira commented 5 years ago

@varuntanwar Thanks for sharing the detailed info.

After removing the nodes, we can still see in the storage logs that is trying to write in those nodes which we removed. [W] storage_101@prd-leofs101 2018-11-03 11:44:48.578937 +0550 1541225688 leo_storage_mq:rebalance_1/1 912 [{node,'storage_107@prd-leofs107'},{addr_id,38174905918673712019893746042006228611},{key,<<"datastore/report/dbe8d1bf-b5db-4b25-bcfg-146f2762ec7g">>},{cause,not_found}]

The reason why you can still see such log entries is that there are still some messages in the queue files which keep trying to rebalance (probably based on your first attempt) so stopping leo_storage and deleting all queue files on leo_storage and restarting on all leo_storage nodes might solve your problem.

(The queue files are stored under the path specified by https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L337 in leo_storage.conf)

We added 5 nodes in the cluster to increase the capacity of the cluster (executed the rebalancing). After adding nodes, most of the files were not accessible (replication factor : 2). So, we removed the nodes one by one and ran the rebalancing again. After that, error rate dropped but still few files are not accessible.

Let me confirm that did you try to add 5 nodes at once? then you might have some trouble especially in case replication factor = 2. Instead I'd recommend you try to add a node and rebalance one by one then the problem you faced at the previous attempt (most of the files were not accessible) should not happen.

mocchira commented 5 years ago

Notes: As for the first failure (after rebalancing with 5 storage nodes at once, many objects were inaccessible) @varuntanwar faced, we are looking into the root cause now. Once we find the root cause, we'll file it as another issue later.

varuntanwar commented 5 years ago

The reason why you can still see such log entries is that there are still some messages in the queue files which keep trying to rebalance

Will this cause any data loss if we don't restart the storage nodes ?

Let me confirm that did you try to add 5 nodes at once? then you might have some trouble especially in case replication factor = 2.

Yes. So we removed the nodes immediately and did rebalance one by one. After that error rate dropped but still few files are not accessible.

mocchira commented 5 years ago

Will this cause any data loss if we don't restart the storage nodes ?

Theoretically no data loss will happen although you keep seeing the error log lines forever without wiping out the queue files and restarting the storage nodes.

Yes. So we removed the nodes immediately and did rebalance one by one. After that error rate dropped but still few files are not accessible.

What will "leofs-adm whereis" show against a few files which are not accessible now?

varuntanwar commented 5 years ago

What will "leofs-adm whereis" show against a few files which are not accessible now?

root@prd-leofs101:/home/varun.tanwar# leofs-adm whereis datastore/report/4f8a5b00-f284-47a5-aed3-2d9a3a59f8f1
-------+-------------------------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |                   node                    |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+-------------------------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | storage_104@prd-leofs104                  |                                      |            |              |                |                |                |
       | storage_103@prd-leofs103                  |                                      |            |              |                |                |                |

Even for the files which are accessible, we are not getting any details ( Ring address, size, checksum and other details) in whereis command

mocchira commented 5 years ago

Even for the files which are accessible, we are not getting any details ( Ring address, size, checksum and other details) in whereis command

Let me confirm that means you can't get any details of all files stored in your cluster regardless of whether it's accessible or not? then it might be the symptom that the cluster RING has been broken for some reason. Can you provide us with the RING information by using "leofs-adm dump-ring"?

After executing "dump-ring", you can see the following files under the path specified by https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L328

leofs@cat2neat:leofs.dev$ ls -al package/leo_storage_0/log/ring/
drwxrwxr-x 2 leofs leofs   4096 Nov  7 16:56 .
-rw-rw-r-- 1 leofs leofs    575 Nov  7 16:56 members_cur.dump.63708796585
-rw-rw-r-- 1 leofs leofs    230 Nov  7 16:56 members_prv.dump.63708796585
-rw-rw-r-- 1 leofs leofs  70225 Nov  7 16:56 ring_cur.dump.63708796585
-rw-rw-r-- 1 leofs leofs 220689 Nov  7 16:56 ring_cur_worker.log.63708796585
-rw-rw-r-- 1 leofs leofs  28075 Nov  7 16:56 ring_prv.dump.63708796585
-rw-rw-r-- 1 leofs leofs 291389 Nov  7 16:56 ring_prv_worker.log.63708796586

Attaching files on the comment on this issue would be helpful to us.

varuntanwar commented 5 years ago

Let me confirm that means you can't get any details of all files stored in your cluster regardless of whether it's accessible or not?

Earlier I was only getting the errors with the few files but right now I am not able to reproduce the issue. I can fetch the details of the files which are accessible.

But not getting any details for the files which are not accessible. Seems like somehow those files got deleted from the cluster

mocchira commented 5 years ago

@varuntanwar

But not getting any details for the files which are not accessible. Seems like somehow those files got deleted from the cluster

This is the problem.

As you may know, we have filed a few issues recently like

1157
1158
1159

All of these are related to the problem which can happen in case running rebalance with multiple storage nodes at once, that being said, we suspect that data loss could happen in that cases so we will prohibit "rebalance with multiple nodes at once" and also document (Don't rebalance with multiple nodes until the next stable release 1.4.3 comes out) about it on our official website.

Sorry for the inconvenience.

As for the files which might be deleted on your cluster, if you haven't run the compaction yet, there is a way to recover those files from the cluster (actually those files are logically deleted so we can retrieve the original content) so please let us know if you need the help.

varuntanwar commented 5 years ago

As for the files which might be deleted on your cluster, if you haven't run the compaction yet, there is a way to recover those files from the cluster (actually those files are logically deleted so we can retrieve the original content) so please let us know if you need the help.

Yes, I want to recover those files. I haven't run the compaction. I have executed the recover-file and recover-node(only on 2 nodes) but no luck.

varuntanwar commented 5 years ago

Also, for this bucket, s3cmd ls is not working.

s3cmd ls s3://datastore/
WARNING: Retrying failed request: /?delimiter=/ ('')
WARNING: Waiting 3 sec...

It is working fine for other cluster.

mocchira commented 5 years ago

Yes, I want to recover those files. I haven't run the compaction. I have executed the recover-file and recover-node(only on 2 nodes) but no luck.

OK. At first, please refer to https://leo-project.net/leofs/docs/admin/system_operations/data/#diagnosis . As you can see, "leofs-adm diagnose-start" allows you to dump the object list including its offset, filename, size, etc.. so if you know the filename to be recovered then you can retrieve the original content based on its offset and size.

Steps

"leofs-adm whereis $PATH_TO_FILE" to grasp which nodes are supposed to have the file
"leofs-adm diagnose-start $NODE_SUPPOSED_TO_HAVE_FILE"
Log into $NODE_SUPPOSED_TO_HAVE_FILE and check out the path in which the result of "diagnose-start" are dumped (https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L48 ${obj_containers.path}/log is where the result files exist. its filename looks like leo_object_storage_0.20181113.14.2 and there are https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L49 num_of_containers files so you would have to check all files till you find the filename as described below
"grep" the dumped file with $PATH_TO_FILE to find the line where its offset and size exists
Once you find the line, check the below three items
- the name of dumped file (eg. leo_object_storage_0.20181113.14.2)
- the offset found in the dumped file
- the size found in the dumped file
Since the name of dumped file are formatted in leo_objectstorage${CONTAINER_ID}.${YYYYMMDD}.${HI}.${SEQ_ID}, you get to know ${CONTAINER_ID}

the raw filename including the content of $PATH_TO_FILE would be ${obj_containers.path}/object/${CONTAINER_ID}.avs (the directory in which ${CONTAINER_ID}.avs exists looks like below

leofs@cat2neat:leofs.dev$ ls -al package/leo_storage_0/avs/object/
total 755208
drwxrwxr-x 2 leofs leofs      4096 Nov 13 14:02 .
drwxrwxr-x 5 leofs leofs      4096 Nov 13 14:02 ..
lrwxrwxrwx 1 leofs leofs        76 Nov 13 14:02 0.avs -> /home/leofs/dev/leofs.dev/package/leo_storage_0/avs/object/0.avs_63709304574
-rw-rw-r-- 1 leofs leofs  88746197 Nov 13 14:09 0.avs_63709304574
lrwxrwxrwx 1 leofs leofs        76 Nov 13 14:02 1.avs -> /home/leofs/dev/leofs.dev/package/leo_storage_0/avs/object/1.avs_63709304574
-rw-rw-r-- 1 leofs leofs  90892906 Nov 13 14:09 1.avs_63709304574
lrwxrwxrwx 1 leofs leofs        76 Nov 13 14:02 2.avs -> /home/leofs/dev/leofs.dev/package/leo_storage_0/avs/object/2.avs_63709304574
-rw-rw-r-- 1 leofs leofs  96704658 Nov 13 14:09 2.avs_63709304574
lrwxrwxrwx 1 leofs leofs        76 Nov 13 14:02 3.avs -> /home/leofs/dev/leofs.dev/package/leo_storage_0/avs/object/3.avs_63709304574
-rw-rw-r-- 1 leofs leofs 105305543 Nov 13 14:09 3.avs_63709304574
lrwxrwxrwx 1 leofs leofs        76 Nov 13 14:02 4.avs -> /home/leofs/dev/leofs.dev/package/leo_storage_0/avs/object/4.avs_63709304573
-rw-rw-r-- 1 leofs leofs  87515006 Nov 13 14:09 4.avs_63709304573
lrwxrwxrwx 1 leofs leofs        76 Nov 13 14:02 5.avs -> /home/leofs/dev/leofs.dev/package/leo_storage_0/avs/object/5.avs_63709304573
-rw-rw-r-- 1 leofs leofs 106672368 Nov 13 14:09 5.avs_63709304573
lrwxrwxrwx 1 leofs leofs        76 Nov 13 14:02 6.avs -> /home/leofs/dev/leofs.dev/package/leo_storage_0/avs/object/6.avs_63709304573
-rw-rw-r-- 1 leofs leofs 101203095 Nov 13 14:09 6.avs_63709304573
lrwxrwxrwx 1 leofs leofs        76 Nov 13 14:02 7.avs -> /home/leofs/dev/leofs.dev/package/leo_storage_0/avs/object/7.avs_63709304573
-rw-rw-r-- 1 leofs leofs  96205274 Nov 13 14:09 7.avs_63709304573

so if ${CONTAINER_ID} is 1 then the raw file including the content you want to recover would be 1.avs

the original content is stored in the raw file above from its offset to ${offset + size - 1} so you can retrieve the content using unix "cut" command or any script you would love to use

Cautions: The above step can be used only in case the file size is under 5MB. if the file size is over 5MB then a little extra steps would be needed so let me know if this is the case.

mocchira commented 5 years ago

Also, for this bucket, s3cmd ls is not working. It is working fine for other cluster.

if there are so many files under s3://datastore/ then it might not work (to be precise, take very long time) due to the limitation we have for now (See this issue https://github.com/leo-project/leofs/issues/548 for more details) so we'd recommend you not to use s3cmd ls (and also any ls alternatives which are exposed through other aws-sdk tools) because it not only takes much time to retrieve the result but also consumes much system resources (CPU, Disk, Network) with the current implementation (we are going to fix the issue on version 2.0). Nevertheless you want to use "s3cmd ls" then you would have to divide the one big directory into more smaller ones like

s3://datastore/00/00
s3://datastore/00/01
s3://datastore/00/02
s3://datastore/00/03
...
s3://datastore/00/97
s3://datastore/00/98
s3://datastore/00/99
s3://datastore/02/00
s3://datastore/02/01
...
s3://datastore/99/97
s3://datastore/99/98
s3://datastore/99/99

In this way, you can divide the datastore directory into 10,000 directories and as a result, since each directory get to have more smaller number of files now, "s3cmd ls s3://datastore/01/02" should work as intended.

Feel free to ask me any questions.

varuntanwar commented 5 years ago

At first, please refer to https://leo-project.net/leofs/docs/admin/system_operations/data/#diagnosis . As you can see, "leofs-adm diagnose-start" allows you to dump the object list including its offset, filename, size, etc.. so if you know the filename to be recovered then you can retrieve the original content based on its offset and size.

I dumped the object list on all the 5 nodes and then tried to "grep" the dumped file with $PATH_TO_FILE but didn't find the line. I tried it for 100 files

varuntanwar commented 5 years ago

I have a question regarding taking the backup of the leofs cluster, i know we can do it through Multi DC replication.

But can we do this : Take the backup of data inside https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L48 and store it on some other server.

Case 1: Suppose storage_node_102 goes down for some reason, we can suspend the node then add a new node with the same name. We copy the content back to same directory https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L48. And resume the node. Will this work?

Case 2: If lose some data in future, can we recover that by replacing the data inside https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L48 with the backup

mocchira commented 5 years ago

@varuntanwar Thanks for the quick try.

I dumped the object list on all the 5 nodes and then tried to "grep" the dumped file with $PATH_TO_FILE but didn't find the line. I tried it for 100 files

Sounds wired. If possible, can you share the dumped files with us? I'd like to check if it's the file I'm supposed to and it's not corrupted.

But can we do this : Take the backup of data inside https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L48 and store it on some other server.

Theoretically yes if you use a snapshot feature provided by the underlying file system, however we don't recommend taking backup in terms of the total disk capacity including the backup space. But it depends on the use case and requirements for your system so please go ahead as you like.

Suppose storage_node_102 goes down for some reason, we can suspend the node then add a new node with the same name. We copy the content back to same directory https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L48. And resume the node. Will this work?

Theoretically yes, however we don't recommend this way instead we'd recommend you taking the takeover approach we provide as described on https://leo-project.net/leofs/docs/admin/system_operations/cluster/#take-over-a-node (In short, add a new node and then copy contents from other storage nodes in the cluster)

If lose some data in future, can we recover that by replacing the data inside https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L48 with the backup

Theoretically yes, however needless to say, it's impossible to recover files inserted/updated/deleted between the prev backup and the new one in this way.

As of now, we've faced almost nothing data loss incident. Although a few users in our community had faced the data loss due to the software bug (already fixed in the latest 1.4.2) however as I said on the previous comment, if the compaction is not run yet then there still should be the lost contents inside the raw files LeoFS deals with so we were able to succeed in recovering the lost contents. I still believe your lost contents are still alive and will be able to recover so it would be great if you share the dumped files.

mocchira commented 5 years ago

@varuntanwar let me confirm what ${PATH_TO_FILE} you used just in case. The dumped files looks like below

194 201349472568129388035004242819842618096 test/43750  0   225 1542085764985347    2018-11-13 14:09:24 +0900   0
565 156689099056444311718732435081968791821 test/46875  0   239 1542085764980629    2018-11-13 14:09:24 +0900   0
950 279053908152492681737795699710683586281 test/15625  0   224 1542085764995193    2018-11-13 14:09:24 +0900   0
1320    234076597589249442808179939720948421859 test/43751  0   152 1542085765099917    2018-11-13 14:09:25 +0900   0
1618    234033039629873983006165090470339602065 test/1  0   8158    1542085765122587    2018-11-13 14:09:25 +0900   0
9918    295168669735862499360861024134310405371 test/9376   0   242 1542085765122841    2018-11-13 14:09:25 +0900   0
10305   312076637095363495485956169716896767658 test/31252  0   192 1542085765222310    2018-11-13 14:09:25 +0900   0

The third column is the file path and it's formatted in ${Bucket}/${Path} so you would have to grep something like "test/1234" in this case.

varuntanwar commented 5 years ago

Sounds wired. If possible, can you share the dumped files with us? I'd like to check if it's the file I'm supposed to and it's not corrupted.

I just checked that compaction was auto executed 2 days before. I guess we lost all the data

varuntanwar commented 5 years ago

s3cmd get s3://datastore/report/54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5
WARNING: Retrying failed request: /report/54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5 (EOF from S3!)
WARNING: Waiting 15 sec...
download: 's3://datastore/report/54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5' -> './54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5'  [1 of 1]
        0 of 31178240     0% in    0s     0.00 B/s  failed
ERROR: Download failed for: /report/54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5: Skipping that file.  This is usually a transient error, please try again later.

But when I checked in the dump I found the entry : leo_object_storage_1.20181113.11.1:8769476350 3767630959941716272835727232270768457 datastore/report/54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5 55242880 1541180514943738 2018-11-02 23:11:54 +0550 0

leo_object_storage_6.20181113.12.1:9052000820 193485395879674610491525457285422576883 datastore/report/54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5 031178240 1541180515156713 2018-11-02 23:11:55 +0550 0

leo_object_storage_6.20181113.12.1:10115402453 269579822033922407338350020469519806605 datastore/report/54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5 35242880 1541180514724582 2018-11-02 23:11:54 +0550 0

leo_object_storage_7.20181113.12.1:9866800783 59120570781876095962587087740670757949 datastore/report/54a0f261-f3c4-4f0b-ab33-c9b4615ea6e5 25

mocchira commented 5 years ago

But when I checked in the dump I found the entry :

Oh, then the procedure I share at the previous comment: https://github.com/leo-project/leofs/issues/1156#issuecomment-438145039 might work for you so please give it a try to recover the lost files.

varuntanwar commented 5 years ago

what is the best backup/restore strategy for leofs? I don't want to use another leofs cluster.

Will there be any impact on the cluster or on data if we add more storage nodes in the cluster with different storage size? Say right now I have 5 nodes in the cluster with 200 GB storage. I want to add more nodes but different storage size (500GB- 1TB) considering replication factor = 2.

mocchira commented 5 years ago

what is the best backup/restore strategy for leofs? I don't want to use another leofs cluster.

It depends on the use case and requirements however in almost all cases, I think no backup/restore is needed because all files in the leofs cluster are replicated in multiple storage nodes (It can be thought as it's always backed up). however as you faced, there is some possibility that data loss could happen due to unknown software bugs (not only leofs itself but any other underlying software stacks). In order to deal with such data loss incidents, leofs has adopted the append-only disk format to store user's contents (old contents keep remained inside raw files manged by leofs internally) so it allows us to recover any files at any point while the compaction is not run yet. so I'd recommend you turn off the auto-compaction and no backup/restore strategy.

Will there be any impact on the cluster or on data if we add more storage nodes in the cluster with different storage size? Say right now I have 5 nodes in the cluster with 200 GB storage. I want to add more nodes but different storage size (500GB- 1TB) considering replication factor = 2.

https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L67 will work for you. if you don't change this setting (default: 168) then

Set num_of_vnodes to 420 (168 * 2.5) on a storage with 500GB
Set num_of_vnodes to 840 (168 * 5) on a storage with 1TB

these settings enables leofs to utilize the disk space in accordance with its storage capacity.

varuntanwar commented 5 years ago

https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L67 will work for you. if you don't change this setting (default: 168)

What is the formula to calculate the value of num_of_vnodes? If I want to create a new cluster with 1 TB storage do we have to tweak the default value or is it only when storage nodes are of different size

I want to ask one more thing that, as of now both LeoFS manager master and slave both are running on same VM. But now we are increasing rapidly if we this VM goes down we will not be able to connect to the cluster. So can we change this configuration? Means Running manager master on one node and manager slave on another?

Can we change consistency.num_of_replicas to 3 now? will there be any impact on the data if I change the replication number and restart the manager node? if yes, what will be the correct steps to change it without impacting the cluster data?

mocchira commented 5 years ago

What is the formula to calculate the value of num_of_vnodes? If I want to create a new cluster with 1 TB storage do we have to tweak the default value or is it only when storage nodes are of different size

The default (168) should be suitable for typical cluster sizes (how many storage nodes the cluster consists of. we'd assume several dozen of servers) so that in almost all cases, you don't need to tweak the default unless you will try to deploy a very large scale cluster (ex. the cluster consists of over 100 servers).

If you want to know the detail of the formula to calculate the optimal num_of_vnodes then I'd recommend you read the article here: https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8

is it only when storage nodes are of different size

Yes if you have storage nodes with different size and want to utilize the disk space as much as possible according to each disk space.

I want to ask one more thing that, as of now both LeoFS manager master and slave both are running on same VM. But now we are increasing rapidly if we this VM goes down we will not be able to connect to the cluster. So can we change this configuration? Means Running manager master on one node and manager slave on another?

I think https://leo-project.net/leofs/docs/admin/system_admin/leo_manager/#case-2-launch-a-new-manager-masterslave-instead-of-a-collapsed-node-takeover work for your case. Please let us know if it doesn't work for you.

Although it might wander away from the subject, there is one thing I'd like to mention just in case.

if we this VM goes down we will not be able to connect to the cluster

To be precise, it isn't correct. You can connect to the cluster and operate any requests (GET/PUT/DELETE/HEAD) even if both of them go down. What you can't do during the both managers downed is

Unable to use leofs-adm so it's impossible to do any administrative tasks (recover/rebalance/get stats of the cluster and so on)

so there should be no effect for normal operations (handle PUT/GET/HEAD/DELETE requests from clients) during both managers downed.

varuntanwar commented 5 years ago

I think https://leo-project.net/leofs/docs/admin/system_admin/leo_manager/#case-2-launch-a-new-manager-masterslave-instead-of-a-collapsed-node-takeover work for your case. Please let us know if it doesn't work for you.

Ok I will try that.

I have one more question: Can we change consistency.num_of_replicas to 3 now? will there be any impact on the data if I change the replication number and restart the manager node? if yes, what will be the correct steps to change it without impacting the cluster data?

mocchira commented 5 years ago

Can we change consistency.num_of_replicas to 3 now? will there be any impact on the data if I change the replication number and restart the manager node? if yes, what will be the correct steps to change it without impacting the cluster data?

Unfortunately it's impossible to change consistency.num_of_replicas after starting the cluster. The only way to do so without impacting the current cluster data is

Building a new cluster with consistency.num_of_replicas set to 3
Copy all objects stored in the old (current) cluster to the new one
Switch VIP to the new one

varuntanwar commented 5 years ago

I think https://leo-project.net/leofs/docs/admin/system_admin/leo_manager/#case-2-launch-a-new-manager-masterslave-instead-of-a-collapsed-node-takeover work for your case. Please let us know if it doesn't work for you.

I tried to test this case. When I tried to execute leofs-adm dump-mnesia-data <absolute-path> I found out that I don't have dump-mnesia-data option. Options available are

Manager Maintenance:
          backup-mnesia <backup-filepath>
          restore-mnesia <backup-filepath>
          update-managers <manager-master> <manager-slave>
          dump-ring (<manager-node>|<storage-node>|<gateway-node>)
          update-log-level (<storage-node>|<gateway-node>) (debug|info|warn|error)
          update-consistency-level <write-quorum> <read-quorum> <delete-quorum>

I also tried to use the backup-mnesia, I am getting this error :

root@stg-leofs301:/home/varun.tanwar# leofs-adm backup-mnesia dump.dat
[ERROR] {'EXIT',{error,{file_error,"dump.dat.BUPTMP",eacces}}}

Also, tried to create the file dump.dat, change the owner to leofs and file permission to 777 and executed the command again but getting same error again.

When I stopped the leofs-manager-master on this node and tried to execute the backup-mnesia command, I am getting following error:

root@stg-leofs301:/home/varun.tanwar# leofs-adm backup-mnesia /root/dump.dat
Error: couldn't connect to LeoFS Manager

I tried to execute the command on leofs-manager-slave node, I am getting the same error

So, my point is if my manager node is down i.e vm got crashed I cannot execute the leofs-adm command. What should be the action item for me?

mocchira commented 5 years ago

@varuntanwar

When I tried to execute leofs-adm dump-mnesia-data I found out that I don't have dump-mnesia-data option.

Sorry for the inconvenience. it should be fixed to "backup-mnesia" as you tried. Thanks for catching it.

Also, tried to create the file dump.dat, change the owner to leofs and file permission to 777 and executed the command again but getting same error again.

Have you also changed the owner of a directory where the backup file is dumped OR change the permission 777 or something which enable leo_manager (running as leofs user) to dump the backup file? I'd suspect there seems to be something wrong around the permission.

So, my point is if my manager node is down i.e vm got crashed I cannot execute the leofs-adm command. What should be the action item for me?

Don't stop the manager node. keep it running and give it another try in a way as I mentioned above.

varuntanwar commented 5 years ago

Have you also changed the owner of a directory where the backup file is dumped OR change the permission 777 or something which enable leo_manager (running as leofs user) to dump the backup file? I'd suspect there seems to be something wrong around the permission.

The file is located in my home directory. /home/varun.tanwar/dump.dat I can not change the owner of my home directory or change the permission to 777. I have created a new directory /leofs and change ownership after that "backup-mnesia" successfully executed.

root@stg-leofs301:/home/varun.tanwar# leofs-adm backup-mnesia /leofs/dump.txt
OK

Now will try to execute the remaining steps.

Don't stop the manager node. keep it running and give it another try in a way as I mentioned above. This is what I wanted to ask what to do if my manager node crashed at any point of time. Case I : What to do if I don't have the backup-mnesia file

Case 2: Suppose I executes the backup-mnesia today and saves it somewhere else. After a month, my VM gets crashed, will creating a new vm and loading this mnesia file be sufficient for my cluster? Will there be any impact on data? What will happen to the files which will be created/deleted during this time period ( 1 month) ? What will happen if I add any nodes during this period and I haven't executed the "backup-mnesia" command after adding or removing any node? Am I suppose to execute the backup-mnesia command everytime I add or delete a node.?

mocchira commented 5 years ago

@varuntanwar Sorry for the long delay. I'm back now :)

Now will try to execute the remaining steps.

Good to hear that.

This is what I wanted to ask what to do if my manager node crashed at any point of time. Case I : What to do if I don't have the backup-mnesia file

Good point. If it's possible to restart the manager node then restart and issue backup-mnesia via leofs-adm. if not then unfortunately there is no clue at this point, however what we have to do for dumping the backup file is just calling https://github.com/leo-project/leofs/blob/391216c54e9868f5c7b2ef6fa5d2c6881b190535/apps/leo_manager/src/leo_manager_mnesia.erl#L762-L783 this function so it's relatively easy to implement the tool (executable command) which allow us to dump the backup file without running leo_manager. I'm going to file the issue to provide users with the tool.

Case 2: Suppose I executes the backup-mnesia today and saves it somewhere else. After a month, my VM gets crashed, will creating a new vm and loading this mnesia file be sufficient for my cluster? Will there be any impact on data? What will happen to the files which will be created/deleted during this time period ( 1 month) ? What will happen if I add any nodes during this period and I haven't executed the "backup-mnesia" command after adding or removing any node?

What leo_manager stores in mnesia are

member nodes in the cluster
ring information
s3 related records (users, endpoints and buckets)

That being said, there is no impact on data, however there is some impact if the member in the cluster is changed so

Am I suppose to execute the backup-mnesia command everytime I add or delete a node.?

You are right. it's best practice to take the backup whenever the cluster topology changes. I'm going to file the issue to provide this information on our official document.

varuntanwar commented 5 years ago

I ran out of space on the cluster so I have started the compaction on one of the node, cluster is up (read/writes are going). I hope compaction wouldn't affect data and writes.

My concern is: This node is having 175GB of data and ratio of active size is 71%. I have started the compaction with default settings: Num. of Targets: 8 Num of Comp Proc: 1

So it is happening very slowly. ( ~0.01% in 20 min).

My Question is: Can I stop the compaction? If yes, how can I do it?.

If not then, can I suspend the compaction and add 1 node on the cluster and do the rebalancing and start the compaction again on this node? Will there be any impact on the data?

How can I add one node in the cluster without impacting the compaction and data?

mocchira commented 5 years ago

@varuntanwar

Can I stop the compaction? If yes, how can I do it?.

Good question. The answer is yes. Since the compaction status (idle/running/suspending/resuming) is stored on memory, once you restart LeoStorage while it's suspending, the compaction will be stopped.

If not then, can I suspend the compaction and add 1 node on the cluster and do the rebalancing and start the compaction again on this node? Will there be any impact on the data?

Yes you can and also there should be no impact on any data. One thing you would have to care is

the compaction generate a new file (more precisely, it scans the existing file and copy all objects which delete flag is false to a new file then remove the (old) existing file so that being said, you would have to ensure the enough disk space for the new file before running the compaction

How can I add one node in the cluster without impacting the compaction and data?

Just stop the compaction as I described above and do the rebalance according to the official document.

varuntanwar commented 5 years ago

After adding a new node, executed the rebalancing command, how can I check that data transfer is completed (how to check that how much time it will take to redistribute the data). I have 1TB of data in the cluster.

As, I have to add another node. I want to make sure that data is successfully transferred. Or can I add the remove/add another node as rebalancing is done successfully will it impact the data even if data transfer is happening?

varuntanwar commented 5 years ago

I added the node to the cluster. As you suggested, I added the node with 500G drive (vnode = 420) But on this node, it has been more than 8 hours, this node got only 70G of data. Other nodes still have around 170-180G data (these nodes have 200G drive).

I checked the mq status of each nodes:

First nodes:

leofs-adm mq-stats storage_101: 
id                             |    state    | number of msgs | batch of msgs  |  interval      |                 description
leo_per_object_queue           |   running   | 34867          | 1600           | 500            | recover inconsistent objs
leo_rebalance_queue            |   running   | 843843         | 1600           | 500            | rebalance objs

Second Node:

leo_per_object_queue           |   running   | 10856          | 1600           | 500            | recover inconsistent objs
leo_rebalance_queue            |   running   | 1241511        | 1600           | 500            | rebalance objs

Third Node:

leo_per_object_queue           |   running   | 1063           | 1600           | 500            | recover inconsistent objs
leo_rebalance_queue            |   running   | 628580         | 1600           | 500            | rebalance objs

Fourth . Node:

leo_per_object_queue           |   running   | 7018           | 1600           | 500            | recover inconsistent objs
leo_rebalance_queue            |   running   | 961447         | 1600           | 500            | rebalance objs

Fifth Node:

leo_per_object_queue           |   running   | 1415           | 1600           | 500            | recover inconsistent objs
leo_rebalance_queue            |   running   | 478376         | 1600           | 500            | rebalance objs

Sixth Node:

leo_per_object_queue           |   running   | 1304           | 1600           | 500            | recover inconsistent objs
leo_rebalance_queue            |   running   | 386895         | 1600           | 500            | rebalance objs

As of now, writes are disabled, as 3 nodes started throwing 500 when storage gets filled (around 190G). What should be done here? Should I start writes? Will it go to 6th Node(with 500G) but second copy of file will go to other node? I also tried to add another node with 500G storage, but again we were getting 500s error so removed that node.

varuntanwar commented 5 years ago

I also see difference in ring hash number:

leofs-adm status
 [System Confiuration]
-----------------------------------+----------
 Item                              | Value
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.3.8
                        cluster Id | leofs_1
                             DC Id | dc_1
                    Total replicas | 2
          number of successes of R | 1
          number of successes of W | 1
          number of successes of D | 1
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | bddcbf06
                previous ring-hash | bddcbf06
-----------------------------------+----------

 [State of Node(s)]
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------
 type  |                   node                    |    state     |  current ring  |   prev ring    |          updated at
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_101@prd-leofs101      | running      | 23d9d7ea       | bddcbf06       | 2018-11-02 23:46:22 +0550
  S    | storage_102@prd-leofs102      | running      | bddcbf06       | bddcbf06       | 2018-08-06 13:48:16 +0550
  S    | storage_103@prd-leofs103      | running      | bddcbf06       | bddcbf06       | 2018-12-22 15:32:53 +0550
  S    | storage_104@prd-leofs104      | running      | bddcbf06       | bddcbf06       | 2018-11-02 22:57:35 +0550
  S    | storage_105@prd-leofs105      | running      | bddcbf06       | bddcbf06       | 2018-09-06 19:29:07 +0550
  S    | storage_106@prd-leofs106      | running      | bddcbf06       | bddcbf06       | 2018-12-22 14:59:30 +0550
  G    | gateway_102@prd-leofs102      | running      | bddcbf06       | bddcbf06       | 2018-08-06 13:48:45 +0550
  G    | gateway_103@prd-leofs103      | running      | bddcbf06       | bddcbf06       | 2018-12-22 19:49:19 +0550
  G    | gateway_105@prd-leofs105      | running      | bddcbf06       | bddcbf06       | 2018-11-27 16:53:48 +0550
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------

I checked the issue 1078. Should I follow same steps to resolve this? But that is for resolving ring for Manager nodes, what should I do for storage node?

mocchira commented 5 years ago

@varuntanwar Sorry for the late reply.

Now I don't know what the current status of your cluster is, so I'd like to answer the each question.

After adding a new node, executed the rebalancing command, how can I check that data transfer is completed (how to check that how much time it will take to redistribute the data).

As you may already get to know, leofs-adm mq-stats would tell you the progress. In case of rebalance, you can see the progress at leo_per_object_queue and leo_rebalance_queue. After starting the rebalance, both will start to increase. Once both get close to zero, you can think of the rebalance as being finished.

Or can I add the remove/add another node as rebalancing is done successfully will it impact the data even if data transfer is happening?

In theory you can, however I don't recommend doing so as it can place a heavy load on a cluster instead I'd recommend you wait for the first rebalance.

As of now, writes are disabled, as 3 nodes started throwing 500 when storage gets filled (around 190G). What should be done here? Should I start writes? Will it go to 6th Node(with 500G) but second copy of file will go to other node?

It seems that as you can see on mq-stats, the rebalance is still not finished yet and the reason why the rebalance takes so much time probably would be that each storage get filled. Even you disable the write operations from clients, the rebalance itself would need a certain disk space to store messages in the rebalance queue so you would have to ensure that disk space somehow in order to finish the rebalance successfully.

Should I follow same steps to resolve this? But that is for resolving ring for Manager nodes, what should I do for storage node?

For the inconsistent RING on LeoStorage, you can use leofs-adm recover-ring.

varuntanwar commented 5 years ago

It seems that as you can see on mq-stats, the rebalance is still not finished yet and the reason why the rebalance takes so much time probably would be that each storage get filled. Even you disable the write operations from clients, the rebalance itself would need a certain disk space to store messages in the rebalance queue so you would have to ensure that disk space somehow in order to finish the rebalance successfully.

How much disk should be available to ensure that rebalancing does not stuck and we dont get 500 while uploading the files as space was still available? Before I started the rebalancing, I had 5 nodes with 200G drive. Out of which 2 had 24-25G space and remaining 3 had 10-12G space available. Total space I had : 70-80G. But the disk with 10-12G started returning 500s.

For the inconsistent RING on LeoStorage, you can use leofs-adm recover-ring.
varun.tanwar@prd-leofs101:~$ leofs-adm recover-ring storage_101@prd-leofs101
OK

For a moment it shows the correct ring ID:

[State of Node(s)]
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------
 type  |                   node                    |    state     |  current ring  |   prev ring    |          updated at
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_101@prd-leofs101      | running      | bddcbf06       | bddcbf06       | 2018-12-24 13:23:58 +0550
  S    | storage_102@prd-leofs102     | running      | bddcbf06       | bddcbf06       | 2018-08-06 13:48:16 +0550

But again after that:

[State of Node(s)]
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------
 type  |                   node                    |    state     |  current ring  |   prev ring    |          updated at
-------+-------------------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_101@prd-leofs101      | running      | 23d9d7ea       | bddcbf06       | 2018-12-24 13:23:58 +0550

After executing recover command, Each time I execute leofs-adm status, I am getting different ring ID 1-2 times out of 10 , I got current ring = bddcbf06 //which is correct. Otherwise I am getting 23d9d7ea as current ring.

varuntanwar commented 5 years ago

This time, we added only 1 node and ran the rebalance. We lost around 12k file. As you suggested last time, we can do the "leofs-adm diagnose-start" but it will run compaction in the background which I dont want [running comapaction], as I dont know how it will impact as rebalancing is still going.

mocchira commented 5 years ago

@varuntanwar

How much disk should be available to ensure that rebalancing does not stuck and we dont get 500 while uploading the files as space was still available? Before I started the rebalancing, I had 5 nodes with 200G drive. Out of which 2 had 24-25G space and remaining 3 had 10-12G space available. Total space I had : 70-80G. But the disk with 10-12G started returning 500s.

LeoFS starts to respond 500 to clients once there is not enough disk space to execute compaction safely as described on https://github.com/leo-project/leofs/issues/592. So the free space not returning 500 depends on the most largest AVS file (files suffixed by .avs are used to store objects inside LeoFS) on the server. That being said, the most largest AVS file size on your server would probably be around 10-12GB now.

So you would have to run the compaction or delete some files on the server to ensure enough disk space for getting the system back to the normal status.

After executing recover command, Each time I execute leofs-adm status, I am getting different ring ID 1-2 times out of 10 , I got current ring = bddcbf06 //which is correct. Otherwise I am getting 23d9d7ea as current ring.

This is wired. Something bad may be happening on your system so can you share the dump files generated by leofs-adm dump-ring $ServerName on storage_101@prd-leofs101 and another healthy storage node? (files will be dumped under the path specified by https://github.com/leo-project/leofs/blob/v1/apps/leo_storage/priv/leo_storage.conf#L328)

We lost around 12k file. As you suggested last time, we can do the "leofs-adm diagnose-start" but it will run compaction in the background which I dont want [running comapaction], as I dont know how it will impact as rebalancing is still going.

Although executing diagnose-start show you the status "running compaction", the compaction actually never happens (this is a known bug showing a incorrect message to the display, For more detail, please refer to https://github.com/leo-project/leofs/issues/510) so you can issue diagnose-start without impacting any data on your system (it's read only operation just scanning raw AVS files and make a list including what files are included in the raw files and where they are stored (offset in byte) and so on).

varuntanwar commented 5 years ago

This is wired. Something bad may be happening on your system so can you share the dump files generated by leofs-adm dump-ring $ServerName on storage_101@prd-leofs101 and another healthy storage node?

Output of Mem_cur.dump On storage_101@prd-leofs101:

{member,'storage_101@prd-leofs101',"node_a3ea23ad",
       "prd-leofs101",13077,ipv4,1525341531470566,running,168,[],
       []}.
{member,'storage_102@prd-leofs102',"node_12cca542",
       "prd-leofs102",13077,ipv4,1525341217625015,running,168,[],
       []}.
{member,'storage_103@prd-leofs103',"node_727b8803",
       "prd-leofs103",13077,ipv4,1525341735745309,running,168,[],
       []}.
{member,'storage_104@prd-leofs104',"node_19250343",
       "prd-leofs104",13077,ipv4,1525342255709516,running,168,[],
       []}.
{member,'storage_105@prd-leofs105',"node_93e52965",
       "prd-leofs105",13077,ipv4,1525342210708889,running,168,[],
       []}.
{member,'storage_106@prd-leofs106',"node_1840182a",
       "prd-leofs106",13077,ipv4,1545637777653420,running,420,[],
       []}.

On storage_104@prd-leofs104:

{member,'storage_101@prd-leofs101',"node_a3ea23ad",
        "prd-leofs101",13077,ipv4,1525341531470566,running,168,[],
        []}.
{member,'storage_102@prd-leofs102',"node_12cca542",
        "prd-leofs102",13077,ipv4,1525341217625015,running,168,[],
        []}.
{member,'storage_103@prd-leofs103',"node_727b8803",
        "prd-leofs103",13077,ipv4,1525341735745309,running,168,[],
        []}.
{member,'storage_104@prd-leofs104',"node_19250343",
        "prd-leofs104",13077,ipv4,1525342255709516,running,168,[],
        []}.
{member,'storage_105@prd-leofs105',"node_93e52965",
        "prd-leofs105",13077,ipv4,1525342210708889,running,168,[],
        []}.
{member,'storage_106@prd-leofs106',"node_1840182a",
        "prd-leofs106",13075,ipv4,1545470970510719,running,420,[],
        []}.

Should I share other files also?

mocchira commented 5 years ago

Should I share other files also?

Yes please share all dumped files. (as some of them tend to be relatively large, it would be good to attach files by dragging and dropping files rather than copying and pasting the file contents.

varuntanwar commented 5 years ago

Storage101 -> unhealthy node storage104 -> healthy node

storage101_dump.tar.zip storage104_dump.tar.zip

mocchira commented 5 years ago

@varuntanwar Thanks. the big difference between 101 and 104 is

leofs@cat2neat:leofs.dev$ diff storage101_dump/members_cur.dump.63712942729 storage104_dump/members_cur.dump.63712942735
17c17
<         "prd-leofs106",13077,ipv4,1545637777653420,running,420,[],
---
>         "prd-leofs106",13075,ipv4,1545470970510719,running,420,[],

This means that 101 recognizes 106 storage node running on port 13077 while 104 recognizes 106 running on port 13075. Since a different port number can make a different ring hash, this is the reason why 101 ring hash is broken.

Let me confirm that

which port number is correct? (On prd-leofs106 which port number is opened by LeoStorage?)
Had you ever run LeoStorage on 13077 at first then have you run LeoStorage on 13075 (or vise versa)?

and also just in case can you share the dump files on LeoManager master node?

P.S. fortunately there is no differences in other dump files including ring table information so routing (filename, storage nodes mapping) should be working properly.

varuntanwar commented 5 years ago

which port number is correct? (On prd-leofs106 which port number is opened by LeoStorage?)

tcp 0 0 0.0.0.0:13077 0.0.0.0:* LISTEN 30155/beam.smp

Had you ever run LeoStorage on 13077 at first then have you run LeoStorage on 13075 (or vise versa)?

No always used the default configs

and also just in case can you share the dump files on LeoManager master node?

{member,'storage_101@prd-leofs101',"node_a3ea23ad",
"prd-leofs101",13077,ipv4,1525341531470566,running,168,[],
[]}.
{member,'storage_102@prd-leofs102',"node_12cca542",
"prd-leofs102",13077,ipv4,1525341217625015,running,168,[],
[]}.
{member,'storage_103@prd-leofs103',"node_727b8803",
"prd-leofs103",13077,ipv4,1525341735745309,running,168,[],
[]}.
{member,'storage_104@prd-leofs104',"node_19250343",
"prd-leofs104",13077,ipv4,1525342255709516,running,168,[],
[]}.
{member,'storage_105@prd-leofs105',"node_93e52965",
"prd-leofs105",13077,ipv4,1525342210708889,running,168,[],
[]}.
{member,'storage_106@prd-leofs106',"node_1840182a",
"prd-leofs106",13075,ipv4,1545470970510719,running,420,[],
[]}.

Manager is also having wrong port number.

Right now, rebalancing is under process. We found out that few of the files(around 1k) are in deleted state when we execute whereis command. Is there any way to get those files by removing the delete marker from those files. (We are not sure about running diagnose command and to retrieve each file)

mocchira commented 5 years ago

@varuntanwar Thanks for the reply.

tcp 0 0 0.0.0.0:13077 0.0.0.0:* LISTEN 30155/beam.smp

Got it.

No always used the default configs Manager is also having wrong port number.

Hmm. so there are no port configurations set to 13075 across you cluster, right? then it seems to be a symptom something has been broken so I'm going to investigate around it and confirm whether or not this wrong port number can affect the system. Once I find out something, I'll get back to you.

Right now, rebalancing is under process. We found out that few of the files(around 1k) are in deleted state when we execute whereis command. Is there any way to get those files by removing the delete marker from those files. (We are not sure about running diagnose command and to retrieve each file)

Does it mean you don't understand how to use diagnose-start command and retrieve the lost files based on the output generated by diagnose-start right?

https://github.com/leo-project/leofs/issues/1156#issuecomment-438145039

Have you check the above comment I dropped around one month ago? if no please check it out as this is the procedure to recover lost files using diagnose.

Let me know if you have any question on the procedure I wrote on the above comment.

varuntanwar commented 5 years ago

Have you check the above comment I dropped around one month ago?

I checked that comment. I understood the process. Thing is, I have to do this for each file (as of now it is more than 1K). Is there anything what will remove the delete marker of the file so that I dont have to go to each node to get file manually.( I know that Leofs does not support versioning ). And most of the files are more than 5MB. Please tell the steps to recover these files.

We found out that few of the files(around 1k) are in deleted state

This is after we started the rebalancing. Before that, these files are accessible. We dont have any record of deleting those files. Is it because rebalancing is still under process? Will those file become accessible after rebalancing completes?

it seems to be a symptom something has been broken so I'm going to investigate around it and confirm whether or not this wrong port number can affect the system

Will this effect any read or write on the system?

mocchira commented 5 years ago

@varuntanwar

Is there anything what will remove the delete marker of the file so that I dont have to go to each node to get file manually.( I know that Leofs does not support versioning ).

Then unfortunately the answer is No.

And most of the files are more than 5MB. Please tell the steps to recover these files.

As mentioned on the prev comment, the only way to recover these files is to issue diagnose-start and retrieve the content from AVS by hand (or some tool to do the same thing I described at that comment)

This is after we started the rebalancing. Before that, these files are accessible. We dont have any record of deleting those files. Is it because rebalancing is still under process? Will those file become accessible after rebalancing completes?

I'm still not sure about it because there are three wired things I can't still figure out.

RING hash broken due to the wrong port
the port number was not set through your configuration

however If I understand your statements correctly, you did "attache rebalance and detach rebalance subsequently although the prev rebalance was still not finished yet? then I think this is probably the reason why some files are inaccessible after rebalance. rebalance should NOT be issued until the previous rebalance is finished.

How to recover large files

If you want fo recover files which size is larger than 5MB then your would have to find out all chunks which consist of the whole file. the file chunk has non-zero value at the forth column (Child num) according to https://leo-project.net/leofs/docs/admin/system_operations/data/#diagnosis so let's say you have a 14MB file named ${PATH_TO_FILE} then there are three lines across files generated by leofs-adm diagnose-start (the lines can be scattered across the multiple files) like below

Filename is set to ${PATH_TO_FILE} and Childnum is set to 0 in 0.avs on Storage01
Filename is set to ${PATH_TO_FILE} and Childnum is set to 1 in 2.avs on Storage02
Filename is set to ${PATH_TO_FILE} and Childnum is set to 2 in 7.avs on Storage04

then concat those three chunks in Childnum order using cat command provided by almost all xnix platform That's all for recovering the large file. Sorry for the inconvenience however this is the only way to do so for now (we are going to provide the tool which do the same thing automatically for you in the near future (probably))

mocchira commented 5 years ago

oh I forgot the last step to recover a large file.

Lastly you would have to PUT the recovered file through some s3 client.

leo-project / leofs

leofs-storage {cause,not_found} #1156

1157

1158

1159

Steps