tnatanael closed this issue 5 years ago
@tnatanael What you'd have to do to delete a file physically is "compact-start", as described here: http://leo-project.net/leofs/docs/admin/system_operations/data/#how-to-operate-data-compaction
Please check the doc above for more details.
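As a sketch, the compaction flow from those docs looks roughly like the following. The node name is a placeholder, and leofs-adm is stubbed out here so the snippet runs standalone; on a real cluster, drop the stub and use the actual CLI:

```shell
#!/usr/bin/env bash
# Sketch of the data-compaction flow. NOTE: leofs-adm is STUBBED below so
# this snippet is self-contained; remove the stub on a real cluster.
leofs-adm() {
  case "$1" in
    compact-start)  echo "OK" ;;                    # stubbed reply
    compact-status) echo "current status: idle" ;;  # stubbed reply
  esac
}

NODE="storage_0@127.0.0.1"   # placeholder: use your own storage node name

# 1) Start compaction on all target containers of the node:
leofs-adm compact-start "$NODE" all

# 2) Poll until the node is idle again before checking the object:
until leofs-adm compact-status "$NODE" | grep -q "idle"; do
  sleep 5
done
echo "compaction finished on $NODE"
```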
I tried with leofs-adm compact-start; it says OK, but the file persists, even after waiting for the process to finish.
Let us know your LeoFS' error log and the state of the large object:
leofs-adm whereis <OBJECT_NAME>
How can I discover the object name? Is it the filename of the original file?
When I run compaction, and when I try to delete the file using the S3 API, this error message pops up in the log.
> Is it the filename of the original file?
Exactly, leofs-adm whereis <file-path>
I tried with only the filename: leofs-adm whereis 1000mb_1
And bucket + filename: leofs-adm whereis teste-thiago/1000mb_1
Both options say [ERROR] Could not get ring
I am 100% sure that the file was corrupted due to disk failures, but it needs to be cleared, either by a manual delete or automatically by the cluster in some way.
I understand that your LeoFS RING (routing table) is broken, so let me know the current state of the system. Can you share the result of leofs-adm status and the operation history up to this day?
Sure!!!
It is a test cluster; I am simulating a disk failure we experienced in production.
How can I get the operation history?
> How can I get the operation history?
$ history | grep leofs-adm
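In other words, grep just filters your shell history for the relevant commands. A self-contained illustration (the printf lines stand in for real `history` output):

```shell
# Filter the commands of interest out of shell history.
# printf simulates `history` output so this runs anywhere.
printf '%s\n' \
  "  101  leofs-adm status" \
  "  102  ls -la" \
  "  103  leofs-adm compact-start storage_0@127.0.0.1 all" \
  | grep leofs-adm
```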
Do you want a TeamViewer session?
I've just clearly understood that LeoManager's RING is broken. I'm going to consider how to recover the system.
OK... just to note: the cluster is still working; I am uploading and removing new files right now. Only this file is undeletable...
TO: @mocchira your opinion will be much appreciated.
Hi guys! Can this ticket be labelled as a bug instead of a question?
Please do the following only if you understand that we may NOT be able to restore your system completely. I have considered how to recover your LeoManager's RING, as below:
leofs-adm start
leofs-adm status
If you succeed with the procedure, you can then execute the data-compaction command.
I'd like to share an example run of the procedure for recovering LeoManager's RING below.
$ leofs-adm status
[System Confiuration]
-----------------------------------+----------
Item | Value
-----------------------------------+----------
Basic/Consistency level
-----------------------------------+----------
system version | 1.5.0
cluster Id | leofs_1
DC Id | dc_1
Total replicas | 2
number of successes of R | 1
number of successes of W | 1
number of successes of D | 1
number of rack-awareness replicas | 0
ring size | 2^128
-----------------------------------+----------
Multi DC replication settings
-----------------------------------+----------
[mdcr] max number of joinable DCs | 2
[mdcr] total replicas per a DC | 1
[mdcr] number of successes of R | 1
[mdcr] number of successes of W | 1
[mdcr] number of successes of D | 1
-----------------------------------+----------
Manager RING hash
-----------------------------------+----------
current ring-hash |
previous ring-hash |
-----------------------------------+----------
[State of Node(s)]
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
type | node | state | rack id | current ring | prev ring | updated at
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
S | storage_0@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:10:33 +0900
S | storage_1@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:10:33 +0900
S | storage_2@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:10:33 +0900
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
$ ./package/leo_manager_0/bin/leo_manager stop
ok
$ ./package/leo_manager_1/bin/leo_manager stop
ok
$ ./package/leo_gateway_0/bin/leo_gateway stop
ok
$ ./package/leo_storage_0/bin/leo_storage stop
ok
$ ./package/leo_storage_1/bin/leo_storage stop
ok
$ ./package/leo_storage_2/bin/leo_storage stop
ok
$ tar czf leo_manager_0_backup.ta.gz ./package/leo_manager_0/
$ tar czf leo_manager_1_backup.ta.gz ./package/leo_manager_1/
$ ls -la | grep backup.ta.gz
-rw-r--r-- 1 yosukehara staff 15435040 May 23 10:12 leo_manager_0_backup.ta.gz
-rw-r--r-- 1 yosukehara staff 15429047 May 23 10:12 leo_manager_1_backup.ta.gz
## manager_0:
$ rm -rf ./package/leo_manager_0/work/mnesia/*
## manager_1:
$ rm -rf ./package/leo_manager_1/work/mnesia/*
$ ./package/leo_manager_0/bin/leo_manager start
$ ./package/leo_manager_1/bin/leo_manager start
$ ./package/leo_storage_0/bin/leo_storage start
$ ./package/leo_storage_1/bin/leo_storage start
$ ./package/leo_storage_2/bin/leo_storage start
$ leofs-adm status
[System Confiuration]
-----------------------------------+----------
Item | Value
-----------------------------------+----------
Basic/Consistency level
-----------------------------------+----------
system version | 1.5.0
cluster Id | leofs_1
DC Id | dc_1
Total replicas | 2
number of successes of R | 1
number of successes of W | 1
number of successes of D | 1
number of rack-awareness replicas | 0
ring size | 2^128
-----------------------------------+----------
Multi DC replication settings
-----------------------------------+----------
[mdcr] max number of joinable DCs | 2
[mdcr] total replicas per a DC | 1
[mdcr] number of successes of R | 1
[mdcr] number of successes of W | 1
[mdcr] number of successes of D | 1
-----------------------------------+----------
Manager RING hash
-----------------------------------+----------
current ring-hash |
previous ring-hash |
-----------------------------------+----------
[State of Node(s)]
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
type | node | state | rack id | current ring | prev ring | updated at
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
S | storage_0@127.0.0.1 | attached | | | | 2019-05-23 10:14:00 +0900
S | storage_1@127.0.0.1 | attached | | | | 2019-05-23 10:14:03 +0900
S | storage_2@127.0.0.1 | attached | | | | 2019-05-23 10:14:05 +0900
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
$ leofs-adm start
Generating RING...
Generated RING
OK 33% - storage_0@127.0.0.1
OK 67% - storage_2@127.0.0.1
OK 100% - storage_1@127.0.0.1
OK
$ ./package/leo_gateway_0/bin/leo_gateway start
$ ./leofs-adm status
[System Confiuration]
-----------------------------------+----------
Item | Value
-----------------------------------+----------
Basic/Consistency level
-----------------------------------+----------
system version | 1.5.0
cluster Id | leofs_1
DC Id | dc_1
Total replicas | 2
number of successes of R | 1
number of successes of W | 1
number of successes of D | 1
number of rack-awareness replicas | 0
ring size | 2^128
-----------------------------------+----------
Multi DC replication settings
-----------------------------------+----------
[mdcr] max number of joinable DCs | 2
[mdcr] total replicas per a DC | 1
[mdcr] number of successes of R | 1
[mdcr] number of successes of W | 1
[mdcr] number of successes of D | 1
-----------------------------------+----------
Manager RING hash
-----------------------------------+----------
current ring-hash | d5d667a6
previous ring-hash | d5d667a6
-----------------------------------+----------
[State of Node(s)]
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
type | node | state | rack id | current ring | prev ring | updated at
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
S | storage_0@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:14:16 +0900
S | storage_1@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:14:16 +0900
S | storage_2@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:14:16 +0900
G | gateway_0@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:14:32 +0900
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
The important thing is that the values of current ring and prev ring before recovery and after recovery are the same: current ring: d5d667a6, prev ring: d5d667a6.
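A note on why the tar step above matters: wiping work/mnesia is destructive, and the backup is what makes the procedure reversible. A minimal self-contained sketch of the back-up/wipe/restore pattern, using a throwaway directory in place of ./package/leo_manager_0/work/mnesia:

```shell
#!/usr/bin/env bash
set -eu
# Demonstrates the back-up/wipe/restore pattern from the procedure above,
# using a throwaway directory instead of a real leo_manager work directory.
workdir="$(mktemp -d)"
mkdir -p "$workdir/mnesia"
echo "ring-data" > "$workdir/mnesia/table.dat"

# 1) Back up BEFORE wiping anything:
tar czf "$workdir/mnesia_backup.tar.gz" -C "$workdir" mnesia

# 2) Wipe the (broken) state -- the equivalent of rm -rf .../work/mnesia/*:
rm -rf "$workdir/mnesia"

# 3) If recovery goes wrong, the backup restores the previous state:
tar xzf "$workdir/mnesia_backup.tar.gz" -C "$workdir"
cat "$workdir/mnesia/table.dat"   # the original contents are back
```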
I'll try this tomorrow and report back, but I am wondering why this happens. Is it expected behaviour? Thanks for now!
Sorry for the delay... it worked. After that procedure I was able to delete the file. Thanks!
@yosukehara I tried to follow your instructions but ran into a problem. After I removed the mnesia folder and restarted all LeoFS services, all users and buckets disappeared.
So I restored the mnesia folder on leo_manager and everything came back, but the RING is broken again.
Can you suggest how we can fix this problem?
Thanks
Hi guys, I created a simple cluster with 2 storage nodes. After uploading a 1 GB file and running the cluster for 1 week, I am not able to delete this file; the delete operation reports success, but the file persists...
What I tried: recover-node, recover-disk, recover-consistency
I worry that when I put the cluster into the production environment, with many more files, this would be a very annoying bug, so please help me.