lizardfs / lizardfs

LizardFS is an Open Source Distributed File System licensed under GPLv3.
http://lizardfs.com

serious structure inconsistency: (chunkid:0000000000000019) #480

Open ghost opened 8 years ago

ghost commented 8 years ago

I tried to configure a fresh LizardFS install (on Arch Linux, LizardFS version 3.10.2) and got this error:

Sep 18 12:49:40 desktop mfsmaster[30132]: serious structure inconsistency: (chunkid:0000000000000019)

Before that error, I had noticed the following:

Sep 18 12:25:23 desktop mfsmaster[14694]: chunk_delete_file_int: Trying to remove non-existent goal: 1
Sep 18 12:25:23 desktop mfsmaster[14694]: structure error - chunk 0000000000000001 not found (inode: 5 ; index: 0)
Sep 18 12:25:23 desktop mfsmaster[14694]: chunk_delete_file_int: Trying to remove non-existent goal: 1
Sep 18 12:25:23 desktop mfsmaster[14694]: structure error - chunk 0000000000000002 not found (inode: 5 ; index: 1)
Sep 18 12:25:37 desktop mfsmaster[14694]: serious structure inconsistency: (chunkid:0000000000000005)

And also:

Sep 18 14:24:11 desktop mfsmaster[13498]: release: session not found

Any idea about what caused that issue?

Thank you.

PS: I reinstalled LizardFS and got the same issue :( (FYI, LizardFS was compiled on Arch Linux with this PKGBUILD: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=lizardfs )

nanowish commented 8 years ago

Same issue here with the Debian package 3.10.2+dfsg-1; I can't write any files.

Sep 25 15:22:33 clbk-master mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:22:33 clbk-master mfsmaster[15936]: mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:22:43 clbk-master mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:22:43 clbk-master mfsmaster[15936]: mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:22:53 clbk-master mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:22:53 clbk-master mfsmaster[15936]: mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:23:03 clbk-master mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:23:03 clbk-master mfsmaster[15936]: mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:23:13 clbk-master mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:23:13 clbk-master mfsmaster[15936]: mfsmaster[15936]: serious structure inconsistency: (chunkid:00000000000000AD)
Sep 25 15:23:22 clbk-master mfsmaster[15936]: release: session not found
Sep 25 15:23:22 clbk-master mfsmaster[15936]: mfsmaster[15936]: release: session not found

nanowish commented 8 years ago

Same error with 3.10.0+dfsg-1; it works great with 3.9.4+dfsg-5~bpo8+1.

MPaszkiewicz commented 8 years ago

@nanowish On which version of Debian did you get the error?

nanowish commented 8 years ago

All servers in the cluster run a freshly installed Debian Jessie.

ghost commented 8 years ago

In my case: Arch Linux / LizardFS 3.10.2

blink69 commented 8 years ago

I put the bug label on this, but I cannot reproduce it in our lab. I need more details:

nanowish commented 8 years ago

4 chunks, 5 files. It's a fresh install with 1 master, 2 chunkservers, 1 metalogger on chunk1, and cgi-srv on chunk2, with only 1 client connected. All machines run Debian 8.6, newly installed for the cluster.

I tried again with 3.10.2 and got the same error; it works great with version 3.9.

blink69 commented 8 years ago

A patch for that is in our CR: http://cr.skytechnology.pl:8081/#/c/2720/

guestisp commented 7 years ago

I'm getting the same:

Apr 23 21:03:16 ale-XPS13 mfsmount[17493]: write file error, inode: 11, index: 0 - error sent by master server (Chunk lost) (try counter: 19)
Apr 23 21:03:26 ale-XPS13 mfsmaster[921]: mfsmaster[921]: serious structure inconsistency: (chunkid:000000000000000A)
Apr 23 21:03:26 ale-XPS13 mfsmaster[921]: serious structure inconsistency: (chunkid:000000000000000A)
Apr 23 21:03:26 ale-XPS13 mfsmount[17493]: write file error, inode: 11, index: 0 - error sent by master server (Chunk lost) (try counter: 20)
Apr 23 21:03:36 ale-XPS13 mfsmaster[921]: mfsmaster[921]: serious structure inconsistency: (chunkid:000000000000000A)
Apr 23 21:03:36 ale-XPS13 mfsmaster[921]: serious structure inconsistency: (chunkid:000000000000000A)
Apr 23 21:03:36 ale-XPS13 mfsmount[17493]: write file error, inode: 11, index: 0 - error sent by master server (Chunk lost) (try counter: 21)
Apr 23 21:03:47 ale-XPS13 mfsmaster[921]: mfsmaster[921]: serious structure inconsistency: (chunkid:000000000000000A)
Apr 23 21:03:47 ale-XPS13 mfsmaster[921]: serious structure inconsistency: (chunkid:000000000000000A)
Apr 23 21:03:47 ale-XPS13 mfsmount[17493]: write file error, inode: 11, index: 0 - error sent by master server (Chunk lost) (try counter: 22)
$ dpkg -l | grep lizard
ii  lizardfs-adm                                                3.10.4+dfsg-3                               amd64        LizardFS - administration tools
ii  lizardfs-cgi                                                3.10.4+dfsg-3                               all          LizardFS - CGI monitor
ii  lizardfs-cgiserv                                            3.10.4+dfsg-3                               amd64        simple CGI-capable HTTP server to run LizardFS CGI monitor
ii  lizardfs-chunkserver                                        3.10.4+dfsg-3                               amd64        LizardFS - data server
ii  lizardfs-client                                             3.10.4+dfsg-3                               amd64        LizardFS - client tools and mount utility
ii  lizardfs-common                                             3.10.4+dfsg-3                               all          LizardFS - common files
ii  lizardfs-master                                             3.10.4+dfsg-3                               amd64        LizardFS - master server
$ lsb_release -a
LSB Version:    core-9.20160110ubuntu5-amd64:core-9.20160110ubuntu5-noarch:security-9.20160110ubuntu5-amd64:security-9.20160110ubuntu5-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 17.04
Release:    17.04
Codename:   zesty

guestisp commented 7 years ago

Any hints? Wasn't this supposed to be fixed in 3.10.4?

I would like to try Lizard, but I'm unable to create any files (or rather, it seems I'm unable to change files: I can create a new file, but I can't make changes to it).

psarna commented 7 years ago

Apr 23 21:03:26 ale-XPS13 mfsmount[17493]: write file error, inode: 11, index: 0 - error sent by master server (Chunk lost) (try counter: 20)

This roughly means that the master server has information about a chunk, but no file actually has that chunk. It is indeed a structure inconsistency, and it's not easy to reproduce. One way is to load metadata from a very outdated or very corrupt metadata file.

OK, so these logs appear when you try to write to a certain file. First, apart from the logs, please provide the output of lizardfs fileinfo PATH_TO_AFFECTED_FILE. Second, search the chunk directories to see whether these chunks are indeed missing, with something like find PATH_TO_CHUNKSERVER_HDD -name "*000000000000000A*". Third, if this is only a test installation and the files are not confidential, it would probably help if you provided the affected metadata.mfs file from the /var/lib directory.
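To avoid copying chunk IDs by hand, the affected IDs can be pulled out of the master log first. The following is only a minimal Python sketch, not part of LizardFS: the function names are hypothetical, and the regular expression assumes the message format exactly as it appears in the logs quoted in this thread.

```python
import re

# Message format copied from the mfsmaster log lines quoted in this thread.
INCONSISTENCY_RE = re.compile(
    r"serious structure inconsistency: \(chunkid:([0-9A-Fa-f]{16})\)"
)


def affected_chunk_ids(log_lines):
    """Return the distinct chunk IDs (uppercase hex) in first-seen order."""
    seen = []
    for line in log_lines:
        match = INCONSISTENCY_RE.search(line)
        if match:
            chunk_id = match.group(1).upper()
            if chunk_id not in seen:
                seen.append(chunk_id)
    return seen


def find_name_pattern(chunk_id):
    """Build the -name glob for: find PATH_TO_CHUNKSERVER_HDD -name PATTERN."""
    return f"*{chunk_id}*"


sample_log = [
    "Apr 23 21:03:26 ale-XPS13 mfsmaster[921]: mfsmaster[921]: "
    "serious structure inconsistency: (chunkid:000000000000000A)",
    "Apr 23 21:03:26 ale-XPS13 mfsmaster[921]: "
    "serious structure inconsistency: (chunkid:000000000000000A)",
    "Apr 23 21:03:26 ale-XPS13 mfsmount[17493]: write file error, inode: 11, "
    "index: 0 - error sent by master server (Chunk lost) (try counter: 20)",
]

for chunk_id in affected_chunk_ids(sample_log):
    # The decimal value corresponds to the "id:" field printed by lizardfs fileinfo.
    print(chunk_id, int(chunk_id, 16), find_name_pattern(chunk_id))
```

Each printed pattern can then be passed to find on every chunkserver disk, as described above.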

Finally, is it easy for you to reproduce this behaviour on a fresh installation? If so, can you provide steps to do that?

Also, we'll check whether this is reproducible only on Ubuntu 17; maybe it's some library inconsistency.

guestisp commented 7 years ago

Yes, I'm able to reproduce this behaviour on a fresh install. This is a test "cluster" (1 chunkserver) where I'm trying to create a large blank file (500 MB) and then create a filesystem inside it.

The blank file is created properly; everything else fails.

x@ale-XPS13:/mnt/lizard$ dd if=/dev/zero of=test.img bs=1M count=500
500+0 record in
500+0 record out
524288000 bytes (524 MB, 500 MiB) copied, 3,29313 s, 159 MB/s

x@ale-XPS13:/mnt/lizard$ lizardfs fileinfo test.img 
test.img:
    chunk 0: 000000000000001B_00000001 / (id:27 ver:1)
        copy 1: 192.168.1.84:9422:_
    chunk 1: 000000000000001C_00000001 / (id:28 ver:1)
        copy 1: 192.168.1.84:9422:_
    chunk 2: 000000000000001D_00000001 / (id:29 ver:1)
        copy 1: 192.168.1.84:9422:_
    chunk 3: 000000000000001E_00000001 / (id:30 ver:1)
        copy 1: 192.168.1.84:9422:_
    chunk 4: 000000000000001F_00000001 / (id:31 ver:1)
        copy 1: 192.168.1.84:9422:_
    chunk 5: 0000000000000020_00000001 / (id:32 ver:1)
        copy 1: 192.168.1.84:9422:_
    chunk 6: 0000000000000021_00000001 / (id:33 ver:1)
        copy 1: 192.168.1.84:9422:_
    chunk 7: 0000000000000022_00000001 / (id:34 ver:1)
        copy 1: 192.168.1.84:9422:_

Then, when I run mkfs.ext4 test.img, it starts logging the corruption messages.

root@ale-XPS13:/tmp/chunk1# find -name "*000000000000001B*"
./chunks00/chunk_000000000000001B_00000001.mfs
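The same check can be done for every chunk at once by deriving the expected chunk file names from the fileinfo output. This is a rough Python sketch under stated assumptions: the helper names are hypothetical, the chunk_<ID>_<VERSION>.mfs naming is inferred only from the single find result above, and the 64 MiB chunk size is an assumption that is not stated anywhere in this thread (it is consistent with the 8 chunks listed for the 500 MiB file).

```python
import re

# ASSUMPTION: 64 MiB chunks; this value is not given in the thread.
CHUNK_SIZE = 64 * 1024 * 1024

# Matches lines such as:
#   chunk 0: 000000000000001B_00000001 / (id:27 ver:1)
CHUNK_LINE_RE = re.compile(
    r"chunk (\d+): ([0-9A-F]{16})_([0-9A-F]{8}) / \(id:(\d+) ver:(\d+)\)"
)


def expected_chunk_count(file_size):
    """Number of chunks needed for file_size bytes (integer ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def expected_chunk_files(fileinfo_output):
    """Derive chunk file names from the text printed by lizardfs fileinfo."""
    names = []
    for match in CHUNK_LINE_RE.finditer(fileinfo_output):
        _index, hex_id, hex_ver, dec_id, dec_ver = match.groups()
        # Sanity check: the hex and decimal forms on one line must agree.
        if int(hex_id, 16) != int(dec_id) or int(hex_ver, 16) != int(dec_ver):
            raise ValueError(f"inconsistent fileinfo line: {match.group(0)}")
        # ASSUMPTION: naming inferred from chunk_000000000000001B_00000001.mfs.
        names.append(f"chunk_{hex_id}_{hex_ver}.mfs")
    return names


sample_fileinfo = """test.img:
    chunk 0: 000000000000001B_00000001 / (id:27 ver:1)
        copy 1: 192.168.1.84:9422:_
    chunk 1: 000000000000001C_00000001 / (id:28 ver:1)
        copy 1: 192.168.1.84:9422:_
"""

print(expected_chunk_files(sample_fileinfo))
print(expected_chunk_count(524288000))
```

Comparing the returned names with what find reports on the chunkserver disks shows which chunks, if any, are missing.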

For the metadata, do you have an email address where I can send it?

psarna commented 7 years ago

Send it to support@lizardfs.com; someone will check it tomorrow during office hours.

guestisp commented 7 years ago

Mail sent. Meanwhile, are there any other tests I can run? I would like to test Lizard...

guestisp commented 7 years ago

Wait, I've double-checked. After the master timed out the operation:

Apr 24 20:53:07 ale-XPS13 mfsmount[6360]: write file error, inode: 3, index: 0 - error sent by master server (Chunk lost) (try counter: 30)
Apr 24 20:53:07 ale-XPS13 mfsmount[6360]: error writing file number 3: EIO (Input/output error)

chunks were totally removed:

# lizardfs fileinfo /mnt/lizard/test.img
/mnt/lizard/test.img:
/mnt/lizard/test.img [0]: No such chunk
# find /tmp/ -name "*000000000000001B*"
root@ale-XPS13:/tmp# 

psarna commented 7 years ago

https://github.com/lizardfs/lizardfs/issues/526 is an issue that is also related to incorrect code generated by newer GCC versions. Ubuntu 17.04 is recent, so it ships such a GCC version as well. A fix was added here: https://github.com/lizardfs/lizardfs/commit/422175eb1aa8c2a1e4d0727ee3fe8190e1ae1340

Can you check whether this solves the case?

psarna commented 7 years ago

It is solved in 3.10.6, and the patch can also be cherry-picked onto 3.10.4 without conflicts.

guestisp commented 7 years ago

Do I have to build Lizard manually? With 3.10.6 from your repo, I would have to replace all paths and user permissions.

Isn't it possible to push 3.10.6 to the official Ubuntu repo? The version that ships with Ubuntu is useless and doesn't work at all.

psarna commented 7 years ago

I have literally no power over official Ubuntu repos :)

psarna commented 7 years ago

Also, unification of paths and permissions is coming, but it's not easy to predict all paths to avoid potential problems with LizardFS upgrades, so it will take time to prepare these patches.

guestisp commented 7 years ago

I'll try to build a package from the Ubuntu sources by cherry-picking this patch. Should I build only the master?

psarna commented 7 years ago

Master is enough for this patch to work.

guestisp commented 7 years ago

I rebuilt the package with the patch cherry-picked. It is working now.