jthornber / thin-provisioning-tools

GNU General Public License v3.0

"Check of pool vg/thinpool failed (status:64). Manual repair required!" #265

Closed. mailinglists35 closed this issue 1 year ago.

mailinglists35 commented 1 year ago

Hi, I am experiencing this after a power loss (though I am not sure the power loss was actually the cause). The root filesystem is on a thin LV, and the LV fails to activate (so I cannot even read the logs of what happened before).

The commands below were run while booted from the sysresccd v10 ISO.

Is this a correctable/repairable issue, and if so, how? If it is not repairable, is there any way to rescue the LV data?

Thank you

[root@sysrescue ~]# lvmconfig
config {
}
local {
}
dmeventd {
}
activation {
        thin_pool_autoextend_threshold=90
        thin_pool_autoextend_percent=5
}
global {
        thin_check_options=["--ignore-non-fatal-errors","--clear-needs-check-flag"]
}
shell {
}
backup {
}
log {
        activation=1
}
allocation {
}
devices {
}

[root@sysrescue ~]# uname -r
6.1.20-1-lts
[root@sysrescue ~]# thin_check --version
thin_check 1.0.3

[root@sysrescue ~]# pvs
  PV         VG     Fmt  Attr PSize  PFree
  /dev/sdd2  mainvg lvm2 a--  <1.82t 870.23g

[root@sysrescue ~]# lvs -v
  LV          VG     #Seg Attr       LSize    Maj Min KMaj KMin Pool     Origin Data%  Meta%  Move Cpy%Sync Log Convert LV UUID                                LProfile
  exos.zcache mainvg    1 Vwi---tz--   32.00g  -1  -1   -1   -1 thinpool                                                JaPbHo-P7Te-cF8Y-cble-eEmf-Qy8Q-tACOsk
  exos.zlog   mainvg    1 Vwi---tz--    4.00g  -1  -1   -1   -1 thinpool                                                UGwx7D-XZlj-NrFM-fgpO-q5qD-PqAA-3g5ycF
  rootfs      mainvg    1 Vwi---tz--  128.00g  -1  -1   -1   -1 thinpool                                                ca9iq6-MYe3-qOdQ-XE5h-fSbN-nOh0-3eKWtq
  small       mainvg    1 Vwi---tz--   64.00g  -1  -1   -1   -1 thinpool                                                jEguxm-Cbvu-AVO9-bEsj-16wl-Z7aQ-mZqyAm
  swap        mainvg    1 Vwi---tz--   32.00g  -1  -1   -1   -1 thinpool                                                dEmPCV-r4oX-BTKU-DzWC-NXFt-tcDg-d2S0EU
  thinpool    mainvg    1 twi---tz-- <988.10g  -1  -1   -1   -1                                                         6URKTy-2232-vNXZ-itDX-3wxa-s445-b3XfNS thin-performance
  windows_vm  mainvg    1 Vwi---tz-- <256.13g  -1  -1   -1   -1 thinpool                                                NzKsD7-7X7g-uRXr-WCkx-zb2G-Rlio-rClJuh
  zfs.cache   mainvg    1 Vwi---tz--  256.00g  -1  -1   -1   -1 thinpool                                                hVL6CS-vMfd-k4dJ-2w2D-UGGc-aIUD-Oeg3S1
  zfs.logs    mainvg    1 Vwi---tz--    4.00g  -1  -1   -1   -1 thinpool                                                27nZ91-hLsX-FOHh-qVJR-h3Rw-snpq-6USe52

[root@sysrescue ~]# lvchange -ay mainvg/thinpool
TRANSACTION_ID=32
METADATA_FREE_BLOCKS=57711
1 nodes in data mapping tree contain errors
0 io errors, 0 checksum errors
Thin device 1 has 1 error nodes and is missing 17817 mappings, while expected 142318
Check of mappings failed
  Check of pool mainvg/thinpool failed (status:64). Manual repair required!

[root@sysrescue ~]# lvchange -vvvvay mainvg/rootfs
[attached because GitHub allows a maximum of 65536 chars]

lvchange-vvvay.txt

Attached lvmdump as well lvmdump-sysrescue-20230526204414.zip

mailinglists35 commented 1 year ago

I've read https://github.com/jthornber/thin-provisioning-tools/issues/164 and https://access.redhat.com/solutions/3251681, but they address status 1, while I have status 64.

mailinglists35 commented 1 year ago
[root@sysrescue ~]# thin_dump --repair /dev/mapper/mainvg-thinpool_tmeta > /tmp/repaired.xml
node error: keys out of order BASVugEAEQ==
mailinglists35 commented 1 year ago
[root@sysrescue ~]# lvchange -ay mainvg/thinpool_tmeta
Do you want to activate component LV in read-only mode? [y/n]: y
  Allowing activation of component LV.

[root@sysrescue ~]# lvcreate -L1024M mainvg --name newmetaLV
  /dev/mainvg/newmetaLV: not found: device not cleared
  Aborting. Failed to wipe start of new LV.
mingnus commented 1 year ago

Hi,

We have addressed this issue in a recent commit. Could you please try building the pdata_tools binary, then run thin_dump --repair again?

Here are the build instructions. You might have to build it on another machine, then copy the built binary to your rescue environment via a USB drive or something. https://listman.redhat.com/archives/lvm-devel/2023-May/024788.html

mailinglists35 commented 1 year ago

Thank you. I am in the middle of reinstalling the OS on a separate drive; once it's up I will attach the affected original drive and try the guide. I will come back with feedback.

mailinglists35 commented 1 year ago
$ sudo ./pdata_tools thin_dump --repair /dev/mapper/mainvg-thinpool_tmeta
no compatible roots found
mailinglists35 commented 1 year ago

I have set up a dnf chroot in OL9 and built it, but I am getting the above output when running pdata_tools:

sudo -E dnf --installroot=/var/local/ol9chroot/ --releasever=9 group install minimal-environment
sudo systemd-nspawn -D /var/local/ol9chroot/
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
yum install git
git clone https://github.com/jthornber/thin-provisioning-tools.git
cd thin-provisioning-tools/
yum install cargo
cargo build --release
# cargo build --release
    Updating crates.io index
    Updating git repository `https://github.com/zowens/crc32c`
    Updating git repository `https://github.com/jthornber/rio`
  Downloaded futures-io v0.3.28
  Downloaded memchr v2.5.0
  Downloaded autocfg v1.1.0
  Downloaded rustc_version v0.4.0
  Downloaded termcolor v1.2.0
  Downloaded semver v1.0.17
  Downloaded minimal-lexical v0.2.1
  Downloaded thiserror-impl v1.0.40
  Downloaded flate2 v1.0.25
  Downloaded cfg-if v1.0.0
  Downloaded atty v0.2.14
  Downloaded clap v3.2.23
  Downloaded futures-executor v0.3.28
  Downloaded futures-task v0.3.28
  Downloaded indexmap v1.9.3
  Downloaded hashbrown v0.12.3
  Downloaded futures-sink v0.3.28
  Downloaded data-encoding v2.3.3
  Downloaded clap_lex v0.2.4
  Downloaded miniz_oxide v0.6.2
  Downloaded pin-utils v0.1.0
  Downloaded ppv-lite86 v0.2.17
  Downloaded rand v0.8.5
  Downloaded rand_core v0.6.4
  Downloaded rand_chacha v0.3.1
  Downloaded quote v1.0.26
  Downloaded proc-macro2 v1.0.56
  Downloaded indicatif v0.17.3
  Downloaded slab v0.4.8
  Downloaded unicode-segmentation v1.10.1
  Downloaded zstd-safe v5.0.2+zstd.1.5.2
  Downloaded byteorder v1.4.3
  Downloaded unicode-width v0.1.10
  Downloaded zstd v0.11.2+zstd.1.5.2
  Downloaded unicode-ident v1.0.8
  Downloaded futures-channel v0.3.28
  Downloaded crc32fast v1.3.2
  Downloaded jobserver v0.1.26
  Downloaded number_prefix v0.4.0
  Downloaded num-derive v0.3.3
  Downloaded lazy_static v1.4.0
  Downloaded portable-atomic v0.3.19
  Downloaded thiserror v1.0.40
  Downloaded threadpool v1.8.1
  Downloaded roaring v0.10.1
  Downloaded retain_mut v0.1.7
  Downloaded tui v0.16.0
  Downloaded termion v1.5.6
  Downloaded numtoa v0.1.0
  Downloaded rangemap v1.3.0
  Downloaded zstd-sys v2.0.8+zstd.1.5.5
  Downloaded quick-xml v0.23.1
  Downloaded libc v0.2.142
  Downloaded syn v1.0.109
  Downloaded textwrap v0.16.0
  Downloaded futures-core v0.3.28
  Downloaded anyhow v1.0.70
  Downloaded getrandom v0.2.9
  Downloaded futures-util v0.3.28
  Downloaded futures-macro v0.3.28
  Downloaded console v0.15.5
  Downloaded fixedbitset v0.4.2
  Downloaded cc v1.0.79
  Downloaded bitflags v1.3.2
  Downloaded bytemuck v1.13.1
  Downloaded adler v1.0.2
  Downloaded syn v2.0.15
  Downloaded strsim v0.10.0
  Downloaded safemem v0.3.3
  Downloaded pkg-config v0.3.26
  Downloaded os_str_bytes v6.5.0
  Downloaded pin-project-lite v0.2.9
  Downloaded num_cpus v1.15.0
  Downloaded num-traits v0.2.15
  Downloaded nom v7.1.3
  Downloaded iovec v0.1.4
  Downloaded futures v0.3.28
  Downloaded exitcode v1.1.2
  Downloaded cassowary v0.3.0
  Downloaded base64 v0.20.0
  Downloaded 80 crates (4.9 MB) in 1.59s
  Downloaded duct v0.13.6
  Downloaded 1 crate (29.3 KB) in 0.17s
  Downloaded json v0.12.4
  Downloaded 1 crate (105.9 KB) in 0.16s
  Downloaded mockall v0.11.4
  Downloaded 1 crate (22.4 KB) in 0.15s
  Downloaded quickcheck v0.9.2
  Downloaded 1 crate (27.4 KB) in 0.90s
  Downloaded quickcheck_macros v0.9.1
  Downloaded 1 crate (4.2 KB) in 0.79s
  Downloaded tempfile v3.5.0
  Downloaded 1 crate (31.1 KB) in 0.16s
   Compiling libc v0.2.142
   Compiling autocfg v1.1.0
   Compiling proc-macro2 v1.0.56
   Compiling unicode-ident v1.0.8
   Compiling quote v1.0.26
   Compiling memchr v2.5.0
   Compiling pkg-config v0.3.26
   Compiling cfg-if v1.0.0
   Compiling semver v1.0.17
   Compiling futures-core v0.3.28
   Compiling futures-task v0.3.28
   Compiling futures-channel v0.3.28
   Compiling futures-util v0.3.28
   Compiling slab v0.4.8
   Compiling futures-sink v0.3.28
   Compiling indexmap v1.9.3
   Compiling rustc_version v0.4.0
   Compiling syn v2.0.15
   Compiling pin-project-lite v0.2.9
   Compiling pin-utils v0.1.0
   Compiling syn v1.0.109
   Compiling getrandom v0.2.9
   Compiling jobserver v0.1.26
   Compiling portable-atomic v0.3.19
   Compiling zstd-safe v5.0.2+zstd.1.5.2
   Compiling crc32fast v1.3.2
   Compiling cc v1.0.79
   Compiling futures-io v0.3.28
   Compiling crc32c v0.6.3 (https://github.com/zowens/crc32c?branch=master#3779fe88)
   Compiling rand_core v0.6.4
   Compiling num-traits v0.2.15
   Compiling thiserror v1.0.40
   Compiling hashbrown v0.12.3
   Compiling anyhow v1.0.70
   Compiling os_str_bytes v6.5.0
   Compiling adler v1.0.2
   Compiling ppv-lite86 v0.2.17
   Compiling unicode-width v0.1.10
   Compiling lazy_static v1.4.0
   Compiling console v0.15.5
   Compiling clap_lex v0.2.4
   Compiling miniz_oxide v0.6.2
   Compiling rand_chacha v0.3.1
   Compiling num_cpus v1.15.0
   Compiling atty v0.2.14
   Compiling textwrap v0.16.0
   Compiling zstd-sys v2.0.8+zstd.1.5.5
   Compiling byteorder v1.4.3
   Compiling minimal-lexical v0.2.1
   Compiling bytemuck v1.13.1
   Compiling bitflags v1.3.2
   Compiling number_prefix v0.4.0
   Compiling strsim v0.10.0
   Compiling retain_mut v0.1.7
   Compiling termcolor v1.2.0
   Compiling clap v3.2.23
   Compiling roaring v0.10.1
   Compiling indicatif v0.17.3
   Compiling nom v7.1.3
   Compiling threadpool v1.8.1
   Compiling flate2 v1.0.25
   Compiling rand v0.8.5
   Compiling num-derive v0.3.3
   Compiling iovec v0.1.4
   Compiling futures-macro v0.3.28
   Compiling thiserror-impl v1.0.40
   Compiling quick-xml v0.23.1
   Compiling rangemap v1.3.0
   Compiling data-encoding v2.3.3
   Compiling fixedbitset v0.4.2
   Compiling base64 v0.20.0
   Compiling exitcode v1.1.2
   Compiling safemem v0.3.3
   Compiling futures-executor v0.3.28
   Compiling futures v0.3.28
   Compiling zstd v0.11.2+zstd.1.5.2
   Compiling thinp v1.0.4 (/root/thin-provisioning-tools)
    Finished release [optimized + debuginfo] target(s) in 4m 03s

Then exit the chroot, cd to /var/local/ol9chroot/root/thin-provisioning-tools/target/release, and run ./pdata_tools.
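For reference, those steps as a minimal sketch (paths taken from the chroot setup above; the metadata LV path is the one used throughout this thread):

exit   # leave the systemd-nspawn container
cd /var/local/ol9chroot/root/thin-provisioning-tools/target/release
./pdata_tools thin_dump --repair /dev/mapper/mainvg-thinpool_tmeta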

mingnus commented 1 year ago
$ sudo ./pdata_tools thin_dump --repair /dev/mapper/mainvg-thinpool_tmeta
no compatible roots found

Could you please run thin_dump with the -v option? ./pdata_tools thin_dump -v --repair /dev/mapper/mainvg-thinpool_tmeta

If possible, could you send me the compressed metadata file so I can take a look? ./pdata_tools thin_metadata_pack -i /dev/mapper/mainvg-thinpool_tmeta -o tmeta.pack

mikedilger commented 1 year ago

I have just experienced a "Check of pool qubes_dom0/root-pool failed (status:64). Manual repair required!" in an entirely different context (I'm not using this repo at all; I'm trying to recover a broken Qubes OS). I find it interesting that this issue was the only Google hit for this error with status:64, and that it's happening to someone else in the last few days too. It must be a recent bug, perhaps in LVM. I'm on Arch Linux.

mingnus commented 1 year ago

If you're talking about the status code itself, it's because the recently released thin-provisioning-tools v1.0.x uses a different exit code than the 1 used in previous versions. That's why you don't see many search results with status:64 so far. You could run the LVM commands with -vvv to make sure the status:64 came from thin_check.
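As a hedged sketch of that suggestion (pool name from this thread; the grep is only illustrative, since the exact log line may vary):

# activate verbosely and look for the thin_check invocation and its exit status
lvchange -ay mainvg/thinpool -vvv 2>&1 | grep -i thin_check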

On the other hand, the new thin_check performs a more comprehensive check on the metadata, so you might experience errors after upgrading the tools. You can choose to either fix or ignore those non-fatal errors by using the --auto-repair or --ignore-non-fatal-errors options. See issue #242 for more information.
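Those options can also be applied automatically at activation time through lvm.conf. A minimal sketch, mirroring the reporter's lvmconfig output earlier in this thread (/etc/lvm/lvm.conf is the usual location):

# /etc/lvm/lvm.conf
global {
        thin_check_options = [ "--ignore-non-fatal-errors", "--clear-needs-check-flag" ]
}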

mikedilger commented 1 year ago

Oh, I entirely misunderstood. Yes, these thin_* tools are on Arch Linux; I didn't realize. Anyhow, I did my LVM work on an older system and had no issues. I just thought my data point might be relevant, but let me not hijack this. Good luck.

mailinglists35 commented 1 year ago
$ sudo ./pdata_tools thin_dump --repair /dev/mapper/mainvg-thinpool_tmeta
no compatible roots found

Could you please run thin_dump with the -v option? ./pdata_tools thin_dump -v --repair /dev/mapper/mainvg-thinpool_tmeta

If possible, could you send me the compressed metadata file so I can take a look? ./pdata_tools thin_metadata_pack -i /dev/mapper/mainvg-thinpool_tmeta -o tmeta.pack

Sure. Here is the output:

$ sudo ./pdata_tools thin_dump -v --repair /dev/mapper/mainvg-thinpool_tmeta
mapping candidates (0):

device candidates (2639):
b=148, nr_devices=8, nr_mappings=9231990, max_tid=30, age=1
b=306, nr_devices=8, nr_mappings=9231736, max_tid=30, age=1
#[...] long list text suppressed
b=128272, nr_devices=8, nr_mappings=9234495, max_tid=30, age=1
b=128610, nr_devices=8, nr_mappings=9233372, max_tid=30, age=1
b=128936, nr_devices=8, nr_mappings=9232379, max_tid=30, age=1

compatible roots (0):
no compatible roots found

Here is the uploaded tmeta.pack: https://pastefile.com/ftcc83 (34MB; GitHub only allows 25MB). The password is case-sensitive: exactly the name that appears under your GitHub profile, including the dash and space.

Note: you will observe that the VG name is actually different; I wanted to protect its name online.

mingnus commented 1 year ago

Hi,

I would like to know: is the data in the thin device with id#0 valuable? Thin device 0 was affected by a few broken metadata blocks, while the others are fine. The easiest way to rebuild the metadata is to drop thin device 0, so you would still have the other volumes readable. If that is not acceptable, I could help recover the mappings for device 0.

mailinglists35 commented 1 year ago

I am unsure how to identify the device with id#0 inside the thin pool, but if they are sorted by ID and id#0 is the first on the list, it can be safely dropped (exos.zcache):

  LV               VG     Attr       LSize    Pool     Origin Data%  Meta%  Move Log Cpy%Sync Convert
  exos.zcache      mainvg Vwi---tz--   32.00g thinpool
  exos.zlog        mainvg Vwi---tz--    4.00g thinpool
  [lvol0_pmspare]  mainvg ewi-------  136.00m
  rootfs           mainvg Vwi---tz--  128.00g thinpool
  small            mainvg Vwi---tz--   64.00g thinpool
  swap             mainvg Vwi---tz--   32.00g thinpool
  thinpool         mainvg twi---tz-- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  windows_vm       mainvg Vwi---tz-- <256.13g thinpool
  zfs.cache        mainvg Vwi---tz--  256.00g thinpool
  zfs.logs         mainvg Vwi---tz--    4.00g thinpool
mailinglists35 commented 1 year ago

Oh, they're sorted alphabetically. Does the metadata allow you to map the ID to one of the names above?

mailinglists35 commented 1 year ago

The lvs manual page only says how to sort, with no detail on which criteria are available to sort by:

       -O|--sort String
              Comma-separated ordered list of columns to sort by. Replaces the default selection. Precede any column with - for a reverse sort on that column.
mailinglists35 commented 1 year ago

Oh, it displays the available columns if you pass -O anytexthere. Maybe this? But I see no ID 0; they begin with 1?

$ sudo lvs -a mainvg -O thin_id -o+thin_id
  LV               VG     Attr       LSize    Pool     Origin Data%  Meta%  Move Log Cpy%Sync Convert ThId
  swap             mainvg Vwi---tz--   32.00g thinpool                                                   1
  rootfs           mainvg Vwi---tz--  128.00g thinpool                                                   2
  small            mainvg Vwi---tz--   64.00g thinpool                                                   3
  zfs.logs         mainvg Vwi---tz--    4.00g thinpool                                                   4
  zfs.cache        mainvg Vwi---tz--  256.00g thinpool                                                   5
  windows_vm       mainvg Vwi---tz-- <256.13g thinpool                                                   7
  exos.zcache      mainvg Vwi---tz--   32.00g thinpool                                                   8
  exos.zlog        mainvg Vwi---tz--    4.00g thinpool                                                   9
  thinpool         mainvg twi---tz-- <988.10g
  [lvol0_pmspare]  mainvg ewi-------  136.00m
  [lvol0_pmspare]  mainvg ewi-------  136.00m
  [lvol0_pmspare]  mainvg ewi-------  136.00m
  [lvol0_pmspare]  mainvg ewi-------  136.00m
  [lvol0_pmspare]  mainvg ewi-------  136.00m
  [lvol0_pmspare]  mainvg ewi-------  136.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tmeta] mainvg eRi-a-----  504.00m
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
  [thinpool_tdata] mainvg Twi------- <988.10g
mailinglists35 commented 1 year ago

If swap is thin device 0, then yes, it can be safely abandoned.

mingnus commented 1 year ago

Oh, it displays the available columns if you pass -O anytexthere. Maybe this? But I see no ID 0; they begin with 1?

My mistake... you're right, it's the swap volume, with thin_id 1.

You can choose the devices you want with the --dev-id option; e.g., here we want to dump all the devices except id 1:

./pdata_tools thin_dump tmeta.bin  --dev-id 2 --dev-id 3 --dev-id 4 --dev-id 5 --dev-id 7 --dev-id 8 --dev-id 9 -o dump.xml

(I know it looks stupid, especially when you have hundreds of devices; the option just wasn't designed for restoration.)
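For a pool with many devices, the repeated flags could be generated instead of typed out; a hedged shell sketch (the ID list and file names are the ones from this thread):

# build the repeated --dev-id flags from a list of IDs to keep
ids="2 3 4 5 7 8 9"
args=$(for i in $ids; do printf -- '--dev-id %s ' "$i"; done)
./pdata_tools thin_dump tmeta.bin $args -o dump.xml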

The md5sum for dump.xml is 0f8253a136a52131c494bbc0f2e04bac

In addition, if you would like to make the swap volume available (but empty), just put the following two lines after the <superblock> tag:

  <device dev_id="1" mapped_blocks="142318" transaction="4" creation_time="0" snap_time="0">
  </device>

Restore the metadata into a newly created metadata volume, swap it into the thin pool, and you should be able to access the pool:

lvcreate mainvg --size 512m --name oldmeta
thin_restore -i dump.xml -o /dev/mapper/mainvg-oldmeta
lvconvert mainvg/thinpool --swapmetadata --poolmetadata mainvg/oldmeta
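Optionally, the restored metadata can be sanity-checked before the swap; a minimal sketch, assuming the LV names from the commands above:

# verify the freshly restored metadata before swapping it into the pool
thin_check /dev/mapper/mainvg-oldmeta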
mailinglists35 commented 1 year ago

thank you so much!

I am now waiting for a full-disk dd backup to finish, then I will report back the result of the restore.

mailinglists35 commented 1 year ago

Done! The final step was to lvremove oldmeta, then vgchange -an mainvg / vgchange -ay mainvg and lvchange -ay. Thank you so much @mingnus!
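For anyone following along, those closing steps as a sketch (LV names as used throughout; after the swap, oldmeta holds the old, broken metadata):

lvremove mainvg/oldmeta      # discard the swapped-out broken metadata
vgchange -an mainvg          # deactivate the VG
vgchange -ay mainvg          # reactivate it with the repaired pool metadata
lvchange -ay mainvg/rootfs   # then activate the thin LVs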

If anyone else hits this particular case: the tmeta.bin referenced above is actually the dev-mapper metadata LV name; the developer used it on a plain file while reproducing and solving the issue.

mailinglists35 commented 1 year ago

As a follow-up suggestion: do you think the work you did to identify which LVs are recoverable could be automated and merged into the project? That is, to list the recoverable LVs with some command, then run the thin_dump --dev-id command.

mingnus commented 1 year ago

Sure, it's worth making it fully automated. I would prefer to turn it into a new tool like thin_rescue or something, to avoid confusion with the current thin_dump/thin_repair.

Another question from me: what distro & kernel version were you using to run the thin pool? Had you experienced similar errors before (where thin_check doesn't pass)?

mailinglists35 commented 1 year ago

It was Ubuntu 18.04 with the HWE kernel 5.15.

I had never experienced such a situation. (I did not have automatic shutdown configured, and when the power went off and I knew I was on borrowed UPS time, I started manually shutting down services instead of issuing a poweroff. The UPS died before I finished my manual service-stop routine; there must have been something writing to swap at that time, which was the corrupted LV.)

PS: and to anyone else reading here: yes, I had the write cache enabled. If you care about your data, don't do what I did; disable the write cache OR ensure your power redundancy works until shutdown.

jsachs commented 5 months ago

How were you able to determine the problematic thin ID above? I'm currently running into the same issue with "no compatible roots found".

mingnus commented 5 months ago

thin_check v1.0.x does the job. It shows that device ID#1 was affected in this case:

TRANSACTION_ID=32
METADATA_FREE_BLOCKS=57711
1 nodes in data mapping tree contain errors
0 io errors, 0 checksum errors
Thin device 1 has 1 errors and is missing 17817 mappings, while expected 142318
Check of mappings failed
jsachs commented 5 months ago

Once I've identified the bad devices, is there any hope of doing a manual cleanup on my own? --auto-repair reports that there is a bad superblock, and thin_check sees issues in two different devices (they are both Proxmox VMs, so things I'd like to try to recover).

mingnus commented 5 months ago

If there's any mapping candidate listed by thin_dump --repair -v, you could try the recent upstream commit that helps repair the device details tree. Alternatively, you could dump incomplete mappings with thin_dump v0.9.0 if you accept the risk of data loss (this feature is a bit tricky, so it was removed from v1.0). Or you could send me the packed metadata if you're not sure what to do.

jsachs commented 5 months ago

Running thin_dump resulted in the same "no compatible roots" issue as above.

Running thin_metadata_pack produced a file, but also a generic I/O error on completion.

The .pack file is at pastefile.com/dw24dd

jsachs commented 5 months ago
user@debian:~$ sudo thin_check /dev/mapper/pve-data_tmeta --auto-repair -v
TRANSACTION_ID=105
METADATA_FREE_BLOCKS=373659
number of devices to check: 9
nr internal nodes: 22
nr leaves: 2978
Thin device 5 has 1 errors and is missing 247 mappings, while expected 68360
Thin device 7 has 17 errors and is missing 3348 mappings, while expected 130056
Check of mappings failed
mingnus commented 5 months ago

What kind of error messages did you get from thin_metadata_pack? I'm wondering whether there were I/O errors in reading the source metadata device, which would explain why I get different thin_check outputs than yours:

# thin_check -v tmeta.bin
TRANSACTION_ID=105
METADATA_FREE_BLOCKS=373659
number of devices to check: 9
nr internal nodes: 18
nr leaves: 1614
742 nodes in data mapping tree contain errors
0 io errors, 742 checksum errors
Thin device 2 has 81 errors and is missing 86964 mappings, while expected 135410
Thin device 5 has 31 errors and is missing 6289 mappings, while expected 68360
Thin device 7 has 255+ errors and is missing 53080 mappings, while expected 130056
Thin device 8 has 103 errors and is missing 19837 mappings, while expected 32781
Thin device 9 has 71 errors and is missing 14386 mappings, while expected 17997
Thin device 10 has 108 errors and is missing 45454 mappings, while expected 62684
Thin device 12 has 68 errors and is missing 38761 mappings, while expected 99297
Check of mappings failed

Do you get consistent outputs from thin_check across different runs?

jsachs commented 5 months ago

thin_check gives the same output every time for me.

There is no verbose option for thin_metadata_pack, I don't think, but I get:

"Input/output error (os error 5)".

This error is non-deterministic for me: sometimes it is raised quickly, and other times only after a much longer time.

jthornber commented 5 months ago

It really sounds like you have hardware problems. Perhaps try stracing thin_metadata_pack to see what's failing (and repeat to see if the failure is the same).

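A minimal sketch of that suggestion (device path from earlier in this thread; the syscall filter is only a reasonable starting point):

# trace file I/O syscalls to pinpoint the failing read; repeat to compare failures
strace -f -e trace=openat,read,pread64 -o pack.strace \
    thin_metadata_pack -i /dev/mapper/pve-data_tmeta -o tmeta.pack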

mingnus commented 5 months ago

thin_metadata_pack uses buffered I/O. The errors in the packed metadata might come from short reads when reading a series of blocks. Maybe you could check the error counters in thin_check's logs (# io errors, # checksum errors), and try stracing thin_check as well.
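To separate device-level read errors from metadata corruption, a hedged sketch (device path from this thread; direct I/O bypasses the page cache, so the kernel's own read errors surface immediately):

# read the whole metadata device once with direct I/O and watch for errors
dd if=/dev/mapper/pve-data_tmeta of=/dev/null bs=4M iflag=direct status=progress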