openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs snapshot - and postgres 10.2 leading to corrupted data files #7461

Closed: thoro closed this issue 4 years ago

thoro commented 6 years ago

System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7
Linux Kernel          4.12.5 #1 SMP Fri Aug 11 09:55:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Architecture          x64
ZFS Version           0.7.0-182_gc0daec3
SPL Version           0.7.0-20_g35a44fc

Describe the problem you're observing

When creating a snapshot with zfs snapshot while a high-IOPS Postgres 10.2 instance is writing data to the file system, files get corrupted.

I'm not sure whether this is related to the non-release version we're running; if a relevant commit is known, please reference it.

Describe how to reproduce the problem

I can't give exact reproduction steps, but it has already happened at least three times, each time right after a zfs snapshot was taken.
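
If I had to script an attempt, it would be something like this (a sketch with a placeholder dataset name; the write load comes from the live Postgres instance):

# snapshot the dataset once a minute while Postgres is writing heavily
while true; do
    zfs snapshot pool/dataset@repro-$(date +%s)
    sleep 60
done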

Include any warning/errors/backtraces from the system logs

beren12 commented 6 years ago

The snapshotted files get corrupted or the live files?

thoro commented 6 years ago

Live files.

New update:

I started a zfs send immediately after taking the snapshot mentioned before; our DB showed immediate corruption errors in indices and in one table. We recreated the indices to at least keep operating (there are enough backups etc.).

After the zfs send was canceled (the connection to the server was interrupted), the affected table went back to normal without a corruption error (I had a query with which I could consistently reproduce the error). Maybe Postgres did something, or maybe ZFS; I'm not sure.

BTW, the modprobe parameters:

options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
options zfs zfs_vdev_sync_write_min_active=64
options zfs zfs_vdev_sync_write_max_active=128
options zfs zfs_vdev_sync_read_min_active=64
options zfs zfs_vdev_sync_read_max_active=128
options zfs zfs_vdev_async_read_min_active=64
options zfs zfs_vdev_async_read_max_active=128
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=5
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=64
options zfs zfs_prefetch_disable=1
options zfs zfs_compressed_arc_enabled=0

Maybe there's some old option which shouldn't be used anymore?
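
For reference, the values actually in effect can be checked at runtime through sysfs:

# print each zfs module parameter file with its current value
grep . /sys/module/zfs/parameters/* 2>/dev/null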

ryao commented 6 years ago

@thoro To be clear here, are you saying that the running PostgreSQL instance on system A, whose database was snapshotted, began to experience corruption when you began a zfs send, and that things went back to normal when you cancelled the send?

I am sure that you have an issue, but I want to be absolutely certain that the corruption was not detected by a PostgreSQL instance on the receiving side. That scenario would indicate a regression in PostgreSQL that breaks ACID compliance.

Presently, I cannot imagine a scenario in which a zfs send should affect the active filesystem. There might be some extremely subtle nuance in our interaction with the page cache, but it is too early to tell. I notice that you are not using the 3.10 kernel that CentOS normally uses. Would you be able to try switching back to that and seeing whether the issues persist? CentOS' 3.10 kernel is far better tested, and since I have my hands full with #7423, I cannot promise to have time to do much here in the short term.

You are running an older version of ZoL HEAD. Is there any specific reason for that? If not, and your pool is not using any newer features than those supported in 0.7.x, you might want to switch to 0.7.8. I have been sufficiently out of the loop that I am not familiar enough with the development of HEAD over the past several months to know whether anything like this has been fixed, but having you on either 0.7.8 or a more recent HEAD would help the ZFSOnLinux team eliminate variables we need to consider.

The same goes for CentOS' 3.10 kernel versus 4.12.5. If things are fine on 3.10, but have issues on 4.12.5, we would want to look at what is different between the two kernels in the code paths that PostgreSQL uses. Also, I want to be clear that we do support 4.12.5, but knowing whether 3.10 is affected would help reduce the amount of developer time needed on this and enable us to figure out what is going wrong sooner.

Also, since I have my hands full with #7423, other project members will likely handle this one. This is just some initial triage to help them take care of this faster.

thoro commented 6 years ago

@ryao First off, it's not a critical issue for our system, since we don't rely on the snapshotting feature, but if it's a bug we should figure out why it happens! I will set up a test system and see if I can reproduce it.

I'm not sure whether zfs snapshot or zfs send had any effect, but I noticed that our logs started showing the corruptions a few minutes after the snapshot was taken. The snapshot was taken on the Postgres system showing the corruption errors. The target of the zfs send was a different server (sending via mbuffer); that one was idling.

To answer the other questions:

Regarding the kernel: I will set up the test system and see if I can reproduce this; hopefully pgbench produces the same load pattern.
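
Something along these lines (a sketch; the database name, scale factor, and client count are guesses at our load):

pgbench -i -s 1000 testdb          # initialize a roughly 16 GB test database
pgbench -c 32 -j 8 -T 600 testdb   # 32 clients, 8 threads, run for 10 minutes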

thoro commented 6 years ago

We couldn't reproduce it on a test system, but yesterday I switched the databases again and tried to restore the second system from a zfs snapshot made immediately afterwards.

Seconds after the snapshot was taken, Postgres indices were corrupted with the message: "SQLSTATE[XX001]: Data corrupted: 7 ERROR: invalid page in block 36451 of relation base/16400/112770"

At first I thought it was just the indices, but data files were affected as well. I would also revise my earlier statement about the zfs send: something else must have interfered back then, and now I would say it's purely the snapshot command; after stopping the zfs send and deleting the snapshot, nothing was fixed.
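
For reference, the numbers in that error can be mapped back to objects like this (a sketch; the database name is hypothetical):

# which database has OID 16400?
psql -c "SELECT datname FROM pg_database WHERE oid = 16400;"
# within that database, which relation is stored in file 112770?
psql -d mydb -c "SELECT relname FROM pg_class WHERE relfilenode = 112770;"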

This time I was using a released version, and not a HEAD snapshot.

filename:       /lib/modules/4.12.5/extra/zfs/zfs/zfs.ko
version:        0.7.8-1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     6DFACEA9ADD42F919B32C68
depends:        spl,znvpair,zcommon,zunicode,zavl,icp
vermagic:       4.12.5 SMP mod_unload modversions
rincebrain commented 6 years ago

It's not clear to me what you're trying to describe here.

If I understand correctly, you're saying that

According to Postgres's own documentation on backing up the filesystem:

The database server must be shut down in order to get a usable backup.

It also observes:

An alternative file-system backup approach is to make a “consistent snapshot” of the data directory, if the file system supports that functionality (and you are willing to trust that it is implemented correctly). The typical procedure is to make a “frozen snapshot” of the volume containing the database, then copy the whole data directory (not just parts, see above) from the snapshot to a backup device, then release the frozen snapshot. This will work even while the database server is running. However, a backup created in this way saves the database files in a state as if the database server was not properly shut down; therefore, when you start the database server on the backed-up data, it will think the previous server instance crashed and will replay the WAL log. This is not a problem; just be aware of it (and be sure to include the WAL files in your backup). If your database is spread across multiple file systems, there might not be any way to obtain exactly-simultaneous frozen snapshots of all the volumes. For example, if your data files and WAL log are on different disks, or if tablespaces are on different file systems, it might not be possible to use snapshot backup because the snapshots must be simultaneous. Read your file system documentation very carefully before trusting the consistent-snapshot technique in such situations.

You haven't said anything about the layout of the filesystems or the Postgres data locations, so I can't begin to guess which of these applies. But since our performance documentation says you should put the WAL and data on different datasets (for perfectly good reasons), you would need zfs snapshot -r to snapshot all the filesystems you're trying to back up at precisely the same time, and the same caveats about Postgres treating it like a live system whose power was pulled still apply.
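
Concretely, the atomic variant looks like this (a sketch with your pool name; recursive snapshots are created atomically, so every dataset is captured at the same instant):

# one atomic snapshot across every dataset under the pool
zfs snapshot -r storage@backup-20180915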

thoro commented 6 years ago

I'm aware of all of this, but the data corruption affects system1 and pool1, so all the suggestions regarding Postgres backup procedures are irrelevant; the file system shouldn't corrupt its live data when taking a snapshot, that's the main staple of ZFS...

system2 and pool2 also show these data errors in the zfs-received filesystem.

The structure is as simple as it gets:

Single directory, single file system, 16K recordsize, lz4 compression
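
(For anyone trying to reproduce: an equivalent dataset could be created roughly like this; in our case most of these properties are inherited from the parent dataset.)

zfs create -o recordsize=16K -o compression=lz4 -o atime=off -o xattr=sa storage/database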

Here are all the settings; maybe you see something obviously wrong:

storage/database  type                  filesystem                       -
storage/database  creation              Thu Apr 19 12:22 2018            -
storage/database  used                  629G                             -
storage/database  available             254G                             -
storage/database  referenced            629G                             -
storage/database  compressratio         2.22x                            -
storage/database  mounted               yes                              -
storage/database  quota                 none                             default
storage/database  reservation           none                             default
storage/database  recordsize            16K                              inherited from storage
storage/database  mountpoint            /data/fast/creamdata-db04/data/  local
storage/database  sharenfs              off                              default
storage/database  checksum              on                               default
storage/database  compression           lz4                              inherited from storage
storage/database  atime                 off                              inherited from storage
storage/database  devices               on                               default
storage/database  exec                  on                               default
storage/database  setuid                on                               default
storage/database  readonly              off                              default
storage/database  zoned                 off                              default
storage/database  snapdir               hidden                           default
storage/database  aclinherit            restricted                       default
storage/database  createtxg             1159021                          -
storage/database  canmount              on                               default
storage/database  xattr                 sa                               inherited from storage
storage/database  copies                1                                default
storage/database  version               5                                -
storage/database  utf8only              off                              -
storage/database  normalization         none                             -
storage/database  casesensitivity       sensitive                        -
storage/database  vscan                 off                              default
storage/database  nbmand                off                              default
storage/database  sharesmb              off                              default
storage/database  refquota              none                             default
storage/database  refreservation        none                             default
storage/database  guid                  8214081110063784152              -
storage/database  primarycache          all                              default
storage/database  secondarycache        all                              default
storage/database  usedbysnapshots       0B                               -
storage/database  usedbydataset         629G                             -
storage/database  usedbychildren        0B                               -
storage/database  usedbyrefreservation  0B                               -
storage/database  logbias               latency                          default
storage/database  dedup                 off                              default
storage/database  mlslabel              none                             default
storage/database  sync                  standard                         default
storage/database  dnodesize             legacy                           default
storage/database  refcompressratio      2.22x                            -
storage/database  written               629G                             -
storage/database  logicalused           1.36T                            -
storage/database  logicalreferenced     1.36T                            -
storage/database  volmode               default                          default
storage/database  filesystem_limit      none                             default
storage/database  snapshot_limit        none                             default
storage/database  filesystem_count      none                             default
storage/database  snapshot_count        none                             default
storage/database  snapdev               hidden                           default
storage/database  acltype               off                              default
storage/database  context               none                             default
storage/database  fscontext             none                             default
storage/database  defcontext            none                             default
storage/database  rootcontext           none                             default
storage/database  relatime              off                              default
storage/database  redundant_metadata    all                              default
storage/database  overlay               off                              default

Additionally, the zpool layout:

NAME                                              STATE     READ WRITE CKSUM
    storage                                           ONLINE       0     0     0
      mirror-0                                        ONLINE       0     0     0
        wwn-0x600304801947ff012215da8608cc28f2-part3  ONLINE       0     0     0
        wwn-0x600304801947ff012215da9409aede95        ONLINE       0     0     0
        wwn-0x600304801947ff012215da6e076390fd-part3  ONLINE       0     0     0

zpool settings:

zpool get all storage
NAME     PROPERTY                          VALUE                             SOURCE
storage  size                              912G                              -
storage  capacity                          69%                               -
storage  altroot                           -                                 default
storage  health                            ONLINE                            -
storage  guid                              10731124542107270489              -
storage  version                           -                                 default
storage  bootfs                            -                                 default
storage  delegation                        on                                default
storage  autoreplace                       off                               default
storage  cachefile                         -                                 default
storage  failmode                          wait                              default
storage  listsnapshots                     off                               default
storage  autoexpand                        off                               default
storage  dedupditto                        0                                 default
storage  dedupratio                        1.00x                             -
storage  free                              283G                              -
storage  allocated                         629G                              -
storage  readonly                          off                               -
storage  ashift                            12                                local
storage  comment                           -                                 default
storage  expandsize                        -                                 -
storage  freeing                           0                                 -
storage  fragmentation                     81%                               -
storage  leaked                            0                                 -
storage  multihost                         off                               default
storage  feature@async_destroy             enabled                           local
storage  feature@empty_bpobj               active                            local
storage  feature@lz4_compress              active                            local
storage  feature@multi_vdev_crash_dump     enabled                           local
storage  feature@spacemap_histogram        active                            local
storage  feature@enabled_txg               active                            local
storage  feature@hole_birth                active                            local
storage  feature@extensible_dataset        active                            local
storage  feature@embedded_data             active                            local
storage  feature@bookmarks                 enabled                           local
storage  feature@filesystem_limits         enabled                           local
storage  feature@large_blocks              enabled                           local
storage  feature@large_dnode               enabled                           local
storage  feature@sha512                    enabled                           local
storage  feature@skein                     enabled                           local
storage  feature@edonr                     enabled                           local
storage  feature@userobj_accounting        active                            local
storage  unsupported@com.datto:encryption  inactive                          local

The disks are Samsung 850 Pros behind an LSI SMC3108, no cache, no RAID devices defined, pure JBOD. Ashift has already been corrected to 12, since the 850 Pro seems to have a garbage collection issue when it's set to 9 (can't find the source now, but I can vouch that it produces checksum errors after a while).

rincebrain commented 6 years ago

What do you mean, it affects pool1 on system1?

You said you have a live Postgres server running there, doing reads and writes. What series of commands are you doing, exactly, when it reports problems, and what problems does it report?

Does the problem still go away if you destroy the snapshot you took and don't rebuild the indices?

Does this only happen if you do a zfs send after taking a snapshot, or is it merely taking the snapshot that's sufficient?

Does this still happen if you don't have zfs_compressed_arc_enabled=0? (I think you would need to actually reboot with that setting removed to be certain of the answer, since ZFS is not going to just drop the entire ARC contents if it allows you to change that at runtime.) (I doubt this is the problem, but I've seen very few people using that setting, so I would be less surprised than otherwise if some uncommon problem cropped up with it...)

thoro commented 6 years ago

I mean that the Postgres 10 server on system1/pool1 reports the data corruption errors posted above.

zfs snapshot storage/database@speedfix
zfs send -c storage/database@speedfix | mbuffer -s 128k -m 1G -O 10.0.99.11:1337

I ran that around midnight my time (Vienna; sadly the exact timing is hard to say, I really have to enable timestamps for the console). The first errors appeared at 2018-09-15 22:41:05.644 UTC, which is around 00:40 Vienna time. So, not immediately after; I was confused by the time zones (yeah). The exact error:

[2018-09-15 22:41:05.644 UTC|10.0.99.32(43983)]ERROR:  invalid page in block 155421 of relation base/16400/1724155
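
(Side note: something like this in ~/.bashrc would have spared me the time-zone archaeology; bash keeps timestamps for its history once HISTTIMEFORMAT is set:)

export HISTTIMEFORMAT='%F %T '   # 'history' then shows date and time per entry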

Probably it's important to note that some heavy processing was ongoing during the zfs send (normal operations during the night).

Deleting the snapshot did not change the corrupted files.

It's hard to say whether the zfs send has an impact, since it was running while I was sleeping. But it took some time for the corruption to appear, so possibly there really is a connection.

I will remove the setting, but since it's a live DB I'm not really keen on breaking it again after the backups are restored; and as I said, I couldn't reproduce it on a test system. Would an unload and reload of the zfs module apply the setting without a reboot?

rincebrain commented 6 years ago

Yes, unloading the modules and reloading them would drop that information.
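
Roughly this sequence (a sketch; it assumes a systemd-managed Postgres service name and that nothing else holds the pool open):

systemctl stop postgresql   # nothing may be using the pool
zpool export storage
modprobe -r zfs             # also drops the now-unused dependency modules
modprobe zfs
zpool import storage
systemctl start postgresql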

ZFS isn't reporting any errors at any point? What do zpool status and zpool get all say?

Did you try both the kernel you're running on the production system and the stock CentOS 7 kernel on your test system?

If you can't reproduce it on the test system, it might make sense to make the test system serve the data if you want to debug it further. ("We" would obviously prefer to figure out what's wrong, but you obviously would like it to Just Work.)

I can't seem to find any suggestion of this being a problem anyone's seen on ZFS before. What hardware is this running on, does it have a BMC you could ask whether there were any exciting hardware noises, does it have ECC RAM, does memtest pass if you run it for a while, etc.? (Postgres's docs on corruption suggest that it is most often caused by hardware faults, though I'm still pondering how that would arise without ZFS making unhappy noises.)

thoro commented 6 years ago

Okay, great, I can do that.

No errors whatsoever:

zpool status
  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 2h10m with 0 errors on Thu Jul 26 19:50:53 2018
config:

    NAME                                              STATE     READ WRITE CKSUM
    storage                                           ONLINE       0     0     0
      mirror-0                                        ONLINE       0     0     0
        wwn-0x600304801947ff012215da8608cc28f2-part3  ONLINE       0     0     0
        wwn-0x600304801947ff012215da9409aede95        ONLINE       0     0     0
        wwn-0x600304801947ff012215da6e076390fd-part3  ONLINE       0     0     0

errors: No known data errors
zpool get all storage
NAME     PROPERTY                          VALUE                             SOURCE
storage  size                              912G                              -
storage  capacity                          69%                               -
storage  altroot                           -                                 default
storage  health                            ONLINE                            -
storage  guid                              10731124542107270489              -
storage  version                           -                                 default
storage  bootfs                            -                                 default
storage  delegation                        on                                default
storage  autoreplace                       off                               default
storage  cachefile                         -                                 default
storage  failmode                          wait                              default
storage  listsnapshots                     off                               default
storage  autoexpand                        off                               default
storage  dedupditto                        0                                 default
storage  dedupratio                        1.00x                             -
storage  free                              282G                              -
storage  allocated                         630G                              -
storage  readonly                          off                               -
storage  ashift                            12                                local
storage  comment                           -                                 default
storage  expandsize                        -                                 -
storage  freeing                           0                                 -
storage  fragmentation                     81%                               -
storage  leaked                            0                                 -
storage  multihost                         off                               default
storage  feature@async_destroy             enabled                           local
storage  feature@empty_bpobj               active                            local
storage  feature@lz4_compress              active                            local
storage  feature@multi_vdev_crash_dump     enabled                           local
storage  feature@spacemap_histogram        active                            local
storage  feature@enabled_txg               active                            local
storage  feature@hole_birth                active                            local
storage  feature@extensible_dataset        active                            local
storage  feature@embedded_data             active                            local
storage  feature@bookmarks                 enabled                           local
storage  feature@filesystem_limits         enabled                           local
storage  feature@large_blocks              enabled                           local
storage  feature@large_dnode               enabled                           local
storage  feature@sha512                    enabled                           local
storage  feature@skein                     enabled                           local
storage  feature@edonr                     enabled                           local
storage  feature@userobj_accounting        active                            local
storage  unsupported@com.datto:encryption  inactive                          local

Btw, zfs history gave me the exact time of the snapshot (UTC):

2018-09-15.21:12:59 zfs snapshot storage/database@speedfix
2018-09-16.07:17:46 zfs destroy storage/database@speedfix

On the test system we only tried the exact same config: kernel 4.12.5 and the 0.7.0 HEAD snapshot. I first assumed the issue was that I had compiled from HEAD instead of a tag, but it has now also appeared with 0.7.8.

I would really love to provide more insight, but that will be impossible with the test system being in Austria and the DB being in Germany. I guess the only way to find out would be a fully reproducible test case :(

Hardware: Supermicro SYS-2028R, E5-2630v3, 128 GB DDR4-2133 ECC + 3x Samsung 850 Pro (I'm aware this is not really the perfect disk for this setup, but at the time it was the only thing in the price range). The system is from 2015.

I have two of those systems, but I can't remember whether the previous incident happened on the same one or the second one. For my own reference: this time it happened on sm02.

stale[bot] commented 4 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.