Do the snapshotted files get corrupted, or the live files?
Live files.
New update:
I started the zfs send immediately after taking the snapshot mentioned before; our DB showed immediate corruption errors in indices and in one table. We recreated the indices to at least keep operating (there are enough backups etc.).
After the zfs send was canceled (the connection to the server was interrupted), the affected table went back to normal without a corruption error (I had a query with which I could consistently reproduce the error). Maybe Postgres did something, or maybe ZFS; I'm not sure.
BTW modprobe parameters:
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
options zfs zfs_vdev_sync_write_min_active=64
options zfs zfs_vdev_sync_write_max_active=128
options zfs zfs_vdev_sync_read_min_active=64
options zfs zfs_vdev_sync_read_max_active=128
options zfs zfs_vdev_async_read_min_active=64
options zfs zfs_vdev_async_read_max_active=128
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=5
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=64
options zfs zfs_prefetch_disable=1
options zfs zfs_compressed_arc_enabled=0
Maybe there's some old option which shouldn't be used anymore?
@Thoro To be clear here, are you saying that the running PostgreSQL instance on system A, whose database was snapshotted, began to experience corruption when you began a zfs send, and that things went back to normal when you cancelled the send?
I am sure that you have an issue, but I want to be absolutely certain that the corruption was not detected by a PostgreSQL instance on the receiving side. That scenario would indicate a regression in PostgreSQL that breaks ACID compliance.
Presently, I cannot imagine a scenario in which a zfs send should affect the active filesystem. There might be some extremely subtle nuance in our interaction with the page cache, but it is too early to tell. I notice that you are not using the 3.10 kernel that CentOS normally uses. Would you be able to try switching back to that and seeing whether the issues persist? CentOS' 3.10 kernel is far better tested and, having my hands full with #7423, I cannot promise to have time to do much here in the short term.
You are running an older version of ZoL HEAD. Is there any specific reason for that? If not and your pool is not using any newer features than what are supported in 0.7.x, you might want to switch to 0.7.8. I have been sufficiently out of the loop that I am not familiar enough with the development of HEAD over the past several months to know if there is anything like this that has been fixed, but having you on either 0.7.8 or a more recent HEAD would be helpful to the ZFSOnLinux team in eliminating variables that we need to consider.
The same goes for CentOS' 3.10 kernel versus 4.12.5. If things are fine on 3.10, but have issues on 4.12.5, we would want to look at what is different between the two kernels in the code paths that PostgreSQL uses. Also, I want to be clear that we do support 4.12.5, but knowing whether 3.10 is affected would help reduce the amount of developer time needed on this and enable us to figure out what is going wrong sooner.
Also, since I have my hands full with #7423, other project members will likely handle this one. This is just some initial triage to help them take care of this faster.
@ryao First off, it's not an issue at all for our system; we don't rely on the snapshotting feature. But in case it turns out to be a bug, we should figure out why it's there! I will try to create a test system and see if I can reproduce it.
I'm not sure if zfs snapshot or zfs send had any effect, but I noticed that our logs were showing the corruptions a few minutes after the snapshot was taken. The snapshot was taken on the Postgres system showing the corruption errors. The target of the zfs send was a different server (sending via mbuffer); that one was idling.
To answer the other questions:
Regarding the kernel: I will set up the test system and see if I can reproduce this; hopefully pgbench results in the same load pattern.
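For the load generation I'm thinking of something along these lines (just a sketch; the scale factor, client count and duration are placeholders, not tuned to match our production load):
# initialize a pgbench database at scale factor 1000 (roughly 15 GB of data)
pgbench -i -s 1000 testdb
# hammer it with 64 concurrent clients for an hour while snapshots are taken
pgbench -c 64 -j 16 -T 3600 testdb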
We couldn't reproduce it on a test system, but yesterday I switched the databases again and tried to restore the second system from a zfs snapshot made immediately afterwards.
Seconds after the snapshot was made, Postgres indices were corrupted with the message: "SQLSTATE[XX001]: Data corrupted: 7 ERROR: invalid page in block 36451 of relation base/16400/112770"
At first I thought it was just the indices, but data files were also affected. I would also revise my last statement about the zfs send: something else must have interfered back then. Now I would say it's purely the snapshot command; after stopping the zfs send and deleting the snapshot, nothing was fixed.
This time I was using a released version, and not a HEAD snapshot.
filename: /lib/modules/4.12.5/extra/zfs/zfs/zfs.ko
version: 0.7.8-1
license: CDDL
author: OpenZFS on Linux
description: ZFS
srcversion: 6DFACEA9ADD42F919B32C68
depends: spl,znvpair,zcommon,zunicode,zavl,icp
vermagic: 4.12.5 SMP mod_unload modversions
It's not clear to me what you're trying to describe here.
If I understand correctly, you're saying that
According to Postgres's own documentation on backing up the filesystem:
The database server must be shut down in order to get a usable backup.
It also observes:
An alternative file-system backup approach is to make a “consistent snapshot” of the data directory, if the file system supports that functionality (and you are willing to trust that it is implemented correctly). The typical procedure is to make a “frozen snapshot” of the volume containing the database, then copy the whole data directory (not just parts, see above) from the snapshot to a backup device, then release the frozen snapshot. This will work even while the database server is running. However, a backup created in this way saves the database files in a state as if the database server was not properly shut down; therefore, when you start the database server on the backed-up data, it will think the previous server instance crashed and will replay the WAL log. This is not a problem; just be aware of it (and be sure to include the WAL files in your backup). If your database is spread across multiple file systems, there might not be any way to obtain exactly-simultaneous frozen snapshots of all the volumes. For example, if your data files and WAL log are on different disks, or if tablespaces are on different file systems, it might not be possible to use snapshot backup because the snapshots must be simultaneous. Read your file system documentation very carefully before trusting the consistent-snapshot technique in such situations.
You haven't said anything about the layout of the filesystems or the Postgres data locations, so I can't begin to guess which of these applies, but since our documentation for performance says you should put the WAL and data on different datasets (for perfectly good reasons), you would need to use snapshot -r to get all the filesystems in the pool that you're trying to back up to snapshot at precisely the same time, and the same caveats about Postgres treating it like a live system whose power you pulled still apply.
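For illustration, a recursive snapshot over a hypothetical layout with separate WAL and data datasets would look something like this (dataset and snapshot names are placeholders, not your actual layout):
# snapshots every dataset under tank/postgres atomically, in the same txg
zfs snapshot -r tank/postgres@backup-20180916
# release them again once the copy to the backup device is finished
zfs destroy -r tank/postgres@backup-20180916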
I'm aware of all of this, but the data corruption affects system1 and pool1, so all suggestions regarding Postgres backups etc. are irrelevant; the file system shouldn't corrupt itself when taking a snapshot, that's the main staple of ZFS...
system2 and pool2 also have these data errors in the zfs received filesystem.
The structure is as easy as it gets:
Single directory, single file system, 16K Block size, lz4 compression
Here are all the settings; maybe you see something obviously wrong:
storage/database type filesystem -
storage/database creation Thu Apr 19 12:22 2018 -
storage/database used 629G -
storage/database available 254G -
storage/database referenced 629G -
storage/database compressratio 2.22x -
storage/database mounted yes -
storage/database quota none default
storage/database reservation none default
storage/database recordsize 16K inherited from storage
storage/database mountpoint /data/fast/creamdata-db04/data/ local
storage/database sharenfs off default
storage/database checksum on default
storage/database compression lz4 inherited from storage
storage/database atime off inherited from storage
storage/database devices on default
storage/database exec on default
storage/database setuid on default
storage/database readonly off default
storage/database zoned off default
storage/database snapdir hidden default
storage/database aclinherit restricted default
storage/database createtxg 1159021 -
storage/database canmount on default
storage/database xattr sa inherited from storage
storage/database copies 1 default
storage/database version 5 -
storage/database utf8only off -
storage/database normalization none -
storage/database casesensitivity sensitive -
storage/database vscan off default
storage/database nbmand off default
storage/database sharesmb off default
storage/database refquota none default
storage/database refreservation none default
storage/database guid 8214081110063784152 -
storage/database primarycache all default
storage/database secondarycache all default
storage/database usedbysnapshots 0B -
storage/database usedbydataset 629G -
storage/database usedbychildren 0B -
storage/database usedbyrefreservation 0B -
storage/database logbias latency default
storage/database dedup off default
storage/database mlslabel none default
storage/database sync standard default
storage/database dnodesize legacy default
storage/database refcompressratio 2.22x -
storage/database written 629G -
storage/database logicalused 1.36T -
storage/database logicalreferenced 1.36T -
storage/database volmode default default
storage/database filesystem_limit none default
storage/database snapshot_limit none default
storage/database filesystem_count none default
storage/database snapshot_count none default
storage/database snapdev hidden default
storage/database acltype off default
storage/database context none default
storage/database fscontext none default
storage/database defcontext none default
storage/database rootcontext none default
storage/database relatime off default
storage/database redundant_metadata all default
storage/database overlay off default
Additionally, the zpool layout:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x600304801947ff012215da8608cc28f2-part3 ONLINE 0 0 0
wwn-0x600304801947ff012215da9409aede95 ONLINE 0 0 0
wwn-0x600304801947ff012215da6e076390fd-part3 ONLINE 0 0 0
Settings zpool:
zpool get all storage
NAME PROPERTY VALUE SOURCE
storage size 912G -
storage capacity 69% -
storage altroot - default
storage health ONLINE -
storage guid 10731124542107270489 -
storage version - default
storage bootfs - default
storage delegation on default
storage autoreplace off default
storage cachefile - default
storage failmode wait default
storage listsnapshots off default
storage autoexpand off default
storage dedupditto 0 default
storage dedupratio 1.00x -
storage free 283G -
storage allocated 629G -
storage readonly off -
storage ashift 12 local
storage comment - default
storage expandsize - -
storage freeing 0 -
storage fragmentation 81% -
storage leaked 0 -
storage multihost off default
storage feature@async_destroy enabled local
storage feature@empty_bpobj active local
storage feature@lz4_compress active local
storage feature@multi_vdev_crash_dump enabled local
storage feature@spacemap_histogram active local
storage feature@enabled_txg active local
storage feature@hole_birth active local
storage feature@extensible_dataset active local
storage feature@embedded_data active local
storage feature@bookmarks enabled local
storage feature@filesystem_limits enabled local
storage feature@large_blocks enabled local
storage feature@large_dnode enabled local
storage feature@sha512 enabled local
storage feature@skein enabled local
storage feature@edonr enabled local
storage feature@userobj_accounting active local
storage unsupported@com.datto:encryption inactive local
Disks are Samsung 850 Pro, behind an LSI SMC3108, no cache, no RAID devices defined, pure JBOD. Ashift is already corrected to 12, since the 850 Pro seems to have a garbage collection issue when it's set to 9 (I can't find the source now, but I can vouch that it produces checksum errors after a while).
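For reference, the ashift has to be set when the vdev is created and cannot be changed afterwards; what I mean is something like this (device names here are just placeholders, not my actual wwn IDs):
# force 4K sectors at pool creation time
zpool create -o ashift=12 storage mirror wwn-0xEXAMPLE0001 wwn-0xEXAMPLE0002
# verify the setting
zpool get ashift storage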
What do you mean, it affects pool1 on system1?
You said you have a live Postgres server running there, doing reads and writes. What series of commands are you doing, exactly, when it reports problems, and what problems does it report?
Does the problem still go away if you destroy the snapshot you took and don't rebuild the indices?
Does this only happen if you do a zfs send after taking a snapshot, or is it merely taking the snapshot that's sufficient?
Does this still happen if you don't have zfs_compressed_arc_enabled=0 ? (I think you would need to actually reboot with that setting change removed to be certain of the answer to this question, since it's not going to just drop the entire ARC contents if it allows you to change that at runtime.) (I doubt this is the problem, but there are very few cases I've seen with people using that setting, so I would be less surprised than otherwise if there was some uncommon problem that crops up with it...)
I mean, the Postgres 10 server on system1 pool1 reports data corruption errors posted above.
zfs snapshot storage/database@speedfix
zfs send -c storage/database@speedfix | mbuffer -s 128k -m 1G -O 10.0.99.11:1337
I did that around midnight my time (Vienna; sadly the exact timing is hard to say, I really have to enable timestamps for the console). The first errors appeared at 2018-09-15 22:41:05.644 UTC, around 00:40 Vienna time. Okay, so not immediately after; I was confused by the time zones. The exact error:
[2018-09-15 22:41:05.644 UTC|10.0.99.32(43983)]ERROR: invalid page in block 155421 of relation base/16400/1724155
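For reference, a relfilenode like 1724155 can be mapped back to a table or index name with something like this (a sketch; the database name is a placeholder, and the 16400 in the path has to be the OID of the database you connect to):
# 0 means the default tablespace
psql -d mydb -c "SELECT pg_filenode_relation(0, 1724155);"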
It's probably important to note that during the zfs send some heavy processing was ongoing (normal night-time operations).
Deleting the snapshot did not change the corrupted files.
It's hard to say whether the zfs send had an impact, since I was sleeping while it was running. But it took some time for the corruption to appear, so possibly there really is a connection.
I will remove the setting, but since it's a live DB I'm not really keen on breaking it again after the backups are restored; and as I said, I couldn't reproduce it with a test system. Would an unload and reload of the zfs module update the setting without a reboot?
Yes, unloading the modules and reloading them would drop that information.
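Roughly, the sequence would be something like this (a sketch only; it assumes you can afford to stop the database and export the pool for a moment, and the service name is a placeholder):
systemctl stop postgresql   # or however the DB is managed
zpool export storage
modprobe -r zfs             # also drops the now-unused spl/zcommon/... dependencies
modprobe zfs
zpool import storage
systemctl start postgresql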
ZFS isn't reporting any errors at any point? What do zpool status and zpool get all say?
Did you try both the kernel you're running on the production system and the stock CentOS 7 kernel on your test system?
If you can't reproduce it on the test system, it might make sense to make the test system serve the data if you want to debug it further. ("We" would obviously prefer to figure out what's wrong, but you obviously would like it to Just Work.)
I can't seem to find any suggestion of this being a problem anyone's seen on ZFS before - what's the hardware this is running on, does it have a BMC you could ask if there were any exciting hardware noises from, does it have ECC RAM, if you run memtest for a while does it pass, etc? (Postgres's docs on corruption are suggesting that most often it's caused by hardware faults, though I'm still pondering how that would arise without ZFS making unhappy noises.)
Okay - great, can do that.
No errors whatsoever:
zpool status
pool: storage
state: ONLINE
scan: scrub repaired 0B in 2h10m with 0 errors on Thu Jul 26 19:50:53 2018
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x600304801947ff012215da8608cc28f2-part3 ONLINE 0 0 0
wwn-0x600304801947ff012215da9409aede95 ONLINE 0 0 0
wwn-0x600304801947ff012215da6e076390fd-part3 ONLINE 0 0 0
errors: No known data errors
zpool get all storage
NAME PROPERTY VALUE SOURCE
storage size 912G -
storage capacity 69% -
storage altroot - default
storage health ONLINE -
storage guid 10731124542107270489 -
storage version - default
storage bootfs - default
storage delegation on default
storage autoreplace off default
storage cachefile - default
storage failmode wait default
storage listsnapshots off default
storage autoexpand off default
storage dedupditto 0 default
storage dedupratio 1.00x -
storage free 282G -
storage allocated 630G -
storage readonly off -
storage ashift 12 local
storage comment - default
storage expandsize - -
storage freeing 0 -
storage fragmentation 81% -
storage leaked 0 -
storage multihost off default
storage feature@async_destroy enabled local
storage feature@empty_bpobj active local
storage feature@lz4_compress active local
storage feature@multi_vdev_crash_dump enabled local
storage feature@spacemap_histogram active local
storage feature@enabled_txg active local
storage feature@hole_birth active local
storage feature@extensible_dataset active local
storage feature@embedded_data active local
storage feature@bookmarks enabled local
storage feature@filesystem_limits enabled local
storage feature@large_blocks enabled local
storage feature@large_dnode enabled local
storage feature@sha512 enabled local
storage feature@skein enabled local
storage feature@edonr enabled local
storage feature@userobj_accounting active local
storage unsupported@com.datto:encryption inactive local
BTW, zfs history gave me the exact time of the snapshot (UTC):
2018-09-15.21:12:59 zfs snapshot storage/database@speedfix
2018-09-16.07:17:46 zfs destroy storage/database@speedfix
On the test system we only tried the exact same config, kernel 4.12.5 and a 0.7.0 HEAD snapshot. I first assumed the issue was that I compiled from HEAD instead of a tag, but it has now also appeared with 0.7.8.
I would really love to provide more insight, but that will be impossible with the test system being in Austria and the DB being in Germany. I guess the only way to find out would be a fully reproducible test case :(
Hardware: Supermicro SYS-2028R, E5-2630v3, 128 GB DDR4-2133 ECC + 3x Samsung 850 Pro (I'm aware that this is not really a perfect disk for that setup, but at the time it was the only thing in the price range). The system is from 2015.
I have two of those systems, but I can't remember if the previous issue was on the same one or the second one. For my own reference, it happened on: sm02
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
When creating a snapshot with zfs snapshot while a high-IOPS Postgres 10.2 instance is writing data to the file system, files get corrupted.
I'm not sure whether this is because of the non-release version number; if a relevant commit is known, please reference it.
Describe how to reproduce the problem
I can't easily say how to reproduce it, but it has already happened at least 3 times whenever a zfs snapshot was taken.
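The closest I can offer is a sketch of what I would try on a test system (pgbench parameters and names are guesses, not a verified reproducer):
# sustained write load in the background
pgbench -c 64 -j 16 -T 7200 testdb &
# take a snapshot every five minutes while the load runs
for i in $(seq 1 20); do
    zfs snapshot storage/database@repro-$i
    sleep 300
done
# afterwards, read every table of the live database to surface invalid pages
pg_dumpall > /dev/null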
Include any warning/errors/backtraces from the system logs