Not really sure what will help here so I'll add what I can think of. First of all, not using dedup or any such silliness.
Some messages from dbgmsg (many of which repeat heavily):
1532140180 dnode.c:1314:dnode_hold_impl(): error 2
1532140180 dnode.c:576:dnode_allocate(): os=ffffa04afee16800 obj=345529 txg=360 blocksize=131072 ibs=17 dn_slots=1
1532140179 dnode_sync.c:61:dnode_increase_indirection(): os=ffffa04afee16800 obj=344691, increase to 2
1532140188 dbuf.c:2269:dbuf_findbp(): error 2
These appear once a zfs recv is started:
1532140545 dsl_prop.c:147:dsl_prop_get_dd(): error 2
1532140545 zap_micro.c:628:mzap_upgrade(): upgrading obj=1544 with 7 chunks
1532140545 zap_micro.c:640:mzap_upgrade(): adding org.zfsonlinux:userobj_accounting=0
1532140545 zap_leaf.c:441:zap_leaf_lookup(): error 2
1532140545 zap_micro.c:640:mzap_upgrade(): adding com.delphix:resume_fromguid=3794124285519673276
1532140545 zap_leaf.c:441:zap_leaf_lookup(): error 2
1532140545 zap_micro.c:640:mzap_upgrade(): adding com.delphix:resume_toguid=2494501697666804117
1532140545 dbuf.c:2269:dbuf_findbp(): error 2
1532140545 zap_leaf.c:441:zap_leaf_lookup(): error 2
1532140545 spa_history.c:317:spa_history_log_sync(): txg 425 receive REDACTED/REDACTED/%recv (id 1544)
1532140545 zap.c:766:fzap_checksize(): error 22
This issue continues to plague us and could cost a whole lot of, well... my job.
Continuing to seek advice and help. Happy to provide ANY required additional data.
At this point, out of 11 datasets on the Pool, only two (one 131TB and the other 11TB) cause a hard panic and reboot on the receiving side while not affecting the sending side.
I tried sending to a stream file, transferring that file, checksumming it to verify the transfer, and receiving it. The only difference is that it crashes sooner.
Attempted downgrade to 0.7.4. Tried dkms instead of kmod.
zstreamdump did not seem to have any issue with processing the transferred zfs stream file.
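For context, that workflow was roughly the following (a sketch; the pool, dataset, snapshot, and file names are placeholders, not the actual ones used here):

```sh
# On the sender: write the send stream to a file and checksum it.
zfs send sourcepool/dataset@snap > /scratch/dataset.zstream
sha256sum /scratch/dataset.zstream

# After copying the file to the receiver: re-verify the checksum, walk the
# stream with zstreamdump (reads from stdin, applies nothing), then receive it.
sha256sum /scratch/dataset.zstream
zstreamdump < /scratch/dataset.zstream > /dev/null
zfs receive Primary/dataset < /scratch/dataset.zstream
```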
I turned on zfs_dbgmsg_enable with as many debug flags as possible, and I see these:
zfs_ioctl.c:1388:put_nvlist(): error 12
zfs_ioctl.c:1601:zfs_ioc_pool_configs(): error 17
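For reference, this is roughly how the debug tracing was enabled, using the standard ZFS on Linux module parameters (the zfs_flags value is illustrative, not the exact bitmask used here):

```sh
# Enable the in-kernel ZFS debug message buffer and set the debug flag bitmask
# (zfs_flags is a bitmask of ZFS_DEBUG_* bits).
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 1 > /sys/module/zfs/parameters/zfs_flags
# Read what has accumulated so far.
cat /proc/spl/kstat/zfs/dbgmsg
```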
The last reported txg was
txg.c:509:txg_sync_thread(): waiting; tx_synced=135774 waiting=135772 dp=ffff93f46ddb6800
I'm a little at the end of my rope here.
One strange behavior I see... the drives were used for another pool before (that also crashed on this process) and I rebuilt the pool. Now when I do a 'zpool import -d /dev/disk/by-id Primary' it still sees remnants of the old pool and makes me have to type in the id to import the correct pool. If I don't use '-d', it imports the proper one with no problem.
Yikes, that's nasty.
What do `zpool get all`/`zfs get all` say on the (source and destination) pool/datasets involved, respectively?
What's the hardware/software configuration of the source and destination of these sends?
You could try rolling a kernel+modules with kASAN enabled to try and tell you what corrupted the in-memory data structure, though that is only a single level of indirection resolved (e.g. it tells you where the wild memory write happened, but not, on its own, how the pointer math ended up eating paste).
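A rough sketch of what that could look like with a Debian-style kernel package build (the build flow here is an assumption, not something from this report):

```sh
# In the kernel source tree: enable KASAN (generic mode) under
# "Kernel hacking" -> "Memory Debugging", then build installable packages.
make menuconfig                      # set CONFIG_KASAN=y
make -j"$(nproc)" bindeb-pkg         # or your distro's usual kernel build target
# Install the new kernel, reboot, and let dkms rebuild the spl/zfs modules against it.
```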
> One strange behavior I see... the drives were used for another pool before (that also crashed on this process) and I rebuilt the pool. Now when I do a 'zpool import -d /dev/disk/by-id Primary' it still sees remnants of the old pool and makes me have to type in the id to import the correct pool. If I don't use '-d', it imports the proper one with no problem.
Possibly the partition table is a bit different now. You could dd if=/dev/zero the parts of the drive that are not used by the partition holding the new pool, to get rid of stale uberblocks from the old pool.
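A minimal sketch of that approach, assuming the old pool's labels sit on a stale partition the current pool does not use (the device name is a placeholder; double-check against `zpool status`/`lsblk` before writing zeros anywhere):

```sh
# First, confirm which partitions the *current* pool actually uses.
zpool status -P Primary
lsblk -o NAME,SIZE,TYPE,PARTLABEL

# Destructive: zero a stale partition that the current pool does NOT use, to
# wipe the old pool's labels/uberblocks. The device name below is a placeholder;
# triple-check it before running this.
dd if=/dev/zero of=/dev/disk/by-id/EXAMPLE-OLD-PART bs=1M status=progress
```

`zpool labelclear -f <device>` on the stale partition is another way to drop old ZFS labels, if it applies here.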
See https://github.com/zfsonlinux/zfs/issues/7656 on why import -d behaves differently.
Ah! Replies! I think I had given up on help. Thank you.
I'm remote at the moment and unable to get the details requested. However, as soon as I'm back in the country I'll get you the details you're asking for.
And thank you GregorKopka. That is both interesting and troubling, I'll have to figure out the specific methodology for cleaning up the dead pool (heh) information without killing my perfectly healthy pool..
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I ran into something similar, maybe the same issue (I'll gather kernel panic logs next time I have access to the machine in question).
Type | Version/Name |
---|---|
Distribution Name | Debian |
Distribution Version | 10.4 (stable) |
Linux Kernel | Debian 4.19.132-1 (2020-07-24) |
Architecture | x86_64 |
ZFS Version | 0.7.12-2+deb10u2 |
SPL Version | 0.7.12-2+deb10u1 |
`zfs recv` seemingly received the whole snapshot (judging from the network transfer statistics), but at the very end the kernel panics. This recurred with various snapshots of one remote dataset (unfortunately an unwieldy multi-TB one), but a small one (a few hundred KB) is received without issues. I made sure to receive the snapshot into a non-existent dataset.
I suspect my HBA of being too impatient with my HDDs, but I don't have a replacement ATM.
I'll give the buster-backports version of zfs-dkms (0.8.4-2~bpo10+1) a run.
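For reference, pulling that in from buster-backports looks roughly like this (assuming contrib is enabled; the package names are the standard Debian ones):

```sh
# Add buster-backports (zfs lives in contrib) and install the newer DKMS build.
echo 'deb http://deb.debian.org/debian buster-backports main contrib' \
    >> /etc/apt/sources.list.d/backports.list
apt update
apt install -t buster-backports zfs-dkms zfsutils-linux
```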
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Another defect where I have to stop the stale bot.
While I can't say for certain, the issue described here is consistent with what's described in https://github.com/openzfs/zfs/issues/7735#issuecomment-416628385. The root cause of those issues was resolved in the 0.7.12 tag. There was a report here of a similar issue with 0.7.12-2+deb10u2, but it would be helpful to know whether this has ever been reproduced with something more current.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
### System information

### Describe the problem you're observing
Receiving a zfs stream randomly leads to a kernel panic and forced reboot. This has happened with the same pool (and the same JBOD chassis) on a totally different head unit (with the same installed software). The pool has been destroyed and recreated, which did not help. The same head unit with a different pool attached (and receiving from a different pool's "zfs send") has succeeded. Two totally different types of HBAs have been tried as well. The sending machine is unaffected. No networking issues have been detected. vmcore-dump included below.
### Describe how to reproduce the problem
Set up a "zfs recv" and wait. Some attempts succeed, while other attempts, even on the same datasets, lead to a kernel panic.
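A minimal sketch of the setup (hostnames, pool, and dataset names are placeholders):

```sh
# On the sending host: stream a snapshot to the receiving host over ssh.
zfs snapshot sourcepool/data@repro
zfs send sourcepool/data@repro | ssh receiver zfs receive Primary/data
# Some runs complete normally; others panic the receiving kernel mid-stream
# or near the end of the receive.
```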
### Include any warning/errors/backtraces from the system logs