Closed AGI-chandler closed 1 year ago
Maybe a compiler version mismatch? Between the version the kernel was compiled with and the one used to compile the modules?
@AllKind don't think so...
$ cat /proc/version
Linux version 4.19.0-5-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08)
$ gcc --version
gcc (Debian 8.3.0-6) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
...
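If you want to check the compiler-mismatch theory mechanically rather than by eye, here is a minimal sketch; `builder_gcc` is my own helper (not part of any tool), and it just scrapes the version string out of `/proc/version`:

```shell
# Sketch: compare the gcc that built the running kernel (per /proc/version)
# with the gcc currently installed. builder_gcc is a hypothetical helper.
builder_gcc() {
    # "... (gcc version 8.3.0 (Debian 8.3.0-6)) ..." -> "8.3.0"
    printf '%s\n' "$1" | sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

built_with=$(builder_gcc "$(cat /proc/version)")
have=$(gcc -dumpfullversion 2>/dev/null || gcc -dumpversion 2>/dev/null)

if [ "$built_with" = "$have" ]; then
    echo "gcc versions match: $have"
else
    echo "mismatch: kernel built with gcc ${built_with:-?}, installed gcc is ${have:-?}"
fi
```

In the transcript above both report 8.3.0, so this check passes, which is consistent with the mismatch theory not panning out.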
I see a couple of people posting that reinstalling their kernel headers packages and rebuilding fixed this for them. You could try that.
@rincebrain Ok thanks... I wiped out everything including the zfs-2.1.7 source tree. Since this kernel is so old, had to define Debian snapshot repositories (very cool I might add and thank you Debian for keeping old code). Then, I reinstalled the headers, followed by the OP commands. The results are the same, though, unfortunately.
# modprobe -v zfs
insmod /lib/modules/4.19.0-5-amd64/extra/zfs/spl/spl.ko
modprobe: ERROR: could not insert 'zfs': Exec format error
# insmod /lib/modules/4.19.0-5-amd64/extra/zfs/spl/spl.ko
insmod: ERROR: could not insert module /lib/modules/4.19.0-5-amd64/extra/zfs/spl/spl.ko: Invalid module format
# tail /var/log/syslog
[...]
<datetime> <hostname> kernel: [<uptime>] module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 1, loc 000000002cf3eefb, val ffffffffc093a1f0
<datetime> <hostname> kernel: [<uptime>] module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 1, loc 0000000007ce8a9e, val ffffffffc09e61f0
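For anyone else hitting "Invalid module format": the classic cause is a vermagic mismatch, so a cheap first check is to compare the module's vermagic against the running kernel. A sketch (the module path is the one from the transcript above; `vermagic_of` is my own wrapper, and the relocation errors in this thread suggest something other than vermagic, so treat this as a diagnostic, not a fix):

```shell
# Sketch: "Invalid module format" is classically a vermagic mismatch.
# Compare what the module was built against with the running kernel.
vermagic_of() {
    # || true so a missing/unreadable module just yields empty output
    modinfo -F vermagic "$1" 2>/dev/null || true
}

mod=/lib/modules/$(uname -r)/extra/zfs/spl/spl.ko
echo "module : $(vermagic_of "$mod")"
echo "kernel : $(uname -r)"
```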
Any particular reason you're running an ancient snapshot kernel? I don't have any particular reason to think a newer one would help here, just wondering since you mentioned reaching into snapshot.debian.org to get it, and buster is up to 4.19.0-23...
@rincebrain Not really, that was just the last time I did a full upgrade. I could even go to 5.10 from buster-backports, too. If a new kernel or reboot wouldn't help then I can keep enjoying my uptime.
18:56:40 up 738 days, 4:12, 1 user, load average: 0.05, 0.09, 0.04
Frankly, I don't even understand what the error means in the first place. I'll keep doing research or try going back to the old ZFS version. Maybe I should post about it in the kernel community to see if anyone there has insights?
I can't say it would or wouldn't.
I've personally come to dislike the notion of long monolithic server uptimes as a feature, for similar reasons to why people dislike having special snowflake servers with no reproducibility in their setup - the longer it's been since you checked that booting works, the more likely it is something broke, and if it breaks, you're going to be hard-pressed to reproduce every magical thing about how it worked before. But YMMV.
Can't say I've ever seen that problem. If it booted fine 738 days ago, it'll boot fine again, but that may be because I've practically changed nothing. I could maybe see how installing updates upon updates and never rebooting might mess something up, but that seems unlikely as well, since the last working kernel is always saved as a fallback in such cases... Either way, it's not like I force these sorts of situations, but it is nice to be reminded of how impeccable our electricity, hardware, and software are at providing our valuable services. ☺️
Yes, if you don't update, it's unlikely to fail on reboot. My remark was on installing updates without rebooting.
Welp, nobody seems to have a clue what causes these modprobe errors. I started to get the same errors with the QAT modules too, so I also asked Intel about it, as well as the linux-modules mailing list (you'd think someone there would know). So that's pretty funny. It's probably for the better. This hardware deserved an upgrade after over 2 years of constantly running stale stanky code. So I gave it a new BIOS firmware, new BMC firmware, new bootloaders, new partition tables, a new kernel, upgraded the operating system, upgraded all the software packages, and reinstalled the QAT drivers. Now it's humming along with its QuickAssist engines, waiting for work! But ZFS is still a PITA! 😂
Where do I begin? Well, with the latest 2.1.8 release, I'm running into this nearly 3-year-old issue where `make deb` is making RPMs, sneaky Red Hats... actually that OP was lucky, they were getting RPMs AND debs, but not me. `configure` finds `alien`, but then there's no mention of it afterwards during `make deb`. No big deal, I just run `alien -cd` on the RPMs I want to convert and install them myself. Even after installing `zfs-dkms`, the modules were not inserted. Then I remembered `zfs.sh` and tried it, only to be met with another ancient issue: `FATAL: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'perf_trace_buf_alloc'`. OK, that was enough of that for now, so I `uninstall`ed and `clean`ed up and `apt remove --purge`d the debs I installed.
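For anyone landing here from search, the `alien -cd` step can be scripted; a sketch, assuming the `alien` package is installed (`convert_rpms` is my own wrapper, and `-cd` is shorthand for `--to-deb --scripts`):

```shell
# Sketch: convert the RPMs that `make deb` left behind into .debs.
# alien -cd == alien --to-deb --scripts; requires the alien package.
convert_rpms() {
    for rpm in ./*.rpm; do
        [ -e "$rpm" ] || continue       # no RPMs in this directory: do nothing
        alien --to-deb --scripts "$rpm"
    done
}

convert_rpms
# dpkg -i ./*.deb                       # then install the converted packages
```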
Finally, I thought I'd give my other QAT+Debian bud @ioguix's idea a try: `export ICP_ROOT=/usr/local/src/QAT.L.4.20.0-00001 && apt install -t bullseye-backports libnvpair3linux libuutil3linux libzfs4linux libzpool5linux zfs-dkms zfs-zed zfsutils-linux`. This really wanted to work, but it didn't. The `zfs-dkms` package actually got the modules inserted, but still complained:
znvpair: module license 'CDDL' taints kernel.
Disabling lock debugging due to kernel taint
ZFS: Loaded module v2.1.7-1~bpo11+1, ZFS pool version 5000, ZFS filesystem version 5
Somehow they figured out a way to force-insert it... but that was a bad idea, because now the QAT is complaining when I tried `zpool import`:

In the end, zpool is just hung now. Who knows what's happening to my filesystem. It seems like I'd have to file like 6 or 7 different issues from all this. There is just not enough time in the day.
Oh yes, I'm on to something. It all depends on which freaking message one decides to pursue...
I decided to look into `Cannot use PF with IOMMU enabled`, coming from the QAT `c6xx` firmware. I've heard of IOMMU before and kind of thought it was a basic requirement for a computer to work, but the message implies it can be disabled. And what is PF? That took a while of searching around, but I finally found it refers to Physical Function, which sounds rather critical.
After learning more than I probably ever need to know about IOMMU, I found that it was related to Intel Virtualization Technology, which I swear I had disabled in the BIOS, since we don't use any virtualization on this computer. However, after double-checking, it was enabled! After disabling that and making sure the system still booted, I also found the kernel command-line option `iommu=off` can be added, so I added that too and rebooted.
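For reference, the persistent way to add that option on Debian is via GRUB's defaults file; a sketch (adjust to whatever options are already on the line, since the `quiet` shown here is just the Debian default):

```shell
# /etc/default/grub -- persist the kernel command-line change
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=off"

# then regenerate the boot configuration and reboot:
#   update-grub && reboot
```

There is also an Intel-specific `intel_iommu=off` parameter; the generic `iommu=off` is what was used above.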
Then I reconfigured and rebuilt the QAT drivers and once again tried `export ICP_ROOT=/usr/local/src/QAT.L.4.20.0-00001 && apt install -t bullseye-backports libnvpair3linux libuutil3linux libzfs4linux libzpool5linux zfs-dkms zfs-zed zfsutils-linux`. The engines seem to be happy now, actually! `zpool import -a` worked and `zpool upgrade` worked. The `zfs/qat` `cksum_requests` and `cksum_total_in_bytes` counters have registered some activity! Now I've started a backup task; it's still reading through the data, and the `cksum` counters continue going up:
cksum_requests 4 173624
cksum_total_in_bytes 4 803618816
Just going to wait until the task finds some new data to back up and write to the pool, and make sure the compression counters go up, too... hmm, still waiting, it's a lot of data to read through! Got to go AFK now, but so far everything looks good!
Update 1/26: Yes, looks like it's all running finally!
# cat /proc/spl/kstat/zfs/qat
40 1 0x01 17 4624 33823350553531 60993584827354
name type data
comp_requests 4 581423
comp_total_in_bytes 4 76208275456
comp_total_out_bytes 4 1646911968
decomp_requests 4 3
decomp_total_in_bytes 4 86016
decomp_total_out_bytes 4 393216
dc_fails 4 0
encrypt_requests 4 0
encrypt_total_in_bytes 4 0
encrypt_total_out_bytes 4 0
decrypt_requests 4 0
decrypt_total_in_bytes 4 0
decrypt_total_out_bytes 4 0
crypt_fails 4 0
cksum_requests 4 27500
cksum_total_in_bytes 4 205094912
cksum_fails 4 0
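The table above is the usual kstat "name type data" layout, so individual counters are easy to pull out if you want to watch offload activity; a sketch (`kstat_value` is my own helper, not part of ZFS):

```shell
# Sketch: extract one counter from a kstat-style "name type data" table
# such as /proc/spl/kstat/zfs/qat.
kstat_value() {
    awk -v n="$1" '$1 == n { print $3 }'
}

# usage on a live system, e.g. watch checksum offload once a second:
#   kstat_value cksum_requests < /proc/spl/kstat/zfs/qat
#   watch -n1 "grep cksum /proc/spl/kstat/zfs/qat"
```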
@AGI-chandler In the future, when you stumble on errors like "FATAL: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'perf_trace_buf_alloc'" or similar issues having to do with Linux developers changing more and more of the EXPORT_SYMBOL instances to EXPORT_SYMBOL_GPL, do yourself a big favor and simply change CDDL to GPL before building. For more info see: https://github.com/openzfs/zfs/issues/14555 https://github.com/openzfs/zfs/issues/11357
You have every right to do that; also, Linux developers shouldn't be upset or bothered by it, since only closed-source proprietary drivers would have an issue with slapping GPL on their module. CDDL is copyleft just like GPL, and since you are not "distributing", there are zero legal issues.
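A minimal sketch of that workaround for local builds, assuming the top-level `META` file in the openzfs source tree is what carries the `License: CDDL` line (`relicense` is my own hypothetical helper; re-run `configure` and the build afterwards):

```shell
# Sketch: relabel the module license before a local build so GPL-only
# symbols resolve. Point relicense at the openzfs top-level META file;
# a .bak copy of the original is kept alongside it.
relicense() {
    sed -i.bak 's/^License:[[:space:]]*CDDL.*/License: GPL/' "$1"
}

# relicense ./META && grep '^License:' ./META
```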
Thanks @jittygitty I'll definitely remember that because you're right I couldn't care less what the license is, I will make the code work for us. I'm pretty sure our educational use of the code is further protected by "fair use" provisions as well.
In the end, though, regarding the original error this issue was opened for: no one seems to actually know the cause or the solution, other than shutting down the system and rebooting. I even asked the Linux kernel modules mailing list; I thought for sure someone there might know, but nope 😂
I think the computer was just tired of running stinky old code, so I gave it a bunch of new code everywhere: BIOS firmware, Linux kernel, OS, all the software including ZFS. And I didn't need to manually compile ZFS anymore, since the Debian zfs-dkms package can automatically pick up the QAT drivers and add that functionality to the modules. Yes, it's been humming along for a short 23 days now and has already processed nearly 42 TB of written data on behalf of ZFS!
Anyway, I'm pretty sure most of the errors in this issue have been addressed, so I guess I can close this.
System information
Describe the problem you're observing
Thought I'd try my luck with upgrading, and it's not in my favor. I've been at it for several hours now and I can't find anyone else out there with these same error messages and environment. After exporting the pool, unloading the old modules, uninstalling the old zfs, downloading the new zfs, configuring, building and installing the debs, I now cannot load the modules. For example:
and the syslog shows:
<datetime> <hostname> kernel: [<uptime>] module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 1, loc 000000002cf3eefb, val ffffffffc093a1f0
Describe how to reproduce the problem
# ./autogen.sh
# ./configure --enable-systemd
# make -j16 deb-utils deb-kmod
# dpkg -i *.deb
# modprobe zfs
modprobe: ERROR: could not insert 'zfs': Exec format error
#
Include any warning/errors/backtraces from the system logs
# tail /var/log/syslog
[...]
<datetime> <hostname> kernel: [<uptime>] module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 1, loc 000000002cf3eefb, val ffffffffc093a1f0
#