Open EliahKagan opened 1 month ago
Thanks for the detailed report and for providing enough details to reproduce the issue.
To me it seems that libz-ng
might be the place to actually get a solution for the issue if it gets prioritized and is indeed the same.
This crate probably doesn't have the means to affect it, except for changing the version of the libz-ng
source code.
Lastly, gitoxide
probably has a fix already by using a different zlib implementation, at least on the platforms that are likely to fail, if known in advance.
To me it seems that
libz-ng
might be the place to actually get a solution for the issue if it gets prioritized and is indeed the same.
Yes. Assuming this is https://github.com/zlib-ng/zlib-ng/issues/1705, which I believe to be the case, there is a patch that has been reported to work, though as far as I know it has not yet been offered as a PR. This is as detailed in https://github.com/zlib-ng/zlib-ng/issues/1705#issuecomment-2177455106 and preceding comments.
This crate probably doesn't have the means to affect it, except for changing the version of the
libz-ng
source code.
When building this crate, is there a way to pass configuration variables to cmake
for libz-ng
, as if by passing -D ...
to cmake
? As noted in https://github.com/zlib-ng/zlib-ng/issues/1705#issuecomment-2177453721, -D WITH_RVV=OFF
can be passed when cross-compiling zlib-ng
for RISC-V to prevent vector instructions requiring RVV support from being emitted, and this should be effective as a workaround for incorrect detection when not cross-compiling.
This would provide a way to make this crate work while neither changing the source code of libz-ng
or using a different zlib implementation. It would also allow me to confirm to an even greater degree of certainty that this issue really is a straightforward case of https://github.com/zlib-ng/zlib-ng/issues/1705 and nothing more.
If this is not feasible, or if it is feasible but not easy, then should a feature be added here for it, or something? (But maybe this really is quite easy and I am just not aware of it.)
Lastly,
gitoxide
probably has a fix already by using a different zlib implementation, at least on the platforms that are likely to fail, if known in advance.
Achieving the effect of -D WITH_RVV=OFF
, if there is way to do it when cmake
is run during a build of this crate, should likewise allow gitoxide
to work. Furthermore, even once https://github.com/zlib-ng/zlib-ng/issues/1705, this would be helpful for actual cross compilation, such as if we make binary releases for RISC-V.
When building this crate, is there a way to pass configuration variables to
cmake
forlibz-ng
, as if by passing-D ...
tocmake
? As noted in zlib-ng/zlib-ng#1705 (comment),-D WITH_RVV=OFF
can be passed when cross-compilingzlib-ng
for RISC-V to prevent vector instructions requiring RVV support from being emitted, and this should be effective as a workaround for incorrect detection when not cross-compiling.
A great idea, this is absolutely possible!
If this is not feasible, or if it is feasible but not easy, then should a feature be added here for it, or something? (But maybe this really is quite easy and I am just not aware of it.)
The build.rs
script for zlib-ng
should allow to detect that case and conditionally pass a flag to the cmake invocation. It should be easy enough if there is the right test-system available.
Achieving the effect of
-D WITH_RVV=OFF
, if there is way to do it whencmake
is run during a build of this crate, should likewise allowgitoxide
to work. Furthermore, even once zlib-ng/zlib-ng#1705, this would be helpful for actual cross compilation, such as if we make binary releases for RISC-V.
Now I have hopes that this can be fixed here, and if you say it will remain useful even if zlib-ng
ships their fix, that's even better.
Thanks again for all your great work!
The
build.rs
script forzlib-ng
should allow to detect that case and conditionally pass a flag to the cmake invocation. It should be easy enough if there is the right test-system available.
When compiling binaries to be distributed or otherwise to be run on another system, one may want to insist that RVV instructions be emitted, or insist that they not be emitted. Especially since an upstream fix should eventually take care of the auto-detection case, I think it would be best for a change here to allow it to be manually overridden.
This would not necessarily preclude implementing auto-detection here as well. But the appraisal of "easy enough" might be an underestimate, considering that the difficulty and subtleties involved in checking this are the cause of the upstream bug. (Detection is attempted, but the result is not always correct.) Furthermore, manually overriding it is really the capability that would remain useful here even after the upstream bug is fixed.
Since arbitrary logic, including logic specific to building particular architectures, could go in builds.rs
or modules in zng
that it uses, the manual override could be by an environment variable or other external mechanism. But I wonder if doing it by feature would still be better.
Is it acceptable if I add rvv-off
and rvv-on
features, or something like that?
(Auto-detection done here could then be by an rvv-auto
feature, if later added. This would remain distinct from when no rvv-*
feature is enabled, which would use the upstream detection.)
Thanks for the update.
Is it acceptable if I add
rvv-off
andrvv-on
features, or something like that?
Yes, I'd also do it with a cargo feature, which would also have to be additive. Of course it's possible to add logic to allow --all-features
builds that don't clash, but ideally this could be so minimal that it fixes the most anticipated usecase.
I've noticed that this project introduces the nonstandard configuration name zng
, for which warnings are sometimes issued. (To be clear, that doesn't show an effective test run, it's just an example of the warnings. The warnings can also be seen on CI.) That is not specific to RISC-V.
Why was this done for that, rather than using a feature? Does it mean I should consider introducing other such nonstandard configuration names for overriding RVV detection, rather than using features? That seems like a wrong thing to do, but since I am not clear on why it was done for zng
, I am not certain. Or should it not have been done for zng
either, and should that be changed?
Of course it's possible to add logic to allow
--all-features
builds that don't clash, but ideally this could be so minimal that it fixes the most anticipated usecase.
Do you mean that the added features should be minimal and thus not try to support --all-features
on RISC-V, or that they should try to support --all-features
even on RISC-V where it would feel contradictory if the features are rvv-off
and rvv-on
but that this should be done in a minimal way?
I think that, outside of architectures where it makes a difference, the new features could be no-ops and supplying them together could be permitted, while prohibiting it on RISC-V. Whether that's the best way to do it, I am not sure.
Why was this done for that, rather than using a feature? Does it mean I should consider introducing other such nonstandard configuration names for overriding RVV detection, rather than using features? That seems like a wrong thing to do, but since I am not clear on why it was done for
zng
, I am not certain. Or should it not have been done forzng
either, and should that be changed?
I'd hope Git has information on this, as I myself joined late enough to not know anything on how things came to be, unfortunately.
Do you mean that the added features should be minimal and thus not try to support
--all-features
on RISC-V, or that they should try to support--all-features
even on RISC-V where it would feel contradictory if the features arervv-off
andrvv-on
but that this should be done in a minimal way?
All cargo-features should be additive so that --all-features
will work and have predictable results. In practice, that's not always possible, but by using just a single RVV-related flag it should naturally be additive.
I also think it would do nothing outside of its applicable platform.
Also, I'd approach this as a band-aid to fix one specific problem, and not try to exhaustively solve every conceivable use-case, to keep it simple.
I'd hope Git has information on this, as I myself joined late enough to not know anything on how things came to be, unfortunately.
I'll look into it and let you know what I find.
[...] Also, I'd approach this as a band-aid to fix one specific problem, and not try to exhaustively solve every conceivable use-case, to keep it simple.
I'll make just a feature to force RVV to be turned off, since that's the specific problem right now.
I've opened #218 to add the rvv-off
feature discussed above, though further changes may be needed, for the reasons presented there.
This strongly resembles #148 (https://github.com/Byron/gitoxide/issues/955) but affects riscv64 rather than x86_64, seems to happen every time rather than sporadically, and may be a case of a known zlib-ng bug, https://github.com/zlib-ng/zlib-ng/issues/1705.
What happens
Both
dev
andrelease
builds are affected. On a 64-bit RISC-V machine running Ubuntu 24.04 LTS, runninggix clone
begins the download but is always terminated withSIGILL
(as indicated by the message text and as can be inferred from the exit status):Passing
--trace
does not help, since it doesn't get far enough to print the traced output, and since the program is immediately being terminated by the system, rather than hitting a panic and being able to unwind. The prompt printed afterwards is interleaved with the displayed progress due to the sudden way the program was terminated. (The displayed exit status, here 132, is part of the prompt.)Getting a backtrace using
gdb
Running it in a debugger and printing a backtrace after the crash provides far more information:
This shows that the problem happens in
zlib-ng
and more specifically inlibz-ng-sys-1.1.15/src/zlib-ng/arch/riscv/adler32_rvv.c
, which through thezlib-ng
repository resolves through the git submodule to this file.Likely explanation
This looks a lot like a case of https://github.com/zlib-ng/zlib-ng/issues/1705. My kernel is old enough to trigger that bug:
Furthermore, that issue mentions the problem happening in starship, and I wonder if starship might even be triggering the crash in the same way through one of its
gix-*
dependencies.Perhaps this issue is a victim of its own success and should be closed, to become (and be referenced by) a new comment on https://github.com/zlib-ng/zlib-ng/issues/1705. But I figured I'd start by opening this, in case more is known or can be discerned here, in case it might somehow help with #148, and in case there's anything to be done about it anywhere else (such as gitoxide).
Simplified reproduction
To check if the problem is occurring, and to check that it happens in
ein
as well asgix
, thegix status
andein t h
commands can be used: