Closed jrmuizel closed 5 years ago
Thanks. I hadn't noticed that `stdsimd` had gained NEON support.
To my surprise, there are no longer distinct boolean vector types in `stdsimd`. They seem to be gone from WebAssembly SIMD, too! Yet, I don't see analogs for the `all()` and `any()` methods that the boolean vectors had in the `simd` crate.
I guess I should find out the rationale for the change and what the design intent is for replacements for `all()` and `any()`.
@hsivonen The current plan is to be as conservative as possible. I don't think the OP, "stdsimd is the replacement for simd," is accurate at present. My suspicion is that the next phase is probably going to be something like, "stable, but inconvenient portability story."
Portable SIMD got merged into `stdsimd`, so I'm going to see if encoding_rs can now be ported over.
This is happening on a branch for now.
Swapping out the dependency was pretty easy for the portable part. (I didn't switch the intrinsics over yet.) I added a `Simd` trait with the associated lane type on the `encoding_rs` side in order to declare the signature for shuffles.
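A minimal sketch of the shape such a trait could take (the names, the associated item, and the array stand-ins here are illustrative assumptions, not the actual encoding_rs code):

```rust
// Illustrative sketch of a helper trait with an associated lane type, so
// that shuffle helpers can be declared generically over vector types.
// `SimdVector`, `Lane`, and `LANES` are hypothetical names.
trait SimdVector {
    type Lane;
    const LANES: usize;
}

// A [u16; 8] array stands in for a u16x8 vector type here.
impl SimdVector for [u16; 8] {
    type Lane = u16;
    const LANES: usize = 8;
}

// Likewise, [u8; 16] stands in for u8x16.
impl SimdVector for [u8; 16] {
    type Lane = u8;
    const LANES: usize = 16;
}
```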
On x86_64, performance regressed for everything that depends heavily on `any()` and `all()`. From prior `stdsimd` source inspection, this was to be expected, since `stdsimd` computes unnecessary bits if the LLVM optimizer isn't smart enough not to compute them. I haven't looked at the disassembly yet, but it seems to me that either `stdsimd` needs to allow for architecture-specific manually-written `any()`/`all()`/`none()` or LLVM needs to get more complex pattern matching in the optimizer.
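For reference, here is a scalar model of the two reductions in question (a sketch, not the `stdsimd` implementation): each lane of a mask is either all-ones or all-zeros, and the architecture-specific version alluded to above would, on x86_64 with SSE2, lower each reduction to a single PMOVMSKB (movemask) plus one compare instead of a generic horizontal computation.

```rust
// Scalar model of the boolean reductions under discussion. A lane of a
// 16-lane byte mask is 0xFF (true) or 0x00 (false).
fn mask_any(lanes: &[u8; 16]) -> bool {
    // On x86_64 a hand-written version would be:
    //   _mm_movemask_epi8(mask) != 0
    lanes.iter().any(|&lane| lane == 0xFF)
}

fn mask_all(lanes: &[u8; 16]) -> bool {
    // On x86_64 a hand-written version would be:
    //   _mm_movemask_epi8(mask) == 0xFFFF
    lanes.iter().all(|&lane| lane == 0xFF)
}
```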
Yesterday, I failed to notice that in `stdsimd` comparing two `u16x8`s yields `b8x8` and there is no `b16x8`. Commented on the `stdsimd` issue.
`stdsimd` now has mask vectors which are like the `simd` crate's boolean vectors. The `stdsimd` branch of encoding_rs compiles again, and the x86_64 asm for `any()` and `all()` is now a bit better than on March 8, but it's still worse than what the `simd` crate produces.
Any news here?
No news yet. I was away for a bit and will need to catch up on the situation regarding `any()`/`all()` inlining with `stdsimd`, especially in the ARMv7+NEON case.
@hsivonen packed_simd fixed all of those issues a while ago, and provides many new features (shuffles with run-time indices, portable vector gathers/scatters, ...).
Let me know if that isn't the case; I think the `simd` crate can pretty much be removed. All its features are now available in `packed_simd`, all examples have been ported, etc.
@gnzlbg, can I now get fast boolean reductions on ARMv7+NEON without having the standard library compiled with NEON enabled? I.e. do I need to revive that PR and debug the Android Docker issue?
You can if you enable the `coresimd` feature of the `packed_simd` crate (it will use `coresimd` instead of `core::arch` for the intrinsics, and `coresimd` can be recompiled by user code without issues, working around having to recompile the `std` library).
Note, however, that this uses a particular released version of `coresimd` on crates.io, which might break at some point due to changes on nightly. If that ever happens, you can just report it in the stdsimd repo so that we do a new release, and in the meantime patch it to use the `stdsimd` git repo directly.
I just released version 0.2 of `packed_simd`; you should prefer that over the 0.1 version, which did not allow doing that (because it relied on `coresimd` as a git dependency, which had to be disabled for crates.io).
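A sketch of what the dependency declaration described above might look like (the feature name is from the comment; the exact version and manifest layout here are assumptions):

```toml
# Hypothetical Cargo.toml fragment: enable the coresimd feature so that
# packed_simd uses the coresimd crate instead of core::arch, avoiding a
# rebuild of the std library on targets like ARMv7+NEON.
[dependencies.packed_simd]
version = "0.2"
features = ["coresimd"]
```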
I.e. do I need to revive that PR and debug the Android Docker issue?
Yeah, you should (it's almost there, and if you have access to a Linux workstation, debugging locally shouldn't be too hard). This is a short-term workaround that will never work on stable.
I've ported encoding_rs to packed_simd on a branch.
At least on x86_64 (Haswell desktop i7), there are repeatable performance differences. Here's the output of per-benchmark best results from 4 `simd` runs vs. per-benchmark best results from 4 `packed_simd` runs, with the threshold of inclusion at a difference of 2 percentage points or larger:
```text
name simd.txt ns/iter packed.txt ns/iter diff ns/iter diff % speedup
bench_decode_to_string_tr 79,661 (3677 MB/s) 77,881 (3761 MB/s) -1,780 -2.23% x 1.02
bench_decode_to_utf16_en_windows_1252 68,420 (10391 MB/s) 66,839 (10637 MB/s) -1,581 -2.31% x 1.02
bench_decode_to_utf16_zh_cn_utf_16be 36,308 (15063 MB/s) 37,074 (14751 MB/s) 766 2.11% x 0.98
bench_mem_convert_latin1_to_str_1 3,642 (137 MB/s) 3,510 (142 MB/s) -132 -3.62% x 1.04
bench_mem_convert_latin1_to_utf8_1 1,823 (274 MB/s) 1,697 (294 MB/s) -126 -6.91% x 1.07
bench_mem_convert_latin1_to_utf8_3 2,705 (554 MB/s) 2,538 (591 MB/s) -167 -6.17% x 1.07
bench_mem_convert_str_to_utf16_ascii_1 1,730 (289 MB/s) 1,821 (274 MB/s) 91 5.26% x 0.95
bench_mem_convert_str_to_utf16_ascii_16 2,067 (3870 MB/s) 2,130 (3755 MB/s) 63 3.05% x 0.97
bench_mem_convert_utf16_to_latin1_lossy_1 846 (1182 MB/s) 989 (1011 MB/s) 143 16.90% x 0.86
bench_mem_convert_utf16_to_latin1_lossy_1000 15,605 (64082 MB/s) 16,904 (59157 MB/s) 1,299 8.32% x 0.92
bench_mem_convert_utf16_to_latin1_lossy_15 3,910 (3836 MB/s) 4,001 (3749 MB/s) 91 2.33% x 0.98
bench_mem_convert_utf16_to_latin1_lossy_16 1,823 (8776 MB/s) 1,949 (8209 MB/s) 126 6.91% x 0.94
bench_mem_convert_utf16_to_latin1_lossy_3 1,330 (2255 MB/s) 1,371 (2188 MB/s) 41 3.08% x 0.97
bench_mem_convert_utf16_to_str_ascii_16 4,858 (3293 MB/s) 4,671 (3425 MB/s) -187 -3.85% x 1.04
bench_mem_convert_utf16_to_utf8_ascii_1 2,928 (341 MB/s) 2,987 (334 MB/s) 59 2.02% x 0.98
bench_mem_convert_utf16_to_utf8_ascii_16 3,353 (4771 MB/s) 3,443 (4647 MB/s) 90 2.68% x 0.97
bench_mem_convert_utf8_to_latin1_lossy_30 6,051 (2478 MB/s) 6,189 (2423 MB/s) 138 2.28% x 0.98
bench_mem_convert_utf8_to_utf16_ascii_1 5,670 (88 MB/s) 5,791 (86 MB/s) 121 2.13% x 0.98
bench_mem_copy_ascii_to_ascii_1 923 (541 MB/s) 1,049 (476 MB/s) 126 13.65% x 0.88
bench_mem_copy_ascii_to_ascii_1000 16,087 (31080 MB/s) 20,463 (24434 MB/s) 4,376 27.20% x 0.79
bench_mem_copy_ascii_to_ascii_15 4,893 (1532 MB/s) 4,544 (1650 MB/s) -349 -7.13% x 1.08
bench_mem_copy_ascii_to_ascii_16 945 (8465 MB/s) 973 (8221 MB/s) 28 2.96% x 0.97
bench_mem_copy_ascii_to_ascii_3 1,254 (1196 MB/s) 1,286 (1166 MB/s) 32 2.55% x 0.98
bench_mem_copy_ascii_to_ascii_30 5,151 (2912 MB/s) 4,536 (3306 MB/s) -615 -11.94% x 1.14
bench_mem_copy_basic_latin_to_ascii_1 940 (1063 MB/s) 988 (1012 MB/s) 48 5.11% x 0.95
bench_mem_copy_basic_latin_to_ascii_16 1,340 (11940 MB/s) 1,407 (11371 MB/s) 67 5.00% x 0.95
bench_mem_copy_basic_latin_to_ascii_3 1,422 (2109 MB/s) 1,458 (2057 MB/s) 36 2.53% x 0.98
bench_mem_ensure_utf16_validity_ascii_16 2,962 (5401 MB/s) 3,090 (5177 MB/s) 128 4.32% x 0.96
bench_mem_ensure_utf16_validity_ascii_30 5,429 (5525 MB/s) 5,277 (5685 MB/s) -152 -2.80% x 1.03
bench_mem_ensure_utf16_validity_bmp_16 2,962 (5401 MB/s) 3,090 (5177 MB/s) 128 4.32% x 0.96
bench_mem_ensure_utf16_validity_bmp_30 5,426 (5528 MB/s) 5,276 (5686 MB/s) -150 -2.76% x 1.03
bench_mem_ensure_utf16_validity_latin1_16 2,962 (5401 MB/s) 2,838 (5637 MB/s) -124 -4.19% x 1.04
bench_mem_ensure_utf16_validity_latin1_30 5,406 (5549 MB/s) 5,277 (5685 MB/s) -129 -2.39% x 1.02
bench_mem_is_basic_latin_false_16 1,446 (11065 MB/s) 1,323 (12093 MB/s) -123 -8.51% x 1.09
bench_mem_is_basic_latin_false_30 1,810 (16574 MB/s) 1,934 (15511 MB/s) 124 6.85% x 0.94
bench_mem_is_basic_latin_true_1000 14,091 (70967 MB/s) 16,297 (61360 MB/s) 2,206 15.66% x 0.86
bench_mem_is_str_latin1_true_1 643 (777 MB/s) 709 (705 MB/s) 66 10.26% x 0.91
bench_mem_is_str_latin1_true_1000 28,798 (17362 MB/s) 21,069 (23731 MB/s) -7,729 -26.84% x 1.37
bench_mem_is_str_latin1_true_15 3,124 (2400 MB/s) 3,772 (1988 MB/s) 648 20.74% x 0.83
bench_mem_is_str_latin1_true_3 1,422 (1054 MB/s) 917 (1635 MB/s) -505 -35.51% x 1.55
bench_mem_is_str_latin1_true_30 3,623 (4140 MB/s) 3,550 (4225 MB/s) -73 -2.01% x 1.02
bench_mem_is_utf16_latin1_true_1000 13,932 (71777 MB/s) 14,512 (68908 MB/s) 580 4.16% x 0.96
bench_mem_is_utf8_latin1_false_1000 1,041 (480307 MB/s) 975 (512820 MB/s) -66 -6.34% x 1.07
bench_mem_is_utf8_latin1_false_16 1,041 (7684 MB/s) 974 (8213 MB/s) -67 -6.44% x 1.07
bench_mem_is_utf8_latin1_false_30 1,041 (14409 MB/s) 974 (15400 MB/s) -67 -6.44% x 1.07
bench_mem_is_utf8_latin1_true_15 4,761 (1575 MB/s) 5,403 (1388 MB/s) 642 13.48% x 0.88
bench_mem_is_utf8_latin1_true_16 4,488 (1782 MB/s) 4,890 (1635 MB/s) 402 8.96% x 0.92
bench_mem_is_utf8_latin1_true_3 915 (1639 MB/s) 1,806 (830 MB/s) 891 97.38% x 0.51
bench_mem_utf16_valid_up_to_ascii_3 2,644 (1134 MB/s) 2,579 (1163 MB/s) -65 -2.46% x 1.03
```
Many of the changes are with buffer sizes that are so short that SIMD code isn't used, which suggests flips in branch prediction heuristics or something like that.
The actually worrying cases are these:
```text
bench_mem_copy_ascii_to_ascii_1000 16,087 (31080 MB/s) 20,463 (24434 MB/s) 4,376 27.20% x 0.79
bench_mem_is_basic_latin_true_1000 14,091 (70967 MB/s) 16,297 (61360 MB/s) 2,206 15.66% x 0.86
bench_mem_is_utf16_latin1_true_1000 13,932 (71777 MB/s) 14,512 (68908 MB/s) 580 4.16% x 0.96
```
Oh, and those are at `opt_level=2`, because that's what Firefox builds with.
Which `packed_simd` features are you enabling? Which target? Which `RUSTFLAGS`?
Could you extract the code of those three benchmarks into its own crate?
Also, the worst benchmark appears to be `bench_mem_is_utf8_latin1_true_3`.
I've left a couple of comments on your branch; there is one suspicious function there that might be causing the regression. We should take a closer look at that, comparing the output of the `simd` and `packed_simd` crates for the version with and without `select` (without `select`, both crates should generate exactly the same code).
Which packed_simd features are you enabling? Which target? Which RUSTFLAGS?
The above numbers were with the `into_bits` feature enabled, the `x86_64-unknown-linux-gnu` target, and `-C opt_level=2` in `RUSTFLAGS`, on Ubuntu 18.04 on an Intel Core i7-4770 @ 3.40 GHz.
Could you extract the code of those three benchmarks into its own crate?
I'll try to minimize the test case.
Also the worst benchmark appears to be `bench_mem_is_utf8_latin1_true_3`
Indeed. That benchmark shouldn't even run SIMD code, because 3 < 16, which is why I speculated about prediction heuristics flipping or something.
I've left a couple of comments in your branch
Thanks!
there is one suspicious function there that might be causing the regression
The suspicious function affects only `x-user-defined` to UTF-16 decoding and is performance-neutral. (The change is based on previous advice from you. :-)
I'll try to minimize the test case.
Thank you! Feel free to open an issue in `packed_simd` and we'll look at it.
On aarch64 (ThunderX), the fluctuation of the numbers is a bit larger, so there are more changes in both directions with the 2-percentage-point threshold. I'm not too worried about that.
Instead of `bench_mem_is_utf8_latin1_N` with small N regressing, the closely related `bench_mem_is_str_latin1_true_N` with small N regresses.
(These are with the horizontal max intrinsics instead of the portable horizontal max operations.)
@hsivonen could you try again with the `packed_simd` master branch for both x86 and aarch64?
Maybe there is some inlining issue / target-feature issue popping up, since the `simd` crate uses `platform-intrinsics` and the `packed_simd` crate uses `core::arch`, so maybe `RUSTFLAGS` have something to do with this. As in, maybe the `simd` crate is using a target feature that it shouldn't be allowed to use for a particular CPU, and that just happens to work because the CPUs that you are trying do support the feature? You could also try with `RUSTFLAGS="-C target-cpu=native"` and see how they compare, and maybe then try nailing things down with `RUSTFLAGS="-C target-feature=+sse4.2"` and see what happens.
On `x86_64` the regressions go away when building with the `coresimd` feature and without `RUSTFLAGS`. Up next: figuring out which one of those makes the difference.
It seems that `opt_level=2` is the problem. Whether the `coresimd` feature flag is enabled affects which tests regress, though.
Sorry about the delay. After updating `packed_simd` (but staying on `rustc` from the turn of the month, because packaging for `clippy` blocks updates), I still see perf regressions at `opt_level=2` with `encoding_bench` (and now also a regression of `bench_mem_utf16_valid_up_to_ascii_1000` at `opt_level=3`). However, when trying to create a smaller test case for the `opt_level=2` regressions by extracting the code to a single-file crate, the regressions go away.
Still more to investigate.
Meanwhile: it appears that `cargo` is getting the capability to set a per-crate `opt_level`, so maybe this is going to be solved by building `encoding_rs` at `opt_level=3` in Firefox. OTOH, a build system peer considered `packed_simd`'s compatibility policy of supporting only the latest nightly a blocker, so it seems that migration from `simd` to `packed_simd` has to wait until `packed_simd` makes it into the standard library.
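For what it's worth, the per-crate optimization capability being discussed eventually took roughly this shape in Cargo (shown as a sketch; this syntax was not yet available at the time of this thread):

```toml
# Profile override: build encoding_rs at opt-level 3 even when the rest of
# the tree (e.g. a Firefox build) uses opt-level 2.
[profile.release.package.encoding_rs]
opt-level = 3
```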
OTOH, a build system peer considered packed_simd's compatibility policy with only the latest nightly as a blocker, so it seems that migration from simd to packed_simd has to wait until packed_simd makes it into the standard library.
I thought that was `simd`'s policy also? At least, in the past, IIRC, if a nightly came out that broke `simd`, we would fix it without necessarily keeping compatibility with older nightlies.
It may be that `packed_simd` makes this more of a problem in practice since it is being actively developed.
I thought that was simd's policy also?
It is the `simd` crate's policy, and the `core::arch::{arm, aarch64}` policy as well.
However, when trying to create a smaller test case for the `opt_level=2` regressions by extracting the code to a single-file crate, the regressions go away.
Specifically, the code using `simd` and the code using `packed_simd` compile to identical assembly except for label naming.
I thought that was `simd`'s policy also?
In practice, `simd` isn't being developed further, so it doesn't adopt new Rust features that would require a new compiler version, and, OTOH, `rustc` hasn't broken the relevant parts of `simd` in a long while.
What's the practical breakage situation with `packed_simd`?
What's the practical breakage situation with packed_simd?
The library cannot build on stable Rust, only on nightly. It is still being developed, with new features, bug fixes, and performance fixes being added every now and then. Sometimes these require patching rustc, and we tend to start using the features as soon as the next nightly is released. Until now, the only stakeholders that complain are the `rand` developers, but we tend to let them know when breaking changes happen and we help with the PR required to upgrade. This isn't too hard because `rand` unstable follows the latest nightly anyway.
If a bigger stakeholder (e.g. Servo / Firefox) with stricter "unstable stability requirements" starts depending on `packed_simd`, then we'll have to adapt our workflow around those requirements.
For example, we could add a CI build bot to test the nightly version used by stakeholders, and add any new features that would break that behind a feature gate. Once all stakeholders upgrade, then we could remove those feature gates.
The library cannot build on stable Rust, only on nightly.
Indeed, `packed_simd` doesn't build with 1.29.2 even with `RUSTC_BOOTSTRAP=1`.
Sometimes these require patching rustc, and we tend to then start using the features as soon as the next nightly is released
From a Firefox perspective, it seems to me that the main constraint is at the other end: for how long does a given snapshot of `packed_simd` stay compilable with future versions of Rust?
Firefox CI uses stable Rust and starts requiring a new Rust version two weeks after its release. `encoding_rs` and `simd` opt into nightly features on stable Rust using `RUSTC_BOOTSTRAP=1`. Additionally, Firefox developers may use a recent-ish nightly Rust locally.
As a result, it's impractical to use a crate whose Rust compatibility doesn't span the current stable with `RUSTC_BOOTSTRAP=1` and the current nightly.
My current theory is this:
If a SIMD-using function that's visible outside the crate has been annotated as `#[inline]` to enable cross-crate inlining, and if the benchmarking harness is in a different crate so that cross-crate inlining applies, the version using the `simd` crate gains some advantage that `packed_simd` does not. This advantage is not shown if any of the following are true:
`opt_level=3`
So for now, I consider this a case of invalid benchmarking (inlining resulting in a scenario that's unlikely to be representative of real-world use).
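To illustrate the cross-crate inlining point with a minimal example (the function here is a hypothetical stand-in, not encoding_rs code): without the attribute (or LTO), a non-generic `pub fn` is normally compiled only in its defining crate and a benchmark in another crate calls it through a regular call, whereas `#[inline]` makes it a candidate for inlining into downstream crates, which can change measured performance in ways that don't reflect real-world call sites.

```rust
// Hypothetical example of a public function made eligible for cross-crate
// inlining via #[inline]. A benchmark harness living in a different crate
// may inline this (or not), and how well the inlined copy optimizes can
// vary with opt-level.
#[inline]
pub fn contains_non_ascii(bytes: &[u8]) -> bool {
    bytes.iter().any(|&b| b >= 0x80)
}
```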
In other news, with the `into_bits` feature of `packed_simd` enabled, the `packed_simd` branch of encoding_rs now has no direct calls to `transmute`. Going forward, I'm hoping that the primitive slice's `align_to` and `chunks_exact`, together with `packed_simd`'s load/store-from-slice features, will allow a lot of `unsafe` code to be written as equally-performant safe code.
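As a sketch of that pattern (an ASCII check used as a stand-in, with `u64` standing in for a SIMD lane type; the function name is hypothetical): `align_to` splits a byte slice into an unaligned head, an aligned middle reinterpreted as wider units, and a tail, confining the `unsafe` to the reinterpretation itself.

```rust
// Sketch of the pattern described above: only the slice reinterpretation
// needs `unsafe`; the head and tail fall back to scalar code, and the
// aligned middle is checked a word at a time.
pub fn is_ascii_words(bytes: &[u8]) -> bool {
    // Safety: any byte pattern is a valid u64, so reinterpreting the
    // aligned middle of a byte slice as u64s is sound.
    let (head, middle, tail) = unsafe { bytes.align_to::<u64>() };
    head.iter().all(|&b| b < 0x80)
        // The mask tests the high bit of each of the eight bytes at once.
        && middle.iter().all(|&word| word & 0x8080_8080_8080_8080 == 0)
        && tail.iter().all(|&b| b < 0x80)
}
```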
For how long does a given snapshot of packed_simd stay compilable with future versions of Rust?
For as long as Rust does not introduce a change that breaks `packed_simd`. `packed_simd` is a Rust 2018 crate, so there have been a lot of these changes in the last couple of months as nightly features have been renamed, stabilized, dropped, etc.
Otherwise, there are periods of higher "rustc SIMD activity", where newer SIMD features are implemented over a couple of weeks, typically followed by a couple of months of reaping the results.
Trying to build `encoding_rs` with `packed_simd` 0.3.1 for the `thumbv7neon` Android target looks like this:
```text
$ cargo build --target thumbv7neon-linux-androideabi --release --features simd-accel
   Compiling encoding_rs v0.8.14 (/opt/Projects/encoding_rs)
   Compiling packed_simd v0.3.1
error[E0573]: expected type, found local variable `a`
  --> src/simd_funcs.rs:315:45
   |
315 | let first: u8x16::from_bits(a);
   |                             ^ not a type
error[E0573]: expected type, found local variable `b`
  --> src/simd_funcs.rs:316:46
   |
316 | let second: u8x16::from_bits(b);
   |                              ^ not a type
error: parenthesized parameters may only be used with a trait
  --> src/simd_funcs.rs:315:44
   |
315 | let first: u8x16::from_bits(a);
   |                            ^^^
   |
   = note: #[deny(parenthesized_params_in_types_and_modules)] on by default
   = warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
   = note: for more information, see issue #42238 <https://github.com/rust-lang/rust/issues/42238>
error: parenthesized parameters may only be used with a trait
  --> src/simd_funcs.rs:316:45
   |
316 | let second: u8x16::from_bits(b);
   |                             ^^^
   |
   = warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
   = note: for more information, see issue #42238 <https://github.com/rust-lang/rust/issues/42238>
error[E0223]: ambiguous associated type
  --> src/simd_funcs.rs:315:28
   |
315 | let first: u8x16::from_bits(a);
   |            ^^^^^^^^^^^^^^^^^^^ help: use fully-qualified syntax: `<packed_simd::Simd<[u8; 16]> as Trait>::from_bits`
error[E0223]: ambiguous associated type
  --> src/simd_funcs.rs:316:29
   |
316 | let second: u8x16::from_bits(b);
   |             ^^^^^^^^^^^^^^^^^^^ help: use fully-qualified syntax: `<packed_simd::Simd<[u8; 16]> as Trait>::from_bits`
error: aborting due to 6 previous errors
Some errors occurred: E0223, E0573.
For more information about an error, try `rustc --explain E0223`.
error: Could not compile `encoding_rs`.
To learn more, run the command again with --verbose.
```
Compiling for x86_64 works OK. Any ideas why the Android target fails?
The failing function is:
```rust
#[inline(always)]
pub fn simd_pack(a: u16x8, b: u16x8) -> u8x16 {
    unsafe {
        let first: u8x16::from_bits(a);
        let second: u8x16::from_bits(b);
        shuffle!(
            first,
            second,
            [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
        )
    }
}
```
Which previously compiled on aarch64. I'm going to try again on that ISA.
Which previously compiled on aarch64. I'm going to try again on that ISA.
I think you want `let first = u8x16::from_bits(a);` instead of `let first: u8x16::from_bits(a);` 🤣 I really do hope that this code did not use to compile before 😛
Thanks. I really thought I had compiled this code on aarch64 previously, but I guess not.
FWIW the error message could have been a lot clearer, you might want to fill in an issue in rust-lang/rust upstream about it.
OK. With this problem fixed, it compiles for `thumbv7neon`, but compiling for `aarch64` fails, because the compiler tries to make a 1 GB memory allocation when compiling `packed_simd` 0.3.1.
Is this a previously known problem?
```text
memory allocation of 1073741824 bytes failed
error: Could not compile `packed_simd`.

Caused by:
  process didn't exit successfully: `rustc --edition=2018 --crate-name packed_simd /home/hsivonen/.cargo/registry/src/github.com-1ecc6299db9ec823/packed_simd-0.3.1/src/lib.rs --color always --crate-type lib --emit=dep-info,link -C debuginfo=2 --cfg 'feature="default"' --cfg 'feature="into_bits"' -C metadata=2b4eed31d1b1027b -C extra-filename=-2b4eed31d1b1027b --out-dir /home/hsivonen/Projects/encoding_rs/target/debug/deps -L dependency=/home/hsivonen/Projects/encoding_rs/target/debug/deps --extern cfg_if=/home/hsivonen/Projects/encoding_rs/target/debug/deps/libcfg_if-e6de5e9a7e5221ad.rlib --cap-lints allow` (signal: 6, SIGABRT: process abort signal)
```
Is this a previously known problem?
I haven't seen that one before; which target are you using? `aarch64-unknown-linux-android`? It does not surprise me much, since rustc consumes a lot of memory when compiling `packed_simd`, but compilation passes on targets without memory overcommit (like Windows) on constrained platforms (like AppVeyor), so 1 GB does look like a lot here.
I haven't seen that one before, which target are you using?
Self-hosted `aarch64-unknown-linux-gnu`.
Hmm. Trying to compile on an x86_64 host, both for aarch64 and for x86_64, shows that rustc's memory usage climbs to a bit over 2 GB when compiling `packed_simd` 0.3.1.
I'm pretty sure that 0.3.0 with an earlier rustc didn't have this problem, but given the earlier comments today, I no longer trust myself to have actually built this code on aarch64 previously.
I'll see if I can buy more RAM for my aarch64 VM, since it's annoying to run `cargo bench` with cross-compilation.
Sadly, it is not easy to build only a "subset" of `packed_simd` (say, just 128-bit wide vector support).
I've been thinking about making `packed_simd` "lazy", such that code is monomorphized when a user uses a type instead of always at build time of `packed_simd`, but that's a trade-off.
Sadly it is not easy to only build a "subset" of `packed_simd` (say just 128-bit wide vector support).
Maybe I should try commenting out code to get some results.
I'll see if I can buy more RAM for my aarch64 VM
The provider represents their aarch64 capacity as "out of stock", so can't get more RAM today.
Maybe I should try commenting out code to get some results.
You can try commenting out certain APIs, e.g., in `src/api.rs`, but commenting out vector widths won't easily work (some features, like vectors of pointers, couple vectors of different lengths).
The build-time RAM requirement can be addressed with swap space plus waiting for the resulting IO, so it seems that building `packed_simd` at present doesn't impose a hard requirement on physical RAM.
FWIW the error message could have been a lot clearer, you might want to fill in an issue in rust-lang/rust upstream about it.
In case ripgrep users find their way here:
This Firefox bug has "Depends on" and "See also" fields that point to relevant bugs/issues that are blocking migration at present.
`thumbv7neon-unknown-linux-gnueabihf` is now available, but I'm still seeing perf regressions on ARMv7 compared to the `simd` crate. I'll try to run the benchmarks again to be sure. This was with the crates.io release 0.3.1. I'll try with a git clone.
So far, I don't see a simple characterization of the regression other than that it seems to generally affect things that I expect to correlate with horizontal boolean reduction and does not seem to really affect workloads that are mostly about shuffles.
correlate with horizontal boolean reduction
From which types (integers, floats, etc.) are the masks created?
Unrelated update, some stdsimd refactorings have landed in nightly, and packed_simd should start to build again properly soon.