Closed jrmuizel closed 5 years ago
Thanks. I hadn't noticed that `stdsimd` had gained NEON support.
To my surprise, there are no longer distinct boolean vector types in `stdsimd`. They seem to be gone from WebAssembly SIMD, too! Yet, I don't see analogs for the `all()` and `any()` methods that the boolean vectors had in the `simd` crate.
I guess I should find out the rationale for the change and what the design intent is for replacements for `all()` and `any()`.
@hsivonen The current plan is to be as conservative as possible. I don't think the OP, "stdsimd is the replacement for simd," is accurate at present. My suspicion is that the next phase is probably going to be something like, "stable, but inconvenient portability story."
Portable SIMD got merged into `stdsimd`, so I'm going to see if encoding_rs can now be ported over.
This is happening on a branch for now.
Swapping out the dependency was pretty easy for the portable part. (I didn't switch the intrinsics over yet.) I added a `Simd` trait with the associated lane type on the `encoding_rs` side in order to declare the signature for shuffles.
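A minimal sketch of the shape such a trait could take (the names, the associated item, and the array stand-ins here are illustrative assumptions, not the actual encoding_rs code):

```rust
// Illustrative sketch of a helper trait with an associated lane type, so
// that shuffle helpers can be declared generically over vector types.
// `SimdVector`, `Lane`, and `LANES` are hypothetical names.
trait SimdVector {
    type Lane;
    const LANES: usize;
}

// A [u16; 8] array stands in for a u16x8 vector type here.
impl SimdVector for [u16; 8] {
    type Lane = u16;
    const LANES: usize = 8;
}

// Likewise, [u8; 16] stands in for u8x16.
impl SimdVector for [u8; 16] {
    type Lane = u8;
    const LANES: usize = 16;
}
```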
On x86_64, performance regressed for everything that depends heavily on `any()` and `all()`. From prior `stdsimd` source inspection, this was to be expected, since `stdsimd` computes unnecessary bits if the LLVM optimizer isn't smart enough not to compute them. I haven't looked at the disassembly yet, but it seems to me that either `stdsimd` needs to allow for architecture-specific manually-written `any()`/`all()`/`none()` or LLVM needs to get more complex pattern matching in the optimizer.
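For reference, here is a scalar model of the two reductions in question (a sketch, not the `stdsimd` implementation): each lane of a mask is either all-ones or all-zeros, and the architecture-specific version alluded to above would, on x86_64 with SSE2, lower each reduction to a single PMOVMSKB (movemask) plus one compare instead of a generic horizontal computation.

```rust
// Scalar model of the boolean reductions under discussion. A lane of a
// 16-lane byte mask is 0xFF (true) or 0x00 (false).
fn mask_any(lanes: &[u8; 16]) -> bool {
    // On x86_64 a hand-written version would be:
    //   _mm_movemask_epi8(mask) != 0
    lanes.iter().any(|&lane| lane == 0xFF)
}

fn mask_all(lanes: &[u8; 16]) -> bool {
    // On x86_64 a hand-written version would be:
    //   _mm_movemask_epi8(mask) == 0xFFFF
    lanes.iter().all(|&lane| lane == 0xFF)
}
```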
Yesterday, I failed to notice that in `stdsimd` comparing two `u16x8`s yields `b8x8` and there is no `b16x8`. Commented on the `stdsimd` issue.
`stdsimd` now has mask vectors which are like the `simd` crate's boolean vectors. The `stdsimd` branch of encoding_rs compiles again, and the x86_64 asm for `any()` and `all()` is now a bit better than on March 8, but it's still worse than what the `simd` crate produces.
Any news here?
No news yet. I was away for a bit and will need to catch up on the situation regarding `any()`/`all()` inlining with `stdsimd`, especially in the ARMv7+NEON case.
@hsivonen packed_simd fixed all of those issues a while ago, and provides many new features (shuffles with run-time indices, portable vector gathers/scatters, ...).
Let me know if that isn't the case; I think the `simd` crate can pretty much be removed. All its features are now available in `packed_simd`, all examples have been ported, etc.
@gnzlbg, can I now get fast boolean reductions on ARMv7+NEON without having the standard library compiled with NEON enabled? I.e. do I need to revive that PR and debug the Android Docker issue?
You can if you enable the `coresimd` feature of the `packed_simd` crate (it will use `coresimd` instead of `core::arch` for the intrinsics, and `coresimd` can be recompiled by user code without issues, working around having to recompile the `std` library).
Note, however, that this uses a particular released version of `coresimd` on crates.io, which might break at some point due to changes on nightly. If that ever happens, you can just report it in the stdsimd repo so that we do a new release, and in the meantime patch it to use the `stdsimd` git repo directly.
I just released version 0.2 of `packed_simd`; you should prefer that over the 0.1 version, which did not allow doing that (because it relied on `coresimd` as a git dependency, which had to be disabled for crates.io).
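A sketch of what the dependency declaration described above might look like (the feature name is from the comment; the exact version and manifest layout here are assumptions):

```toml
# Hypothetical Cargo.toml fragment: enable the coresimd feature so that
# packed_simd uses the coresimd crate instead of core::arch, avoiding a
# rebuild of the std library on targets like ARMv7+NEON.
[dependencies.packed_simd]
version = "0.2"
features = ["coresimd"]
```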
I.e. do I need to revive that PR and debug the Android Docker issue?
Yeah, you should (it's almost there, and if you have access to a Linux workstation, debugging locally shouldn't be too hard). This is a short-term workaround that will never work on stable.
I've ported encoding_rs to packed_simd on a branch.
At least on x86_64 (Haswell desktop i7), there are repeatable performance differences. Here's the output of per-benchmark best results from 4 `simd` runs vs. per-benchmark best results from 4 `packed_simd` runs, with the threshold of inclusion at a difference of 2 percentage points or larger:
```text
name simd.txt ns/iter packed.txt ns/iter diff ns/iter diff % speedup
bench_decode_to_string_tr 79,661 (3677 MB/s) 77,881 (3761 MB/s) -1,780 -2.23% x 1.02
bench_decode_to_utf16_en_windows_1252 68,420 (10391 MB/s) 66,839 (10637 MB/s) -1,581 -2.31% x 1.02
bench_decode_to_utf16_zh_cn_utf_16be 36,308 (15063 MB/s) 37,074 (14751 MB/s) 766 2.11% x 0.98
bench_mem_convert_latin1_to_str_1 3,642 (137 MB/s) 3,510 (142 MB/s) -132 -3.62% x 1.04
bench_mem_convert_latin1_to_utf8_1 1,823 (274 MB/s) 1,697 (294 MB/s) -126 -6.91% x 1.07
bench_mem_convert_latin1_to_utf8_3 2,705 (554 MB/s) 2,538 (591 MB/s) -167 -6.17% x 1.07
bench_mem_convert_str_to_utf16_ascii_1 1,730 (289 MB/s) 1,821 (274 MB/s) 91 5.26% x 0.95
bench_mem_convert_str_to_utf16_ascii_16 2,067 (3870 MB/s) 2,130 (3755 MB/s) 63 3.05% x 0.97
bench_mem_convert_utf16_to_latin1_lossy_1 846 (1182 MB/s) 989 (1011 MB/s) 143 16.90% x 0.86
bench_mem_convert_utf16_to_latin1_lossy_1000 15,605 (64082 MB/s) 16,904 (59157 MB/s) 1,299 8.32% x 0.92
bench_mem_convert_utf16_to_latin1_lossy_15 3,910 (3836 MB/s) 4,001 (3749 MB/s) 91 2.33% x 0.98
bench_mem_convert_utf16_to_latin1_lossy_16 1,823 (8776 MB/s) 1,949 (8209 MB/s) 126 6.91% x 0.94
bench_mem_convert_utf16_to_latin1_lossy_3 1,330 (2255 MB/s) 1,371 (2188 MB/s) 41 3.08% x 0.97
bench_mem_convert_utf16_to_str_ascii_16 4,858 (3293 MB/s) 4,671 (3425 MB/s) -187 -3.85% x 1.04
bench_mem_convert_utf16_to_utf8_ascii_1 2,928 (341 MB/s) 2,987 (334 MB/s) 59 2.02% x 0.98
bench_mem_convert_utf16_to_utf8_ascii_16 3,353 (4771 MB/s) 3,443 (4647 MB/s) 90 2.68% x 0.97
bench_mem_convert_utf8_to_latin1_lossy_30 6,051 (2478 MB/s) 6,189 (2423 MB/s) 138 2.28% x 0.98
bench_mem_convert_utf8_to_utf16_ascii_1 5,670 (88 MB/s) 5,791 (86 MB/s) 121 2.13% x 0.98
bench_mem_copy_ascii_to_ascii_1 923 (541 MB/s) 1,049 (476 MB/s) 126 13.65% x 0.88
bench_mem_copy_ascii_to_ascii_1000 16,087 (31080 MB/s) 20,463 (24434 MB/s) 4,376 27.20% x 0.79
bench_mem_copy_ascii_to_ascii_15 4,893 (1532 MB/s) 4,544 (1650 MB/s) -349 -7.13% x 1.08
bench_mem_copy_ascii_to_ascii_16 945 (8465 MB/s) 973 (8221 MB/s) 28 2.96% x 0.97
bench_mem_copy_ascii_to_ascii_3 1,254 (1196 MB/s) 1,286 (1166 MB/s) 32 2.55% x 0.98
bench_mem_copy_ascii_to_ascii_30 5,151 (2912 MB/s) 4,536 (3306 MB/s) -615 -11.94% x 1.14
bench_mem_copy_basic_latin_to_ascii_1 940 (1063 MB/s) 988 (1012 MB/s) 48 5.11% x 0.95
bench_mem_copy_basic_latin_to_ascii_16 1,340 (11940 MB/s) 1,407 (11371 MB/s) 67 5.00% x 0.95
bench_mem_copy_basic_latin_to_ascii_3 1,422 (2109 MB/s) 1,458 (2057 MB/s) 36 2.53% x 0.98
bench_mem_ensure_utf16_validity_ascii_16 2,962 (5401 MB/s) 3,090 (5177 MB/s) 128 4.32% x 0.96
bench_mem_ensure_utf16_validity_ascii_30 5,429 (5525 MB/s) 5,277 (5685 MB/s) -152 -2.80% x 1.03
bench_mem_ensure_utf16_validity_bmp_16 2,962 (5401 MB/s) 3,090 (5177 MB/s) 128 4.32% x 0.96
bench_mem_ensure_utf16_validity_bmp_30 5,426 (5528 MB/s) 5,276 (5686 MB/s) -150 -2.76% x 1.03
bench_mem_ensure_utf16_validity_latin1_16 2,962 (5401 MB/s) 2,838 (5637 MB/s) -124 -4.19% x 1.04
bench_mem_ensure_utf16_validity_latin1_30 5,406 (5549 MB/s) 5,277 (5685 MB/s) -129 -2.39% x 1.02
bench_mem_is_basic_latin_false_16 1,446 (11065 MB/s) 1,323 (12093 MB/s) -123 -8.51% x 1.09
bench_mem_is_basic_latin_false_30 1,810 (16574 MB/s) 1,934 (15511 MB/s) 124 6.85% x 0.94
bench_mem_is_basic_latin_true_1000 14,091 (70967 MB/s) 16,297 (61360 MB/s) 2,206 15.66% x 0.86
bench_mem_is_str_latin1_true_1 643 (777 MB/s) 709 (705 MB/s) 66 10.26% x 0.91
bench_mem_is_str_latin1_true_1000 28,798 (17362 MB/s) 21,069 (23731 MB/s) -7,729 -26.84% x 1.37
bench_mem_is_str_latin1_true_15 3,124 (2400 MB/s) 3,772 (1988 MB/s) 648 20.74% x 0.83
bench_mem_is_str_latin1_true_3 1,422 (1054 MB/s) 917 (1635 MB/s) -505 -35.51% x 1.55
bench_mem_is_str_latin1_true_30 3,623 (4140 MB/s) 3,550 (4225 MB/s) -73 -2.01% x 1.02
bench_mem_is_utf16_latin1_true_1000 13,932 (71777 MB/s) 14,512 (68908 MB/s) 580 4.16% x 0.96
bench_mem_is_utf8_latin1_false_1000 1,041 (480307 MB/s) 975 (512820 MB/s) -66 -6.34% x 1.07
bench_mem_is_utf8_latin1_false_16 1,041 (7684 MB/s) 974 (8213 MB/s) -67 -6.44% x 1.07
bench_mem_is_utf8_latin1_false_30 1,041 (14409 MB/s) 974 (15400 MB/s) -67 -6.44% x 1.07
bench_mem_is_utf8_latin1_true_15 4,761 (1575 MB/s) 5,403 (1388 MB/s) 642 13.48% x 0.88
bench_mem_is_utf8_latin1_true_16 4,488 (1782 MB/s) 4,890 (1635 MB/s) 402 8.96% x 0.92
bench_mem_is_utf8_latin1_true_3 915 (1639 MB/s) 1,806 (830 MB/s) 891 97.38% x 0.51
bench_mem_utf16_valid_up_to_ascii_3 2,644 (1134 MB/s) 2,579 (1163 MB/s) -65 -2.46% x 1.03
```
Many of the changes are with buffer sizes that are so short that SIMD code isn't used, which suggests flips in branch prediction heuristics or something like that.
The actually worrying cases are these:
```text
bench_mem_copy_ascii_to_ascii_1000 16,087 (31080 MB/s) 20,463 (24434 MB/s) 4,376 27.20% x 0.79
bench_mem_is_basic_latin_true_1000 14,091 (70967 MB/s) 16,297 (61360 MB/s) 2,206 15.66% x 0.86
bench_mem_is_utf16_latin1_true_1000 13,932 (71777 MB/s) 14,512 (68908 MB/s) 580 4.16% x 0.96
```
Oh, and those are at `opt_level=2`, because that's what Firefox builds with.
Which `packed_simd` features are you enabling? Which target? Which `RUSTFLAGS`?
Could you extract the code of those three benchmarks into its own crate?
Also, the worst benchmark appears to be `bench_mem_is_utf8_latin1_true_3`.
I've left a couple of comments on your branch; there is one suspicious function there that might be causing the regression. We should take a closer look at that, comparing the output of the `simd` and `packed_simd` crates for the version with and without `select` (without `select`, both crates should generate exactly the same code).
Which packed_simd features are you enabling? Which target? Which RUSTFLAGS?
The above numbers were with the `into_bits` feature enabled, the `x86_64-unknown-linux-gnu` target, and `-C opt_level=2` in `RUSTFLAGS`, on Ubuntu 18.04 on an Intel Core i7-4770 @ 3.40 GHz.
Could you extract the code of those three benchmarks into its own crate?
I'll try to minimize the test case.
Also the worst benchmark appears to be `bench_mem_is_utf8_latin1_true_3`
Indeed. That benchmark shouldn't even run SIMD code, because 3 < 16, which is why I speculated about prediction heuristics flipping or something.
I've left a couple of comments in your branch
Thanks!
there is one suspicious function there that might be causing the regression
The suspicious function affects only `x-user-defined` to UTF-16 decoding and is performance-neutral. (The change is based on previous advice from you. :-)
I'll try to minimize the test case.
Thank you! Feel free to open an issue in `packed_simd` and we'll look at it.
On aarch64 (ThunderX), the fluctuation of the numbers is a bit larger, so there are more changes in both directions with the 2-percentage-point threshold. I'm not too worried about that.
Instead of `bench_mem_is_utf8_latin1_N` with small N regressing, the closely related `bench_mem_is_str_latin1_true_N` with small N regresses.
(These are with the horizontal max intrinsics instead of the portable horizontal max operations.)
@hsivonen could you try again with the `packed_simd` master branch for both x86 and aarch64?
Maybe there is some inlining issue / target-feature issue popping up, since the `simd` crate uses `platform-intrinsics` and the `packed_simd` crate uses `core::arch`, so maybe `RUSTFLAGS` have something to do with this. As in, maybe the `simd` crate is using a target feature that it shouldn't be allowed to use for a particular CPU, and that just happens to work because the CPUs that you are trying do support the feature? You could also try with `RUSTFLAGS="-C target-cpu=native"` and see how they compare, and maybe then try nailing things down with `RUSTFLAGS="-C target-feature=+sse4.2"` and see what happens.
On `x86_64` the regressions go away when building with the `coresimd` feature and without `RUSTFLAGS`. Up next: figuring out which one of those makes the difference.
It seems that `opt_level=2` is the problem. Whether the `coresimd` feature flag is enabled affects which tests regress, though.
Sorry about the delay. After updating `packed_simd` (but staying on `rustc` from the turn of the month, because packaging for `clippy` blocks updates), I still see perf regressions at `opt_level=2` with `encoding_bench` (and now also a regression of `bench_mem_utf16_valid_up_to_ascii_1000` at `opt_level=3`). However, when trying to create a smaller test case for the `opt_level=2` regressions by extracting the code to a single-file crate, the regressions go away.
Still more to investigate.
Meanwhile: it appears that `cargo` is getting the capability to set a per-crate `opt_level`, so maybe this is going to be solved by building `encoding_rs` at `opt_level=3` in Firefox. OTOH, a build system peer considered `packed_simd`'s compatibility policy of supporting only the latest nightly a blocker, so it seems that migration from `simd` to `packed_simd` has to wait until `packed_simd` makes it into the standard library.
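For what it's worth, the per-crate optimization capability being discussed eventually took roughly this shape in Cargo (shown as a sketch; this syntax was not yet available at the time of this thread):

```toml
# Profile override: build encoding_rs at opt-level 3 even when the rest of
# the tree (e.g. a Firefox build) uses opt-level 2.
[profile.release.package.encoding_rs]
opt-level = 3
```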
OTOH, a build system peer considered packed_simd's compatibility policy with only the latest nightly as a blocker, so it seems that migration from simd to packed_simd has to wait until packed_simd makes it into the standard library.
I thought that was `simd`'s policy also? At least, in the past, IIRC, if a nightly came out that broke `simd`, we would fix it without necessarily keeping compatibility with older nightlies.
It may be that `packed_simd` makes this more of a problem in practice since it is being actively developed.
I thought that was simd's policy also?
It is the `simd` crate's policy, and the `core::arch::{arm, aarch64}` policy as well.
However, when trying to create a smaller test case for the `opt_level=2` regressions by extracting the code to a single-file crate, the regressions go away.
Specifically, the code using `simd` and the code using `packed_simd` compile to identical assembly except for label naming.
I thought that was `simd`'s policy also?
In practice, `simd` isn't being developed further, so it doesn't adopt new Rust features that would require a new compiler version, and, OTOH, `rustc` hasn't broken the relevant parts of `simd` in a long while.
What's the practical breakage situation with `packed_simd`?
What's the practical breakage situation with packed_simd?
The library cannot build on stable Rust, only on nightly. It is still being developed, with new features, bug fixes, and performance fixes being added every now and then. Sometimes these require patching rustc, and we tend to start using the features as soon as the next nightly is released. Until now, the only stakeholders that complain are the `rand` developers, but we tend to let them know when breaking changes happen and we help with the PR required to upgrade. This isn't too hard because `rand` unstable follows the latest nightly anyway.
If a bigger stakeholder (e.g. Servo / Firefox) with stricter "unstable stability requirements" starts depending on `packed_simd`, then we'll have to adapt our workflow around those requirements.
For example, we could add a CI build bot to test the nightly version used by stakeholders, and add any new features that would break that behind a feature gate. Once all stakeholders upgrade, then we could remove those feature gates.
The library cannot build on stable Rust, only on nightly.
Indeed, `packed_simd` doesn't build with 1.29.2 even with `RUSTC_BOOTSTRAP=1`.
Sometimes these require patching rustc, and we tend to then start using the features as soon as the next nightly is released
From a Firefox perspective, it seems to me that the main constraint is at the other end: for how long does a given snapshot of `packed_simd` stay compilable with future versions of Rust?
Firefox CI uses stable Rust and starts requiring a new Rust version two weeks after its release. `encoding_rs` and `simd` opt into nightly features on stable Rust using `RUSTC_BOOTSTRAP=1`. Additionally, Firefox developers may use a recent-ish nightly Rust locally.
As a result, it's impractical to use a crate whose Rust compatibility doesn't span the current stable with `RUSTC_BOOTSTRAP=1` and the current nightly.
My current theory is this:
If a SIMD-using function that's visible outside the crate has been annotated as `#[inline]` to enable cross-crate inlining, and if the benchmarking harness is in a different crate so that cross-crate inlining applies, the version using the `simd` crate gains some advantage that `packed_simd` does not. This advantage is not shown if any of the following are true:
`opt_level=3`
So for now, I consider this a case of invalid benchmarking (inlining resulting in a scenario that's unlikely to be representative of real-world use).
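To illustrate the cross-crate inlining point with a minimal example (the function here is a hypothetical stand-in, not encoding_rs code): without the attribute (or LTO), a non-generic `pub fn` is normally compiled only in its defining crate and a benchmark in another crate calls it through a regular call, whereas `#[inline]` makes it a candidate for inlining into downstream crates, which can change measured performance in ways that don't reflect real-world call sites.

```rust
// Hypothetical example of a public function made eligible for cross-crate
// inlining via #[inline]. A benchmark harness living in a different crate
// may inline this (or not), and how well the inlined copy optimizes can
// vary with opt-level.
#[inline]
pub fn contains_non_ascii(bytes: &[u8]) -> bool {
    bytes.iter().any(|&b| b >= 0x80)
}
```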
In other news, with the `into_bits` feature of `packed_simd` enabled, the `packed_simd` branch of encoding_rs now has no direct calls to `transmute`. Going forward, I'm hoping that the primitive slice's `align_to` and `chunks_exact`, together with `packed_simd`'s load/store-from-slice features, will allow a lot of `unsafe` code to be written as equally-performant safe code.
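As a sketch of that pattern (an ASCII check used as a stand-in, with `u64` standing in for a SIMD lane type; the function name is hypothetical): `align_to` splits a byte slice into an unaligned head, an aligned middle reinterpreted as wider units, and a tail, confining the `unsafe` to the reinterpretation itself.

```rust
// Sketch of the pattern described above: only the slice reinterpretation
// needs `unsafe`; the head and tail fall back to scalar code, and the
// aligned middle is checked a word at a time.
pub fn is_ascii_words(bytes: &[u8]) -> bool {
    // Safety: any byte pattern is a valid u64, so reinterpreting the
    // aligned middle of a byte slice as u64s is sound.
    let (head, middle, tail) = unsafe { bytes.align_to::<u64>() };
    head.iter().all(|&b| b < 0x80)
        // The mask tests the high bit of each of the eight bytes at once.
        && middle.iter().all(|&word| word & 0x8080_8080_8080_8080 == 0)
        && tail.iter().all(|&b| b < 0x80)
}
```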
For how long does a given snapshot of packed_simd stay compilable with future versions of Rust?
For as long as Rust does not introduce a change that breaks `packed_simd`. `packed_simd` is a Rust 2018 crate, so there have been a lot of these changes in the last couple of months as nightly features have been renamed, stabilized, dropped, etc.
Otherwise, there are periods of higher "rustc SIMD activity", where newer SIMD features are implemented over a couple of weeks, typically followed by a couple of months of reaping the results.
Trying to build `encoding_rs` with `packed_simd` 0.3.1 for the `thumbv7neon` Android target looks like this:
```text
$ cargo build --target thumbv7neon-linux-androideabi --release --features simd-accel
   Compiling encoding_rs v0.8.14 (/opt/Projects/encoding_rs)
   Compiling packed_simd v0.3.1
error[E0573]: expected type, found local variable `a`
  --> src/simd_funcs.rs:315:45
   |
315 | let first: u8x16::from_bits(a);
   |                             ^ not a type
error[E0573]: expected type, found local variable `b`
  --> src/simd_funcs.rs:316:46
   |
316 | let second: u8x16::from_bits(b);
   |                              ^ not a type
error: parenthesized parameters may only be used with a trait
  --> src/simd_funcs.rs:315:44
   |
315 | let first: u8x16::from_bits(a);
   |                            ^^^
   |
   = note: #[deny(parenthesized_params_in_types_and_modules)] on by default
   = warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
   = note: for more information, see issue #42238 <https://github.com/rust-lang/rust/issues/42238>
error: parenthesized parameters may only be used with a trait
  --> src/simd_funcs.rs:316:45
   |
316 | let second: u8x16::from_bits(b);
   |                             ^^^
   |
   = warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
   = note: for more information, see issue #42238 <https://github.com/rust-lang/rust/issues/42238>
error[E0223]: ambiguous associated type
  --> src/simd_funcs.rs:315:28
   |
315 | let first: u8x16::from_bits(a);
   |            ^^^^^^^^^^^^^^^^^^^ help: use fully-qualified syntax: `<packed_simd::Simd<[u8; 16]> as Trait>::from_bits`
error[E0223]: ambiguous associated type
  --> src/simd_funcs.rs:316:29
   |
316 | let second: u8x16::from_bits(b);
   |             ^^^^^^^^^^^^^^^^^^^ help: use fully-qualified syntax: `<packed_simd::Simd<[u8; 16]> as Trait>::from_bits`
error: aborting due to 6 previous errors
Some errors occurred: E0223, E0573.
For more information about an error, try `rustc --explain E0223`.
error: Could not compile `encoding_rs`.
To learn more, run the command again with --verbose.
```
Compiling for x86_64 works OK. Any ideas why the Android target fails?
The failing function is:
```rust
#[inline(always)]
pub fn simd_pack(a: u16x8, b: u16x8) -> u8x16 {
    unsafe {
        let first: u8x16::from_bits(a);
        let second: u8x16::from_bits(b);
        shuffle!(
            first,
            second,
            [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
        )
    }
}
```
Which previously compiled on aarch64. I'm going to try again on that ISA.
Which previously compiled on aarch64. I'm going to try again on that ISA.
I think you want `let first = u8x16::from_bits(a);` instead of `let first: u8x16::from_bits(a);` 🤣 I really do hope that this code did not use to compile before 😛
Thanks. I really thought I had compiled this code on aarch64 previously, but I guess not.
FWIW the error message could have been a lot clearer, you might want to fill in an issue in rust-lang/rust upstream about it.
OK. With this problem fixed, it compiles for `thumbv7neon`, but compiling for `aarch64` fails, because the compiler tries to make a 1 GB memory allocation when compiling `packed_simd` 0.3.1.
Is this a previously known problem?
```text
memory allocation of 1073741824 bytes failed
error: Could not compile `packed_simd`.

Caused by:
  process didn't exit successfully: `rustc --edition=2018 --crate-name packed_simd /home/hsivonen/.cargo/registry/src/github.com-1ecc6299db9ec823/packed_simd-0.3.1/src/lib.rs --color always --crate-type lib --emit=dep-info,link -C debuginfo=2 --cfg 'feature="default"' --cfg 'feature="into_bits"' -C metadata=2b4eed31d1b1027b -C extra-filename=-2b4eed31d1b1027b --out-dir /home/hsivonen/Projects/encoding_rs/target/debug/deps -L dependency=/home/hsivonen/Projects/encoding_rs/target/debug/deps --extern cfg_if=/home/hsivonen/Projects/encoding_rs/target/debug/deps/libcfg_if-e6de5e9a7e5221ad.rlib --cap-lints allow` (signal: 6, SIGABRT: process abort signal)
```
Is this a previously known problem?
I haven't seen that one before; which target are you using? `aarch64-unknown-linux-android`? It does not surprise me much, since rustc consumes a lot of memory when compiling `packed_simd`, but compilation passes on targets without memory overcommit (like Windows) on constrained platforms (like AppVeyor), so 1 GB does look like a lot here.
I haven't seen that one before, which target are you using?
Self-hosted `aarch64-unknown-linux-gnu`.
Hmm. Trying to compile on an x86_64 host, both for aarch64 and for x86_64, shows that rustc's memory usage climbs to a bit over 2 GB when compiling `packed_simd` 0.3.1.
I'm pretty sure that 0.3.0 with an earlier rustc didn't have this problem, but given the earlier comments today, I no longer trust myself to have actually built this code on aarch64 previously.
I'll see if I can buy more RAM for my aarch64 VM, since it's annoying to run `cargo bench` with cross-compilation.
Sadly, it is not easy to build only a "subset" of `packed_simd` (say, just 128-bit wide vector support).
I've been thinking about making `packed_simd` "lazy", such that code is monomorphized when a user uses a type instead of always at build time of `packed_simd`, but that's a trade-off.
Sadly it is not easy to only build a "subset" of `packed_simd` (say just 128-bit wide vector support).
Maybe I should try commenting out code to get some results.
I'll see if I can buy more RAM for my aarch64 VM
The provider represents their aarch64 capacity as "out of stock", so can't get more RAM today.
Maybe I should try commenting out code to get some results.
You can try commenting out certain APIs, e.g., in `src/api.rs`, but commenting out vector widths won't easily work (some features, like vectors of pointers, couple vectors of different lengths).
The build-time RAM requirement can be addressed with swap space plus waiting for the resulting IO, so it seems that building `packed_simd` at present doesn't impose a hard requirement on physical RAM.
FWIW the error message could have been a lot clearer, you might want to fill in an issue in rust-lang/rust upstream about it.
In case ripgrep users find their way here:
This Firefox bug has "Depends on" and "See also" fields that point to relevant bugs/issues that are blocking migration at present.
`thumbv7neon-unknown-linux-gnueabihf` is now available, but I'm still seeing perf regressions on ARMv7 compared to the `simd` crate. I'll try to run the benchmarks again to be sure. This was with the crates.io release 0.3.1. I'll try with a git clone.
So far, I don't see a simple characterization of the regression other than that it seems to generally affect things that I expect to correlate with horizontal boolean reduction and does not seem to really affect workloads that are mostly about shuffles.
correlate with horizontal boolean reduction
From which types (integers, floats, etc.) are the masks created?
Unrelated update, some stdsimd refactorings have landed in nightly, and packed_simd should start to build again properly soon.