image-rs / image-png

PNG decoding and encoding library in pure Rust
https://docs.rs/png
Apache License 2.0

Consider runtime CPU feature detection #514

Closed. Shnatsel closed this 1 month ago.

Shnatsel commented 1 month ago

I've tried the multiversion crate on top of #513 just to see what would happen.

In unfiltering, only Paeth benefits. On stable, the only important change is bpp=4 improving by 15%, while bpp=1 and bpp=3 are unchanged. bpp=2 regresses while 6 and 8 improve, but those widths only occur in 16-bit images, so they don't matter as much.

Explicit SIMD benefits a bit more, with +7% on 3 bpp and +17% on 4, 6 and 8. The same regression shows up on bpp=2, but 2 is rare, so who cares.

There is no benefit to using AVX or AVX2: SSE4.1 is sufficient to get optimal performance in unfiltering. This is not surprising, since we literally never use 256-bit vectors in unfiltering. AVX might still be of use in filtering, because filtering does operate on 256-bit wide chunks, but we don't have any benchmarks for those at present, so I haven't tried that yet.
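For reference, applying the crate looks roughly like this. This is a minimal sketch, not the actual #513 code: the kernel is an illustrative scalar Paeth unfilter hardcoded to bpp=4, and it assumes multiversion 0.7's `targets(...)` attribute syntax. The macro emits one clone of the function per listed target plus a fallback, and dispatches based on runtime CPU feature detection.

```rust
use multiversion::multiversion;

// Illustrative stand-in for the real unfilter kernel. Within each clone the
// compiler may auto-vectorize using that target's features (e.g. SSE4.1).
#[multiversion(targets("x86_64+avx2", "x86_64+sse4.1"))]
fn unfilter_paeth_bpp4(row: &mut [u8], prev: &[u8]) {
    for i in 0..row.len() {
        let a = if i >= 4 { row[i - 4] } else { 0 }; // left neighbor
        let b = prev[i]; // above
        let c = if i >= 4 { prev[i - 4] } else { 0 }; // upper-left
        row[i] = row[i].wrapping_add(paeth_predictor(a, b, c));
    }
}

// Paeth predictor as defined by the PNG specification.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    let p = a as i16 + b as i16 - c as i16;
    let (pa, pb, pc) = ((p - a as i16).abs(), (p - b as i16).abs(), (p - c as i16).abs());
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```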

Full unfiltering benchmark results on #513, stable:

```
unfilter/filter=Paeth/bpp=1
                        time:   [7.5264 µs 7.5265 µs 7.5265 µs]
                        thrpt:  [519.00 MiB/s 519.00 MiB/s 519.01 MiB/s]
                 change:
                        time:   [-0.2168% -0.2042% -0.1927%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1931% +0.2046% +0.2173%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

unfilter/filter=Paeth/bpp=2
                        time:   [12.690 µs 12.704 µs 12.724 µs]
                        thrpt:  [613.99 MiB/s 614.96 MiB/s 615.66 MiB/s]
                 change:
                        time:   [+28.839% +28.938% +29.086%] (p = 0.00 < 0.05)
                        thrpt:  [-22.532% -22.444% -22.384%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

unfilter/filter=Paeth/bpp=3
                        time:   [12.801 µs 12.801 µs 12.802 µs]
                        thrpt:  [915.42 MiB/s 915.44 MiB/s 915.46 MiB/s]
                 change:
                        time:   [-0.0576% -0.0543% -0.0509%] (p = 0.00 < 0.05)
                        thrpt:  [+0.0509% +0.0543% +0.0576%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

unfilter/filter=Paeth/bpp=4
                        time:   [11.142 µs 11.144 µs 11.147 µs]
                        thrpt:  [1.3689 GiB/s 1.3692 GiB/s 1.3695 GiB/s]
                 change:
                        time:   [-13.170% -13.092% -13.026%] (p = 0.00 < 0.05)
                        thrpt:  [+14.977% +15.064% +15.168%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

unfilter/filter=Paeth/bpp=6
                        time:   [11.292 µs 11.292 µs 11.292 µs]
                        thrpt:  [2.0270 GiB/s 2.0270 GiB/s 2.0270 GiB/s]
                 change:
                        time:   [-11.799% -11.784% -11.772%] (p = 0.00 < 0.05)
                        thrpt:  [+13.342% +13.359% +13.377%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

unfilter/filter=Paeth/bpp=8
                        time:   [11.263 µs 11.267 µs 11.271 µs]
                        thrpt:  [2.7077 GiB/s 2.7087 GiB/s 2.7095 GiB/s]
                 change:
                        time:   [-25.694% -25.606% -25.533%] (p = 0.00 < 0.05)
                        thrpt:  [+34.287% +34.419% +34.578%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
```
Full unfiltering benchmark results on #512, nightly:

```
unfilter/filter=Paeth/bpp=1
                        time:   [7.4692 µs 7.4893 µs 7.5095 µs]
                        thrpt:  [520.18 MiB/s 521.58 MiB/s 522.98 MiB/s]
                 change:
                        time:   [-0.3953% -0.2773% -0.1736%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1739% +0.2780% +0.3969%]
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

unfilter/filter=Paeth/bpp=2
                        time:   [12.572 µs 12.585 µs 12.599 µs]
                        thrpt:  [620.10 MiB/s 620.80 MiB/s 621.44 MiB/s]
                 change:
                        time:   [+29.756% +30.235% +30.720%] (p = 0.00 < 0.05)
                        thrpt:  [-23.501% -23.215% -22.932%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

unfilter/filter=Paeth/bpp=3
                        time:   [10.142 µs 10.147 µs 10.153 µs]
                        thrpt:  [1.1272 GiB/s 1.1278 GiB/s 1.1284 GiB/s]
                 change:
                        time:   [-7.2193% -7.1641% -7.1059%] (p = 0.00 < 0.05)
                        thrpt:  [+7.6494% +7.7169% +7.7810%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild

unfilter/filter=Paeth/bpp=4
                        time:   [8.9340 µs 8.9373 µs 8.9433 µs]
                        thrpt:  [1.7062 GiB/s 1.7073 GiB/s 1.7080 GiB/s]
                 change:
                        time:   [-14.910% -14.813% -14.733%] (p = 0.00 < 0.05)
                        thrpt:  [+17.279% +17.389% +17.523%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

unfilter/filter=Paeth/bpp=6
                        time:   [9.7180 µs 9.7236 µs 9.7307 µs]
                        thrpt:  [2.3522 GiB/s 2.3539 GiB/s 2.3552 GiB/s]
                 change:
                        time:   [-14.878% -14.698% -14.543%] (p = 0.00 < 0.05)
                        thrpt:  [+17.017% +17.231% +17.478%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

unfilter/filter=Paeth/bpp=8
                        time:   [8.9602 µs 8.9612 µs 8.9624 µs]
                        thrpt:  [3.4051 GiB/s 3.4055 GiB/s 3.4059 GiB/s]
                 change:
                        time:   [-14.968% -14.708% -14.522%] (p = 0.00 < 0.05)
                        thrpt:  [+16.990% +17.244% +17.603%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
```

The drawbacks of multiversioning are slightly increased binary size and some trivial unsafe code in a dependency, but we get to keep our #![forbid(unsafe_code)].

I'm considering adding an optional feature for multiversioning.
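A sketch of what that could look like, assuming a hypothetical optional dependency and feature both named `multiversion` (and a Sub-filter kernel purely for illustration): `cfg_attr` applies the macro only when the feature is enabled, so default builds pull in no new unsafe code.

```rust
// Hypothetical setup: Cargo.toml declares
//   multiversion = { version = "0.7", optional = true }
// Without the feature, this compiles as a plain safe function.
#[cfg_attr(
    feature = "multiversion",
    multiversion::multiversion(targets("x86_64+sse4.1"))
)]
fn unfilter_sub(bpp: usize, row: &mut [u8]) {
    // The Sub filter stores each byte as a delta against the byte `bpp`
    // positions to its left; undo it with a wrapping add.
    for i in bpp..row.len() {
        row[i] = row[i].wrapping_add(row[i - bpp]);
    }
}
```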

@anforowicz would this be of interest to Chromium? Right now Chromium only mandates SSE3, while we get improved performance from SSE4.1, so it should be beneficial. Are these gains meaningful? Would this kind of transitive unsafe be trivial enough to enable?

fintelia commented 1 month ago

I'm not thrilled about introducing proc-macro generated unsafe code. At the moment we've reduced the (non-std) transitive uses of unsafe to only adler32 and CRC computations, and the former at least may plausibly be replaced by std::simd usage in the not too distant future.

Shnatsel commented 1 month ago

I think multiversioning is low-risk as far as unsafe code goes because unlike checksum computation it never operates on attacker-controlled values.

fintelia commented 1 month ago

Assuming the performance is compelling enough, I think multiversion is fine as an optional disabled-by-default feature.

I have no reason to believe multiversion has any soundness issues, but to explain a bit why I feel slightly uncomfortable with it: it doesn't operate on attacker-controlled program inputs, which certainly reduces risk, but IMO generating the unsafe blocks via proc macros adds its own risks, or at least increases the difficulty of auditing for the remaining ones. Consider, for example, this unsafe block: you can't simply look at the immediate surrounding context to figure out whether it is OK, but rather need a broad understanding of the rest of the crate to convince yourself.

Another approach is inspecting the output after macro expansion, but that doesn't tell you whether seemingly innocuous changes to the macro expansion site could result in different and unsound code being produced.
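To make the shape of the problem concrete, the expanded code looks roughly like the hand-written sketch below (x86_64-only, with made-up names; multiversion's real output differs and caches the detection result rather than re-checking on every call). The unsafe call is sound only because the runtime check matches the `target_feature` annotation, and that pairing is upheld by the macro's code generation rather than by anything visible in the block itself.

```rust
fn unfilter_paeth(row: &mut [u8], prev: &[u8]) {
    // In the real expansion the function body is cloned here, so the
    // compiler is allowed to emit SSE4.1 instructions for this clone.
    #[target_feature(enable = "sse4.1")]
    unsafe fn sse41(row: &mut [u8], prev: &[u8]) {
        unfilter_paeth_fallback(row, prev)
    }

    if is_x86_feature_detected!("sse4.1") {
        // SAFETY: only sound because the runtime check above matches the
        // `target_feature` annotation on `sse41`.
        unsafe { sse41(row, prev) }
    } else {
        unfilter_paeth_fallback(row, prev)
    }
}

fn unfilter_paeth_fallback(row: &mut [u8], prev: &[u8]) {
    // Scalar implementation elided; any safe implementation goes here.
    let _ = (row, prev);
}
```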

anforowicz commented 1 month ago

> @anforowicz would this be of interest to Chromium?

I don't have a strong opinion.

(*) I understand that auditability is not a black-and-white property of a crate, but it really helps when:

And the example linked by @fintelia above illustrates how an audit may be tricky.

HeroicKatora commented 1 month ago

I understand the concern about auditability, but I don't really see optional dependencies as a particularly good way of addressing it.

We do include the standard library, so clearly the presence of unsafe code is not itself cause for concern about a library; what matters is a dependency's willingness and ability to ensure sound implementations of its unsafe usage. I like some unsafe code. For instance, bytemuck perfectly fills the blank (temporal) space between a usable and highly useful abstraction now and some future built-in safe transmutation. I don't want to operate under a requirement of it becoming a full replacement, and really the semantic difference will be slim, if any. The ability to encapsulate unsafe code behind a sound interface is a large selling point. Inclusion of the encapsulation into std or the compiler mainly gets more eyeballs in review, and sometimes a better understanding of under-specified boundaries.
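As a concrete instance of that encapsulation (a minimal free-standing example, not code from this crate): bytemuck's safe casts wrap what would otherwise be an unsafe transmute in compile-time trait bounds plus runtime length and alignment checks.

```rust
// Callers stay entirely in safe code; bytemuck::cast_slice panics if the
// slice's length or alignment doesn't permit the cast, and its trait bounds
// rule out types with padding or invalid bit patterns.
fn bytes_as_rgba_words(bytes: &[u8]) -> &[u32] {
    bytemuck::cast_slice(bytes)
}

fn main() {
    let pixels = [0xff00_00ffu32; 4]; // u32 backing gives suitable alignment
    let bytes: &[u8] = bytemuck::cast_slice(&pixels);
    assert_eq!(bytes_as_rgba_words(bytes), pixels.as_slice());
}
```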

Runtime feature detection as in multiversion seems very similar to the safe-transmute issue: the (necessary) soundness requirements are clear. It's also quite obvious why core would be cautious about providing intrinsic support: the design space for additional function attributes is just too large.

As a counter-example, I'd very much like to avoid dependencies such as lexical-core<1.0. The amount of unsafe was a concern, but not the problem per se: they were playing fast and loose with regard to verification. We should somehow be able to qualify that, rather than the mere existence of unsafe. At the same time, the question of necessity was never posed; judging by the reactions to the RUSTSEC advisory, a number of users were able to remove the dependency outright.

I'm not in favor of optional dependencies as a solution. It's a weak mitigation in itself against the purely potential threat of unsafe code, imo. What would downstream code do if unsound code were discovered? Removing the feature is a patch, but then again we or the underlying dependency should patch too, so this sounds quite redundant. And is such a patch guaranteed to work? If we add a feature for each of these dependencies, we've extended the library's configuration matrix considerably. Is everything still guaranteed to work?

Alternative to consider:

A team of unsafe reviewers, e.g. cargo-crev or similar for unsafe dependencies, with a CI bot to trigger re-reviews whenever dependencies introduce or change unsafe code. Try not to repeat my mistake. And since this is additional effort, I'd like to find out whether the Chrome Security team can somehow contribute to such reviews, as they could be shared (though not so much that we end up with a monoculture, if we're considering resilience). For instance, we should track each release of bytemuck, if possible proactively. Identifying the dependencies that do need such reasoned review is half the battle here.

Shnatsel commented 1 month ago

Google already publishes their security team's audits through cargo vet: https://opensource.googleblog.com/2023/05/open-sourcing-our-rust-crate-audits.html

Shnatsel commented 1 month ago

Okay, so I've experimented with this a bit in #515. This is what the trade-offs look like:

In decoding, multiversioning gets up to 4% better end-to-end performance on large images that use the Paeth filter, on some CPUs. On CPUs that do not benefit, there is a potential ~5% end-to-end improvement from multiversioning something other than the filters (source). I'd say we shouldn't pursue multiversioning in decoding while other opportunities for performance improvements are available, such as https://github.com/image-rs/fdeflate/pull/31. The development effort is better spent there, not on multiversioning.

There are promising results for encoding, but not from multiversioning the filters. This might be worth following up on, but AFAIK Chromium doesn't use this crate for encoding, so it won't be relevant to Skia/Chromium use cases.

Shnatsel commented 1 month ago

I've tried compiling the entire decode+encode roundtrip with RUSTFLAGS='-C target-cpu=x86-64-v3' and got a 10% end-to-end speedup when using fast compression and adaptive filtering, but the gain is not localized to any single function. The profiles look almost identical, barring a slight reduction in filtering time that cannot account for the difference on its own.

Baseline: https://share.firefox.dev/4dzkdW8 V3: https://share.firefox.dev/4dCiWhc

Code used for measurements:

```rust
use std::{fs::File, io::BufWriter};

fn main() {
    // Loop 10 times to get lots of samples and a clearer profile.
    for _i in 0..10 {
        let filename = std::env::args_os().nth(1).expect("No input file provided");
        let mut decoder = png::Decoder::new(File::open(filename).unwrap());
        // Set the decoder to expand paletted images into RGBA
        // instead of simply outputting palette indices.
        decoder.set_transformations(png::Transformations::EXPAND);
        let mut reader = decoder.read_info().unwrap();
        // Allocate the output buffer.
        let mut buf = vec![0; reader.output_buffer_size()];
        // Read the next frame. An APNG might contain multiple frames.
        let info = reader.next_frame(&mut buf).unwrap();

        let mut out_info = png::Info::default();
        out_info.width = info.width;
        out_info.height = info.height;
        out_info.color_type = info.color_type;
        out_info.bit_depth = info.bit_depth;

        let out_file = File::create("/tmp/output.png").unwrap();
        let w = &mut BufWriter::new(out_file);
        let mut encoder = png::Encoder::with_info(w, out_info).unwrap();
        encoder.set_compression(png::Compression::Fast);
        encoder.set_adaptive_filter(png::AdaptiveFilterType::Adaptive);
        let mut writer = encoder.write_header().unwrap();
        writer.write_image_data(&buf).unwrap();
    }
}
```

So it seems we won't be able to get these gains via multiversioning, or at least not with anything short of cargo multivers.

Therefore, given the concerns about the unsafe code generated by proc macros, I don't think there is a sufficiently strong performance benefit to justify multiversioning at this time.