Before merging I want to have unit tests for that function, and I want to clean it up, deduplicate the code, and make it rusty.
Also, I wonder whether the batch size of 4 can allow the compiler to optimize away all of the chunking logic in `HalfFloatSliceExt::convert_from_f32_slice`?

Unfortunately we need to define a batch size in order to avoid allocating a temporary buffer on the heap.
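To make the idea concrete, here is a minimal sketch of batched conversion through a fixed-size stack buffer (the function name, signature, and structure are assumptions for illustration, not the PR's actual code; `convert_from_f32_slice` is the `half` API mentioned above):

```rust
use half::f16;
use half::slice::HalfFloatSliceExt;

const BATCH_SIZE: usize = 4; // small enough for the stack, large enough for SIMD

// Hypothetical sketch: convert f32 samples to f16 in fixed-size batches
// through stack arrays, so no temporary heap buffer is needed.
fn convert_batched(samples: impl Iterator<Item = f32>, output: &mut Vec<f16>) {
    let mut f32_batch = [0.0_f32; BATCH_SIZE];
    let mut f16_batch = [f16::ZERO; BATCH_SIZE];
    let mut filled = 0;

    for sample in samples {
        f32_batch[filled] = sample;
        filled += 1;

        if filled == BATCH_SIZE {
            // contiguous slice conversion, eligible for SIMD inside `half`
            f16_batch.convert_from_f32_slice(&f32_batch);
            output.extend_from_slice(&f16_batch);
            filled = 0;
        }
    }

    // convert the final, possibly partial batch
    f16_batch[..filled].convert_from_f32_slice(&f32_batch[..filled]);
    output.extend_from_slice(&f16_batch[..filled]);
}
```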
Benchmarks: with `half = {git = "https://github.com/starkat99/half-rs", features = ["use-intrinsics"]}` in `Cargo.toml`.
I'm seeing the conversion time drop so much that it becomes unnoticeable, indistinguishable from the regular `f32` to `f32` codepath.
The `f32` to `u32` path doesn't get any faster, however. Maybe autovectorization doesn't kick in. I've tried with `-C target-cpu=native` for comparison, no difference.
ARM has native instructions for casting from `f32` to unsigned integers (details). This would explain why I'm seeing good results for `f32` to `u32` conversion on ARM, even without this change.
I haven't found any mentions of native `f32` to `u32` casts on x86. ChatGPT mentioned some, but I think it's making them up; when I corrected it, it said this:

Or here's a human discussion of a similar conversion (albeit to `u8`): https://stackoverflow.com/questions/29856006/sse-intrinsics-convert-32-bit-floats-to-unsigned-8-bit-integers
This also regresses the fallback path on `half` 2.2.0, but not on the latest git; we will need to switch to the latest `half` so that we don't introduce regressions when intrinsics are not available.
Yeah, there seems to be no native conversion from `f32` to `u32` on x86_64 without AVX-512: https://stackoverflow.com/questions/41144668/how-to-efficiently-perform-double-int64-conversions-with-sse-avx

There is a conversion from `f32` to `i32`, but that would truncate large numbers that fit into `u32` but not `i32`, and therefore produce incorrect results (although `u32` cannot represent all `f32` values in the first place).
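To illustrate the truncation issue with plain Rust casts (Rust's `as` float-to-int casts saturate since 1.45; this is separate from the intrinsics discussed here):

```rust
fn main() {
    // 3_000_000_000 fits into u32 but not into i32,
    // and happens to be exactly representable as an f32
    let x: f32 = 3_000_000_000.0;

    // a direct saturating cast preserves the value
    assert_eq!(x as u32, 3_000_000_000_u32);

    // going through i32 first saturates at i32::MAX and loses the value
    assert_eq!(x as i32, i32::MAX);
    assert_eq!((x as i32) as u32, 2_147_483_647_u32); // wrong result
}
```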
Why is `u32` the chosen format, anyway?
Thanks for the feedback! Conversions from and to `u32` are only there for completeness. I don't expect them to ever happen in the real world. We should not worry about inferior performance here at all. As you said, it's not even accurate for large numbers.
> I'm seeing the conversion time drop so much that it becomes unnoticeable, indistinguishable from the regular `f32` to `f32` codepath.
Awesome, didn't expect that much of a speed up!
Do we have any regression concerning `f32` -> `f32`?
> Why is `u32` the chosen format, anyway?
The purpose of `u32` samples is to assign unique IDs to different areas in the image. The common case, RGBA images, are either full `f32` or full `f16`. In the cases where `u32` is used, it is certainly planned and not converted to any float type.
> Do we have any regression concerning `f32` -> `f32`?
Nope, it's the exact same on my machine. I guess the buffer does fit entirely into the L1 cache, it's not big.
Added a to-do list in the PR text. Anything else to add to that list?
The necessary changes to `half` have shipped in version 2.2.1.
The code is not pretty, but it will be good enough for the moment. If there is no performance regression, I will merge it. @Shnatsel do you want to make sure it is still properly optimized after the refactorings?
You should see the difference on `cargo bench` yourself, but sure, I can take a look over the weekend.
Yes, I was planning to see the benchmarks, sure :)
Don't worry, take the time you need, does not have to be this weekend :)
Strangely, on my machine there is no performance gain whatsoever; within the margin of error, it is the same, if not worse. Note that this is the first time I benchmarked the code for this PR. All benchmarks used the newest version of `half`; I pulled only the function implementation from different commits. Ran `cargo bench` on a Windows 11 machine with an Intel i7 8th gen, most up-to-date rustc.
```
# no batch conversion
read_image_rgba_f32_to_f32 ... bench: 19,554,980 ns/iter (+/- 2,367,993)
read_image_rgba_f32_to_f16 ... bench: 40,168,210 ns/iter (+/- 1,880,333)
read_image_rgba_f32_to_u32 ... bench: 23,609,890 ns/iter (+/- 2,525,909)

# with batch conversion, dirty
read_image_rgba_f32_to_f32 ... bench: 19,643,090 ns/iter (+/- 1,126,170)
read_image_rgba_f32_to_f16 ... bench: 45,213,290 ns/iter (+/- 2,912,926)
read_image_rgba_f32_to_u32 ... bench: 26,500,130 ns/iter (+/- 2,043,967)

# with batch conversion, refactored
read_image_rgba_f32_to_f32 ... bench: 19,653,510 ns/iter (+/- 2,083,299)
read_image_rgba_f32_to_f16 ... bench: 43,848,780 ns/iter (+/- 2,427,347)
read_image_rgba_f32_to_u32 ... bench: 26,658,960 ns/iter (+/- 3,115,560)
```
If there is something the library user (or I) needs to do for good performance, this is the perfect time for some documentation :) Some flags anywhere? Maybe target CPU set to native? Maybe a feature flag in `half`?
Yes, you need to enable the `use-intrinsics` feature in `half` for performance. It currently requires nightly, but will be stabilized in the upcoming Rust 1.68.

It's possible to make the implementation for 64-bit ARM work on stable, but that's not yet merged into `half`: https://github.com/starkat99/half-rs/issues/63
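For reference, the interim setup in a downstream `Cargo.toml` would presumably look like this (a sketch; the exact version requirement is an assumption, pick whatever 2.2.x release is current):

```toml
[dependencies]
# nightly-only until the conversion intrinsics stabilize
half = { version = "2.2", features = ["use-intrinsics"] }
```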
Latest code by default:

```
test read_image_rgba_f32_to_f16 ... bench: 38,017,728 ns/iter (+/- 132,337)
```

With the `use-intrinsics` feature:

```
test read_image_rgba_f32_to_f16 ... bench: 14,419,196 ns/iter (+/- 146,359)
```
Also, there's no need to set `-Ctarget-cpu=native`, since `half` performs runtime feature detection.
ARM support is coming in the next `half` release, which will require Rust 1.69: https://github.com/starkat99/half-rs/issues/63#issuecomment-1399847097
I propose we introduce a feature flag to `exr`. The feature flag would enable `use-intrinsics` in `half`, but may be used for other purposes in the future. This way, people can still use the newest library with a slightly older compiler. Maybe a flag named `nightly`? Or `bleeding-edge`?
I suggest we simply wait for the next `half` release and switch to it once it's released. I understand it won't require any additional features.
I don't think exposing an `exr` feature that's only going to be relevant until Rust 1.69 ships is a good idea. Better to document the interim workaround of depending on `half` directly and enabling the feature there.

For context, `half` is going to do a major version bump with the stabilization of x86 intrinsics and the addition of ARM intrinsics. I don't think there's a way to select a major version of a library depending on a feature.
Speaking of which, since it will take until Rust 1.69 to reap the benefits of this change on stable, it might be better to make a new release without this PR. We can merge it immediately afterwards, and cut a new release once Rust 1.69 has shipped.
Do I understand correctly that the newest Rust version is only required when we enable the `use-intrinsics` flag in `half`? In that case, if we leave the decision up to the user, it will be fine to just update the `half` dependency and let the people who have access to nightly Rust enable the flag.
Yes, you understand correctly - that is the behavior of the currently released 2.2.1 version. A version requiring 1.69 will ship at a later date and will likely either enable the feature by default or remove it outright. It will also come with a major semver bump in `half`.
aight, Rust 1.68 is here now!
> you need to enable the `use-intrinsics` feature in `half` for performance. It currently requires nightly, but will be stabilized in the upcoming Rust 1.68.
I'll try to find out whether we need a new release of `half` for the stable feature flag.
I believe stabilization of the required intrinsics got pushed back to 1.69, let me double-check that.
Yes, I don't think I saw any intrinsics mentioned in the changelog.
So the stabilization PR is https://github.com/rust-lang/stdarch/pull/1366
It clearly didn't ship in 1.68
`half` 2.3.1 got rid of the feature flag and enabled the use of intrinsics by default, so no changes to `half` feature flags are needed on our side. This PR should be good to go, it just needs unit tests.
great news, now we finally see the expected speedups in stable without setting a flag :) nice
`cargo bench` pixel format conversions:

```
new read_image_rgba_f32_to_f16 ... bench: 18,413,400 ns/iter (+/- 2,459,747)
old read_image_rgba_f32_to_f16 ... bench: 29,149,950 ns/iter (+/- 3,587,253)
```
These big speedups apply under the following conditions:
In other situations, the speedups will be smaller, if any.
The other formats seem to have no significant regression:

```
new test read_image_rgba_f32_to_f32 ... bench: 20,100,990 ns/iter (+/- 2,852,151)
old test read_image_rgba_f32_to_f32 ... bench: 19,364,750 ns/iter (+/- 2,143,720)
new test read_image_rgba_f32_to_u32 ... bench: 26,969,190 ns/iter (+/- 2,512,176)
old test read_image_rgba_f32_to_u32 ... bench: 26,253,080 ns/iter (+/- 3,488,182)
```
Thanks for the review :)
I forced `half` to the newest version, to avoid regressions, and added a unit test.

Ready to merge. Any objections?
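For reference, a round-trip test over the `half` slice conversion could look roughly like this (a sketch; the actual unit test added in the PR may differ):

```rust
#[cfg(test)]
mod tests {
    use half::f16;
    use half::slice::HalfFloatSliceExt;

    #[test]
    fn f32_to_f16_slice_roundtrip() {
        // an odd length exercises the final partial batch
        let values: Vec<f32> = (0..37).map(|i| i as f32 * 0.5).collect();

        let mut halves = vec![f16::ZERO; values.len()];
        halves.convert_from_f32_slice(&values);

        // all chosen values are exactly representable in f16
        for (h, v) in halves.iter().zip(&values) {
            assert_eq!(h.to_f32(), *v);
        }
    }
}
```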
I would require `half` 2.2 but put 2.3.1 in the `Cargo.lock`, to avoid breaking existing users by enforcing a higher MSRV, and add a note to the README about performance. IMO a slower `f16` conversion is not as critical as the code failing to build in the first place.
> no compression used
I believe I changed those tests to read to `f32` rather than `f16`, which is why you're not seeing a speedup there. If you change them back to `f16` you will see a 9,000,000 ns/iter difference there too, because the format conversion is single-threaded.

I don't have a strong opinion on bumping `half`, however.
I have no objections to merging this PR.
okay :)
One more observation about the benchmarks: reading an image with `f16` to `f32` conversion is now as fast as reading `f32` to `f32` (no conversion). Awesome!
Good point about the MSRV, I didn't know that it changed. This repo has a CI test that checks that the MSRV is still valid - it seems this doesn't work correctly.
Added more benchmarks, everything looks as expected still. neat!
```
test read_f16_as_f16_uncompressed_1thread ... bench: 11,665,580 ns/iter (+/- 2,527,286)
test read_f16_as_u32_uncompressed_1thread ... bench: 11,732,710 ns/iter (+/- 957,454)
test read_f16_as_f32_uncompressed_1thread ... bench: 11,661,750 ns/iter (+/- 716,012)
test read_f16_as_f16_zip_nthreads ... bench: 13,345,020 ns/iter (+/- 1,558,281)
test read_f16_as_f32_zip_nthreads ... bench: 12,881,160 ns/iter (+/- 4,175,510)
test read_f16_as_f16_zip_1thread ... bench: 28,832,260 ns/iter (+/- 2,584,587)
test read_f16_as_f32_zip_1thread ... bench: 26,279,960 ns/iter (+/- 2,138,992)
test read_f32_as_f32_uncompressed_1thread ... bench: 17,843,730 ns/iter (+/- 1,008,381)
test read_f32_as_u32_uncompressed_1thread ... bench: 17,952,880 ns/iter (+/- 2,185,665)
test read_f32_as_f16_uncompressed_1thread ... bench: 17,965,450 ns/iter (+/- 2,524,674)
test read_f32_as_f32_zips_nthreads ... bench: 26,873,920 ns/iter (+/- 3,032,381)
test read_f32_as_f16_zips_nthreads ... bench: 26,641,840 ns/iter (+/- 2,400,515)
test read_f32_as_f32_zips_1thread ... bench: 101,547,150 ns/iter (+/- 6,313,799)
test read_f32_as_f16_zips_1thread ... bench: 100,998,820 ns/iter (+/- 6,737,638)
```

Previously (without SIMD batching, but with intrinsic conversions):

```
test read_f16_as_f16_uncompressed_1thread ... bench: 13,896,960 ns/iter (+/- 1,866,398)
test read_f16_as_u32_uncompressed_1thread ... bench: 13,760,660 ns/iter (+/- 583,555)
test read_f16_as_f32_uncompressed_1thread ... bench: 13,805,060 ns/iter (+/- 1,905,708)
test read_f16_as_f16_zip_nthreads ... bench: 14,468,520 ns/iter (+/- 1,170,083)
test read_f16_as_f32_zip_nthreads ... bench: 14,479,990 ns/iter (+/- 4,490,935)
test read_f16_as_f16_zip_1thread ... bench: 29,224,890 ns/iter (+/- 1,293,434)
test read_f16_as_f32_zip_1thread ... bench: 29,319,380 ns/iter (+/- 826,762)
test read_f32_as_f32_uncompressed_1thread ... bench: 30,926,660 ns/iter (+/- 2,187,303)
test read_f32_as_u32_uncompressed_1thread ... bench: 30,900,850 ns/iter (+/- 4,375,285)
test read_f32_as_f16_uncompressed_1thread ... bench: 30,854,990 ns/iter (+/- 1,294,175)
test read_f32_as_f32_zips_nthreads ... bench: 48,464,580 ns/iter (+/- 7,056,668)
test read_f32_as_f16_zips_nthreads ... bench: 48,596,240 ns/iter (+/- 5,171,012)
test read_f32_as_f32_zips_1thread ... bench: 113,928,800 ns/iter (+/- 7,434,780)
test read_f32_as_f16_zips_1thread ... bench: 113,377,860 ns/iter (+/- 5,173,657)
```
(sorry for not merging yet, I'm abusing this branch to fix the GitHub workflow. The CI should have caught the MSRV-breaking change, but it is broken, apparently)
Fixed it - now the only question is whether we want to go 2.0.0 and Rust 1.70.0 for this...
As it stands, `Cargo.lock` does require Rust 1.70 but `Cargo.toml` does not, meaning that downstream users are free to configure the library to use an older `half` with an older MSRV if they need to. I think that's a fair balance. It would be nice to call it out in the README.
If we allow `half = "2.3"`, we should also raise our own `rust-version` to 1.70.0 in the `Cargo.toml`, right? Do you mean we do that, and also hint at a workaround? The workaround being that our users specify an older version of `half` and can then compile using rustc 1.59? This makes sense, I didn't think of that, good idea :)
You can put `half = "2.2"` into `Cargo.toml`, so when someone adds `exrs` as a dependency it will default to the latest version at the time of installation (currently 2.3.1), but will also allow downgrading to 2.2 if users need that for MSRV reasons.
And just don't put `rust-version` in there, I guess, so you could have a `Cargo.lock` for development with the latest `half` for best performance, and also in case anyone wants to run the benchmarks on the repo.
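In `Cargo.toml` terms, the idea would look roughly like this (a sketch of the suggestion above, not the actual exrs manifest):

```toml
[dependencies]
# resolves to the newest compatible release (currently 2.3.1),
# but lets MSRV-constrained users pin back to 2.2.x
half = "2.2"
```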
The `Cargo.lock` is no longer in the repo, as is suggested for Rust libraries. But anyway, the plan still sounds good. I'll find out what the MSRV in the `Cargo.toml` actually means, and decide whether to put the Rust version into the `Cargo.toml`. Thanks for your help with this PR :)
Actually, let's merge all of this except for the version upgrade of `half`, then release a minor version with the small performance improvements, and then release 2.0.0 with the new version of `half`, including the new intrinsics and a new MSRV.

The reason being that the batching alone gives us a 10% speed improvement (measured with intrinsics active, assuming it will also be relevant without intrinsics).

Sorry for all the discussion and all the strategy changes :D
`cargo-msrv verify` succeeds on my local machine... CI seems to be broken still.
I am convinced that bumping semver for MSRV reasons alone is a bad idea, because then several crates using `exr` all have to manually upgrade in tandem. If one uses `exr` 1.x and another uses 2.x, they can no longer interoperate; and anyone can upgrade to 2.x only when everyone has upgraded to 2.x, splitting the ecosystem. Please don't release 2.x just because of an MSRV change.
(and also fix the missing documentation on the round-up division)
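For context, round-up integer division is presumably the classic idiom below; this is a hypothetical sketch, and the actual helper in the PR may be named differently:

```rust
/// Integer division that rounds up (towards positive infinity).
/// Beware that `dividend + divisor - 1` can overflow for very large inputs.
fn div_round_up(dividend: usize, divisor: usize) -> usize {
    (dividend + divisor - 1) / divisor
}

fn main() {
    assert_eq!(div_round_up(10, 4), 3); // 10 / 4 = 2.5, rounded up to 3
    assert_eq!(div_round_up(8, 4), 2);  // exact division stays exact
}
```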
@Shnatsel, will this approach work in terms of optimization? I had to add a few copy operations for technical reasons.
To do:
- `half` dependency