johannesvollmer / exrs

100% Safe Rust OpenEXR file library

batched f16 conversion #191

Closed johannesvollmer closed 1 year ago

johannesvollmer commented 1 year ago

(and also fix round up division missing documentation)
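For illustration, the round-up division in question is the usual integer idiom; a minimal sketch (the helper name here is hypothetical, not necessarily the one used in the crate):

/// Integer division that rounds up instead of down, e.g. for computing
/// how many fixed-size batches are needed to cover `sample_count` samples.
fn div_round_up(sample_count: usize, batch_size: usize) -> usize {
    (sample_count + batch_size - 1) / batch_size
}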

@Shnatsel will this approach work in terms of optimization? i had to add a few copy operations for technical reasons

to do:

johannesvollmer commented 1 year ago

Before merging I want to have unit tests for that function, and I want to clean it up, deduplicate the code, and make it more idiomatic Rust.

johannesvollmer commented 1 year ago

Also, I wonder whether the batch size of 4 allows the compiler to optimize away all of the chunking logic in HalfFloatSliceExt::convert_from_f32_slice. Unfortunately, we need to define a fixed batch size in order to avoid allocating a temporary buffer on the heap.
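For context, a minimal sketch of the batching idea, assuming half's HalfFloatSliceExt::convert_from_f32_slice as the per-batch conversion (the function below is illustrative, not the exact implementation in this PR):

use half::f16;
use half::slice::HalfFloatSliceExt;

// A fixed batch size keeps the scratch buffers on the stack
// instead of requiring a temporary heap allocation.
const BATCH_SIZE: usize = 4;

/// Convert a stream of f32 samples to f16, handing each full
/// (or final partial) batch to `half` at once so that it can
/// use its vectorized conversion internally.
fn write_f32_as_f16(samples: impl Iterator<Item = f32>, output: &mut Vec<f16>) {
    let mut f32_batch = [0.0_f32; BATCH_SIZE];
    let mut f16_batch = [f16::ZERO; BATCH_SIZE];
    let mut filled = 0;

    for sample in samples {
        f32_batch[filled] = sample;
        filled += 1;

        if filled == BATCH_SIZE {
            f16_batch.as_mut_slice().convert_from_f32_slice(&f32_batch);
            output.extend_from_slice(&f16_batch);
            filled = 0;
        }
    }

    // convert whatever is left over in the last, partially filled batch
    if filled > 0 {
        f16_batch[..filled].convert_from_f32_slice(&f32_batch[..filled]);
        output.extend_from_slice(&f16_batch[..filled]);
    }
}

Whether the compiler can see through the chunking and fully vectorize this likely depends on the batch size and on whether half's intrinsics are available.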

Shnatsel commented 1 year ago

Benchmarks: with half = {git = "https://github.com/starkat99/half-rs", features = ["use-intrinsics"]} in Cargo.toml I'm seeing the conversion time drop so much that it becomes unnoticeable, indistinguishable from the regular f32 to f32 codepath.

The f32 to u32 path doesn't get any faster, however. Maybe autovectorization doesn't kick in. I've tried with -C target-cpu=native for comparison, no difference.

Shnatsel commented 1 year ago

ARM has native instructions for casting from f32 to unsigned integers (details). This would explain why I'm seeing good results for f32 to u32 conversion on ARM, even without this change.

I haven't found any mentions of native f32 to u32 casts on x86. ChatGPT mentioned some but I think it's making it up, and when I corrected it, it said this:

(screenshot of the ChatGPT reply, dated 2023-01-07, not reproduced here)

Or here's a human discussion of a similar conversion (albeit to u8): https://stackoverflow.com/questions/29856006/sse-intrinsics-convert-32-bit-floats-to-unsigned-8-bit-integers

Shnatsel commented 1 year ago

This also regresses the fallback path on half 2.2.0 but not on the latest git; we will need to switch to latest half so that we don't introduce regressions when intrinsics are not available.

Shnatsel commented 1 year ago

Yeah, there seems to be no native conversion from f32 to u32 on x86_64 without AVX-512: https://stackoverflow.com/questions/41144668/how-to-efficiently-perform-double-int64-conversions-with-sse-avx

There is a conversion from f32 to i32, but that would truncate large numbers that fit into u32 but not i32, and therefore produce incorrect results (although u32 cannot represent all f32 values in the first place).
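For reference, Rust's scalar float-to-integer casts have been saturating since Rust 1.45, so a correct scalar fallback is just a plain cast; a minimal illustration, not code from this PR:

// Rust's `as` cast saturates: values above u32::MAX clamp to u32::MAX,
// while negative values and NaN become 0. Without AVX-512 there is no
// single x86-64 instruction with this behavior, which limits autovectorization.
fn f32_to_u32(sample: f32) -> u32 {
    sample as u32
}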

Why is u32 the chosen format, anyway?

johannesvollmer commented 1 year ago

Thanks for the feedback! Conversions from and to u32 are there only for completeness. I don't expect them to ever happen in the real world, so we should not worry about inferior performance here at all. As you said, the conversion is not even accurate for large numbers.

johannesvollmer commented 1 year ago

I'm seeing the conversion time drop so much that it becomes unnoticeable, indistinguishable from the regular f32 to f32 codepath.

Awesome, didn't expect that much of a speed up!

johannesvollmer commented 1 year ago

Do we have any regression concerning f32 -> f32?

johannesvollmer commented 1 year ago

Why is u32 the chosen format, anyway?

The purpose of u32 samples is to assign unique IDs to different areas of the image. The common case, RGBA images, is either full f32 or full f16.

In the cases where u32 is used, it is certainly deliberate, and the data is not converted to any float type.

Shnatsel commented 1 year ago

Do we have any regression concerning f32 -> f32?

Nope, it's exactly the same on my machine. I guess the buffer fits entirely into the L1 cache; it's not big.

johannesvollmer commented 1 year ago

Added a to-do list in the PR text. Anything else to add to that list?

Shnatsel commented 1 year ago

The necessary changes to half have shipped in version 2.2.1

johannesvollmer commented 1 year ago

The code is not pretty, but it will be good enough for the moment. If there is no performance regression, I will merge it. @Shnatsel do you want to make sure it is still properly optimized after the refactorings?

Shnatsel commented 1 year ago

You should see the difference on cargo bench yourself, but sure, I can take a look over the weekend.

johannesvollmer commented 1 year ago

Yes, I was planning to see the benchmarks, sure :)

johannesvollmer commented 1 year ago

Don't worry, take the time you need, does not have to be this weekend :)

johannesvollmer commented 1 year ago

Strangely, on my machine there is no performance gain whatsoever; within the margin of error, it is the same, if not worse. Note that this is the first time I have benchmarked the code for this PR. All benchmarks used the newest version of half; I only pulled the function implementation from different commits. I ran cargo bench on a Windows 11 machine with an 8th-gen Intel i7 and the most up-to-date rustc.

# no batch conversion
read_image_rgba_f32_to_f32 ... bench:  19,554,980 ns/iter (+/- 2,367,993)
read_image_rgba_f32_to_f16 ... bench:  40,168,210 ns/iter (+/- 1,880,333)
read_image_rgba_f32_to_u32 ... bench:  23,609,890 ns/iter (+/- 2,525,909)

# with batch conversion, dirty
read_image_rgba_f32_to_f32 ... bench:  19,643,090 ns/iter (+/- 1,126,170)
read_image_rgba_f32_to_f16 ... bench:  45,213,290 ns/iter (+/- 2,912,926)
read_image_rgba_f32_to_u32 ... bench:  26,500,130 ns/iter (+/- 2,043,967)

# with batch conversion, refactored
read_image_rgba_f32_to_f32 ... bench:  19,653,510 ns/iter (+/- 2,083,299)
read_image_rgba_f32_to_f16 ... bench:  43,848,780 ns/iter (+/- 2,427,347)
read_image_rgba_f32_to_u32 ... bench:  26,658,960 ns/iter (+/- 3,115,560)
johannesvollmer commented 1 year ago

If there is something that the library user (or I) needs to do to get good performance, this is the perfect time for some documentation :)

Some flags anywhere? Maybe target-cpu set to native? Maybe a feature flag in half?

Shnatsel commented 1 year ago

Yes, you need to enable the use-intrinsics feature in half for performance. It currently requires nightly, but will be stabilized in the upcoming Rust 1.68.

It's possible to make the 64-bit ARM implementation work on stable, but that's not yet merged into half: https://github.com/starkat99/half-rs/issues/63
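For anyone following along, the interim setup described above amounts to roughly this in a downstream Cargo.toml (the version number is taken from the comments in this thread):

[dependencies]
# requires a nightly compiler until the underlying intrinsics stabilize
half = { version = "2.2", features = ["use-intrinsics"] }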

Shnatsel commented 1 year ago

Latest code by default:

test read_image_rgba_f32_to_f16 ... bench:  38,017,728 ns/iter (+/- 132,337)

With the use-intrinsics feature:

test read_image_rgba_f32_to_f16 ... bench:  14,419,196 ns/iter (+/- 146,359)

Also there's no need to set -Ctarget-cpu=native, since half performs runtime feature detection.

Shnatsel commented 1 year ago

ARM support is coming in the next half release that will require Rust 1.69: https://github.com/starkat99/half-rs/issues/63#issuecomment-1399847097

johannesvollmer commented 1 year ago

I propose we introduce a feature flag to exr. The feature flag will enable use-intrinsics in half, but may be used for other purposes in the future. This way, people can still use the newest library with a slightly older compiler. Maybe a flag named nightly? Or "bleeding-edge"?

Shnatsel commented 1 year ago

I suggest we simply wait for the next half release and switch to it once it's released. I understand it won't require any additional features.

I don't think exposing an exr feature that's only going to be relevant until Rust 1.69 ships is a good idea. Better to document the interim workaround of depending on half directly and enabling the feature there.

Shnatsel commented 1 year ago

For context, half is going to do a major version bump with the stabilization of x86 intrinsics and the addition of ARM intrinsics. I don't think there's a way to select a major version of a library depending on a feature.

Shnatsel commented 1 year ago

Speaking of, since it will take until Rust 1.69 to reap the benefits of this change on stable, it might be better to make a new release without this PR.

We can merge it immediately afterwards, and cut a new release once Rust 1.69 has shipped.

johannesvollmer commented 1 year ago

Do I understand correctly that the newest Rust version is only required when we enable the use-intrinsics flag in half? In that case, if we leave the decision up to the user, it will be fine to just update the half dependency and let the people who have access to nightly Rust enable the flag.

Shnatsel commented 1 year ago

Yes, you understand correctly - that is the behavior of the currently released 2.2.1 version.

A version requiring 1.69 will ship at a later date and will likely either enable the feature by default or remove it outright. It will also come with a major semver bump in half.

johannesvollmer commented 1 year ago

aight, Rust 1.68 is here now!

you need to enable the use-intrinsics feature in half for performance. It currently requires nightly, but will be stabilized in the upcoming Rust 1.68.

I'll try to find out whether we need a new release of half for the stable feature flag

Shnatsel commented 1 year ago

I believe stabilization of the required intrinsics got pushed back to 1.69, let me double-check that.

johannesvollmer commented 1 year ago

Yes, I don't think I saw any of the intrinsics mentioned in the 1.68 changelog.

Shnatsel commented 1 year ago

So the stabilization PR is https://github.com/rust-lang/stdarch/pull/1366

It clearly didn't ship in 1.68

Shnatsel commented 1 year ago

half 2.3.1 got rid of the feature flag and enabled the use of intrinsics by default, so no changes to half feature flags are needed on our side. This PR should be good to go, just needs unit tests.

johannesvollmer commented 1 year ago

Great news, now we finally see the expected speedups on stable without setting a flag :) Nice.

cargo bench pixel format conversions
new read_image_rgba_f32_to_f16 ... bench:  18,413,400 ns/iter (+/- 2,459,747)
old read_image_rgba_f32_to_f16 ... bench:  29,149,950 ns/iter (+/- 3,587,253)

These big speedups apply under the following conditions:

In other situations, the speedups will be smaller, if any.

The other formats seem to have no significant regression:

new test read_image_rgba_f32_to_f32 ... bench:  20,100,990 ns/iter (+/- 2,852,151)
old test read_image_rgba_f32_to_f32 ... bench:  19,364,750 ns/iter (+/- 2,143,720)

new test read_image_rgba_f32_to_u32 ... bench:  26,969,190 ns/iter (+/- 2,512,176)
old test read_image_rgba_f32_to_u32 ... bench:  26,253,080 ns/iter (+/- 3,488,182)
johannesvollmer commented 1 year ago

Thanks for the review :)

I forced half to the newest version, to avoid regressions, and added a unit test. Ready to merge. Any objections?

Shnatsel commented 1 year ago

I would require half 2.2 but put 2.3.1 in the Cargo.lock to avoid breaking existing users by enforcing a higher MSRV, and add a note to the README about performance.

IMO a slower f16 conversion is not as critical as the code failing to build in the first place.

Shnatsel commented 1 year ago

no compression used

I believe I changed those tests to read to f32 rather than f16, which is why you're not seeing a speedup there. If you change them back to f16 you will see a 9,000,000 ns/iter difference there too, because the format conversion is single-threaded.

Shnatsel commented 1 year ago

I don't have a strong opinion on bumping half, however.

I have no objections to merging this PR.

johannesvollmer commented 1 year ago

okay :)

One more observation about the benchmarks: reading an image with f16 to f32 conversion is now as fast as reading f32 to f32 (no conversion). Awesome!

johannesvollmer commented 1 year ago

Good point about the MSRV. I didn't know that it changed. This repo has a CI test that checks that the MSRV is still valid; it seems that check doesn't work correctly.

johannesvollmer commented 1 year ago

Added more benchmarks; everything still looks as expected. Neat!

test read_f16_as_f16_uncompressed_1thread ... bench:  11,665,580 ns/iter (+/- 2,527,286)
test read_f16_as_u32_uncompressed_1thread ... bench:  11,732,710 ns/iter (+/- 957,454)
test read_f16_as_f32_uncompressed_1thread ... bench:  11,661,750 ns/iter (+/- 716,012)
test read_f16_as_f16_zip_nthreads         ... bench:  13,345,020 ns/iter (+/- 1,558,281)
test read_f16_as_f32_zip_nthreads         ... bench:  12,881,160 ns/iter (+/- 4,175,510)
test read_f16_as_f16_zip_1thread          ... bench:  28,832,260 ns/iter (+/- 2,584,587)
test read_f16_as_f32_zip_1thread          ... bench:  26,279,960 ns/iter (+/- 2,138,992)

test read_f32_as_f32_uncompressed_1thread ... bench:  17,843,730 ns/iter (+/- 1,008,381)
test read_f32_as_u32_uncompressed_1thread ... bench:  17,952,880 ns/iter (+/- 2,185,665)
test read_f32_as_f16_uncompressed_1thread ... bench:  17,965,450 ns/iter (+/- 2,524,674)
test read_f32_as_f32_zips_nthreads        ... bench:  26,873,920 ns/iter (+/- 3,032,381)
test read_f32_as_f16_zips_nthreads        ... bench:  26,641,840 ns/iter (+/- 2,400,515)
test read_f32_as_f32_zips_1thread         ... bench: 101,547,150 ns/iter (+/- 6,313,799)
test read_f32_as_f16_zips_1thread         ... bench: 100,998,820 ns/iter (+/- 6,737,638)

previously (without SIMD batching, but with intrinsic conversions)

test read_f16_as_f16_uncompressed_1thread ... bench:  13,896,960 ns/iter (+/- 1,866,398)
test read_f16_as_u32_uncompressed_1thread ... bench:  13,760,660 ns/iter (+/- 583,555)
test read_f16_as_f32_uncompressed_1thread ... bench:  13,805,060 ns/iter (+/- 1,905,708)
test read_f16_as_f16_zip_nthreads         ... bench:  14,468,520 ns/iter (+/- 1,170,083)
test read_f16_as_f32_zip_nthreads         ... bench:  14,479,990 ns/iter (+/- 4,490,935)
test read_f16_as_f16_zip_1thread          ... bench:  29,224,890 ns/iter (+/- 1,293,434)
test read_f16_as_f32_zip_1thread          ... bench:  29,319,380 ns/iter (+/- 826,762)

test read_f32_as_f32_uncompressed_1thread ... bench:  30,926,660 ns/iter (+/- 2,187,303)
test read_f32_as_u32_uncompressed_1thread ... bench:  30,900,850 ns/iter (+/- 4,375,285)
test read_f32_as_f16_uncompressed_1thread ... bench:  30,854,990 ns/iter (+/- 1,294,175)
test read_f32_as_f32_zips_nthreads        ... bench:  48,464,580 ns/iter (+/- 7,056,668)
test read_f32_as_f16_zips_nthreads        ... bench:  48,596,240 ns/iter (+/- 5,171,012)
test read_f32_as_f32_zips_1thread         ... bench: 113,928,800 ns/iter (+/- 7,434,780)
test read_f32_as_f16_zips_1thread         ... bench: 113,377,860 ns/iter (+/- 5,173,657)
johannesvollmer commented 1 year ago

(sorry for not merging yet, I'm abusing this branch to fix the GitHub workflow. The CI should have caught the MSRV-breaking change, but apparently it is broken)

johannesvollmer commented 1 year ago

Fixed it - now the only question is whether we want to go 2.0.0 and Rust 1.70.0 for this...

Shnatsel commented 1 year ago

As it stands, Cargo.lock does require Rust 1.70 but Cargo.toml does not, meaning that downstream users are free to configure the library to use older half with an older MSRV if they need to. I think that's a fair balance. It would be nice to call it out in the README.

johannesvollmer commented 1 year ago

If we allow half = "2.3", we should also raise our own rust-version = 1.70.0 in the Cargo.toml, right? Do you mean we do that, and also hint at a workaround? The workaround being that our users specify an older version of half and can then compile using rustc 1.59? This makes sense, I didn't think of that, good idea :)

Shnatsel commented 1 year ago

You can put half = "2.2" into Cargo.toml, so when someone adds exrs as a dependency it will default to the latest version available at the time of installation (currently 2.3.1), but will also allow downgrading to 2.2 if that is needed by users for MSRV reasons.

And just don't put rust-version in there, I guess, so you can have a Cargo.lock for development with the latest half for best performance, which also helps anyone who wants to run the benchmarks on the repo.
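A sketch of that arrangement in exr's Cargo.toml, assuming 2.2 is the oldest release with the needed changes (the exact minimum version is up to the maintainers):

[dependencies]
# resolves to the newest half (2.3.1 at the time of writing) by default,
# but still permits older 2.2.x releases for users with a lower MSRV
half = "2.2"

Downstream users who need the older MSRV could then pin the transitive dependency in their own project with cargo update -p half --precise 2.2.1.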

johannesvollmer commented 1 year ago

The Cargo.lock is no longer in the repo, as is suggested for Rust libraries. But anyway, the plan still sounds good. I'll find out what the MSRV in the Cargo.toml actually means, and then decide whether to put the Rust version into the Cargo.toml. Thanks for your help with this PR :)

johannesvollmer commented 1 year ago

Actually, let's merge all of this except for the version upgrade of half, then release a version with the small performance improvements, and then release 2.0.0 with the new version of half, including the new intrinsics and a new MSRV.

The reason being that the batching alone gives us a 10% speed improvement (measured with intrinsics active, assuming it will also be relevant without intrinsics).

sorry for all the discussion and for all the strategy changes :D

johannesvollmer commented 1 year ago

cargo-msrv verify succeeds on my local machine... the CI still seems to be broken

Shnatsel commented 1 year ago

I am convinced that bumping semver for MSRV reasons alone is a bad idea, because now several crates using exr have to all manually upgrade in tandem. If one uses exr 1.x and the other uses 2.x they can no longer interoperate; and anyone can upgrade to 2.x only when everyone has upgraded to 2.x, splitting the ecosystem. Please don't release 2.x just because of an MSRV change.