Closed kpp closed 6 years ago
Hmm, it feels to me that +/- 1 LSB should be quite acceptable for the purposes of font rendering, and that allowing slight differences in roundoff is a reasonable tradeoff for faster computation. It also seems to me that the use of FP arithmetic basically makes it impossible to bit-exact match the scalar code in all cases. Would you consider adapting the test to allow small errors?
What is LSB?
And do you think that it your case [25, 76, 153, 255]
should be equal to [26, 76, 153, 255]
? Or [25, 76]
should be equal to [26, 76]
with input [0.1, 0.2]
?
Least significant bit. Here's a random Stack Overflow thread that discusses the fact that it's bad practice to directly compare floats for equality; this is equally true when doing a floating point comptutation.
Obviously [25, 76, 153, 255]
is not equal to [26, 76, 153, 255]
, but I'm saying that this pair of values should pass the test, ie the standard should be 'equality within epsilon' rather than simply equality.
Alright. [25]
should be equal to [26]
. The previous computation is * 255.0
. In this case I would say that in simple and straightforward impl acc was 25/255 = 0.098
while in C version it was 26/255 = 0.101
. It does not look like they are 'equal within epsilon'. See https://doc.rust-lang.org/std/f32/constant.EPSILON.html
As you can see, I changed C version a little bit:
__m128 y = _mm_andnot_ps(sign_mask, x); // fabs(x)
y = _mm_min_ps(y, _mm_set1_ps(1.0f));
- y = _mm_mul_ps(y, _mm_set1_ps(255.0f));
+ y = _mm_mul_ps(y, _mm_set1_ps(254.5000077f)); // WTF
__m128i z = _mm_cvtps_epi32(y);
z = _mm_shuffle_epi8(z, mask);
And it fixed 6 of 7 tests.
You'd better revise your code. Or you may rewrite it using https://github.com/AdamNiederer/faster.
@AdamNiederer we got unit tests for backward compatibility. Would you be so kind to help us migrate from C to you faster
library?
Sure, I'll take a look sometime this week.
That doesn't sound like the correct fix, multiplying by 255 is what I expect the logic to do, this constant makes no sense. As I stated before, I think the correct thing to do is write the test so it can accept a delta of +/- 1 between the two results.
And of course I'd love to see a pure-Rust simd version :)
I added more tests. Unfortunately fuzzer
test segfaults... 🤣 As always.
That doesn't sound like the correct fix, multiplying by 255 is what I expect the logic to do
Alright. I reverted your * 255.0
. It gives us 0 of 17 success tests.
But! I found a bug in your code:
(255.0 * y) as u8 // floor()
y = _mm_mul_ps(y, _mm_set1_ps(255.0f));
__m128i z = _mm_cvtps_epi32(y); // round()
To fix that you have to use _mm_cvttps_epi32
instead of _mm_cvtps_epi32
. It gives us 16 of 17 success tests. fuzzer
passes without segfaults in 1 case of 20 runs.
Another solution will be:
- (255.0 * y) as u8
+ (255.0 * y).round() as u8
And that gives us:
running 17 tests
test accumulate::tests::max_255_from_0_015625 ... ok
test accumulate::tests::max_255_from_0_0625 ... ok
test accumulate::tests::max_255_from_0_03125 ... ok
test accumulate::tests::max_255_from_0_125 ... ok
test accumulate::tests::max_255_from_0_5 ... ok
test accumulate::tests::max_255_from_0_25 ... ok
test accumulate::tests::max_255_from_0_0078125 ... ok
test accumulate::tests::simple_0 ... ok
test accumulate::tests::max_255_from_1_0 ... ok
test accumulate::tests::simple_1 ... ok
test accumulate::tests::simple_2 ... FAILED
test accumulate::tests::simple_5 ... FAILED
test accumulate::tests::simple_3 ... FAILED
test accumulate::tests::simple_4 ... FAILED
test accumulate::tests::simple_6 ... FAILED
test accumulate::tests::simple_7 ... FAILED
fuzzer segfault
Which one do you prefer?
It looks like faster
isn't very well-suited to this type of computation, as you're doing a ton of bitcasting and arbitrary shuffling to get rid of the data dependency on acc
and downcast the vector of f32s
to u8s
. I could probably provide a faster-stdsimd hybrid implementation, but it will still be pretty ugly and unsafe.
@BurntSushi Hi.
With RUST_LOG="quickcheck/info"
tests do not segfault(I know that I should have written RUST_LOG="quickcheck=info"
but I made a typo), but fuzzer finds a bug in 3 cases of 100 runs.
And without RUST_LOG
env var set or with RUST_LOG="quickcheck=info"
tests segfault 19 times of 20 runs right here:
#[test]
fn fuzzer() {
QuickCheck::new().quickcheck(test_accumulate as fn(Vec<f32>));
}
Here is the link to Travis build without RUST_LOG
env set: https://travis-ci.org/google/font-rs/builds/335348076?utm_source=github_status&utm_medium=notification
Here is the link to Travis build with RUST_LOG="quickcheck/info"
: https://travis-ci.org/google/font-rs/builds/335349903?utm_source=github_status&utm_medium=notification
Would you please help us investigate what is going on?
but it will still be pretty ugly and unsafe.
@AdamNiederer that is fine if your implementation will pass tests)))
ready to merge
Thanks for updating this. I have some thoughts about how I might want to tweak rounding and testing, but after thinking about it and then not having the energy to dig in on this PR, I think the best thing to do is merge this and then try to do followup later.
Describing exactly what I want the rounding behavior to be is not trivial. I want 0 and 255 to be reliable (ie I don't want to see 254/255 toggling in filled areas because it's right on a rounding threshold). I also want the code (whether scalar or simd) to be (a) the fastest possible with up to 1 lsb of error, and (b) within the class of equivalently fast code, as accurate as possible. That's why the SIMD code rounded rather than truncating (changing the rounding mode is free) and the scalar code truncated (adding the additional 0.5 to round would not be free, though perhaps I'd need to take a closer look at the asm to figure out whether that's really true). Ideally the tests would accommodate both behaviors.
I expect we might revisit this now that SIMD is standardizing. The main point of this note is to document that I don't consider the current tests an ironclad contract, but something that can be adapted to accommodate higher performance.
Thanks for moving this forward, and sorry for the long delays in reviewing.
Move
fn accumulate
into a separate mod.Add tests for fn accumulate.