raphlinus / font-rs

Apache License 2.0
755 stars 49 forks source link

Tests for accumulate #22

Closed kpp closed 6 years ago

kpp commented 6 years ago

Move fn accumulate into a separate mod.

Add tests for fn accumulate.

raphlinus commented 6 years ago

Hmm, it feels to me that +/- 1 LSB should be quite acceptable for the purposes of font rendering, and that allowing slight differences in roundoff is a reasonable tradeoff for faster computation. It also seems to me that the use of FP arithmetic basically makes it impossible to bit-exact match the scalar code in all cases. Would you consider adapting the test to allow small errors?

kpp commented 6 years ago

What is LSB?

And do you think that it your case [25, 76, 153, 255] should be equal to [26, 76, 153, 255]? Or [25, 76] should be equal to [26, 76] with input [0.1, 0.2]?

raphlinus commented 6 years ago

Least significant bit. Here's a random Stack Overflow thread that discusses the fact that it's bad practice to directly compare floats for equality; this is equally true when doing a floating point comptutation.

Obviously [25, 76, 153, 255] is not equal to [26, 76, 153, 255], but I'm saying that this pair of values should pass the test, ie the standard should be 'equality within epsilon' rather than simply equality.

kpp commented 6 years ago

Alright. [25] should be equal to [26]. The previous computation is * 255.0. In this case I would say that in simple and straightforward impl acc was 25/255 = 0.098 while in C version it was 26/255 = 0.101. It does not look like they are 'equal within epsilon'. See https://doc.rust-lang.org/std/f32/constant.EPSILON.html

kpp commented 6 years ago

As you can see, I changed C version a little bit:

     __m128 y = _mm_andnot_ps(sign_mask, x);  // fabs(x)
     y = _mm_min_ps(y, _mm_set1_ps(1.0f));
-    y = _mm_mul_ps(y, _mm_set1_ps(255.0f));
+    y = _mm_mul_ps(y, _mm_set1_ps(254.5000077f)); // WTF
     __m128i z = _mm_cvtps_epi32(y);
     z = _mm_shuffle_epi8(z, mask);

And it fixed 6 of 7 tests.

You'd better revise your code. Or you may rewrite it using https://github.com/AdamNiederer/faster.

@AdamNiederer we got unit tests for backward compatibility. Would you be so kind to help us migrate from C to you faster library?

AdamNiederer commented 6 years ago

Sure, I'll take a look sometime this week.

raphlinus commented 6 years ago

That doesn't sound like the correct fix, multiplying by 255 is what I expect the logic to do, this constant makes no sense. As I stated before, I think the correct thing to do is write the test so it can accept a delta of +/- 1 between the two results.

And of course I'd love to see a pure-Rust simd version :)

kpp commented 6 years ago

I added more tests. Unfortunately fuzzer test segfaults... 🤣 As always.

That doesn't sound like the correct fix, multiplying by 255 is what I expect the logic to do

Alright. I reverted your * 255.0. It gives us 0 of 17 success tests.

But! I found a bug in your code:

    (255.0 * y) as u8 // floor()
    y = _mm_mul_ps(y, _mm_set1_ps(255.0f));
    __m128i z = _mm_cvtps_epi32(y); // round()

To fix that you have to use _mm_cvttps_epi32 instead of _mm_cvtps_epi32. It gives us 16 of 17 success tests. fuzzer passes without segfaults in 1 case of 20 runs.

Another solution will be:

-            (255.0 * y) as u8
+            (255.0 * y).round() as u8

And that gives us:

running 17 tests
test accumulate::tests::max_255_from_0_015625 ... ok
test accumulate::tests::max_255_from_0_0625 ... ok
test accumulate::tests::max_255_from_0_03125 ... ok
test accumulate::tests::max_255_from_0_125 ... ok
test accumulate::tests::max_255_from_0_5 ... ok
test accumulate::tests::max_255_from_0_25 ... ok
test accumulate::tests::max_255_from_0_0078125 ... ok
test accumulate::tests::simple_0 ... ok
test accumulate::tests::max_255_from_1_0 ... ok
test accumulate::tests::simple_1 ... ok
test accumulate::tests::simple_2 ... FAILED
test accumulate::tests::simple_5 ... FAILED
test accumulate::tests::simple_3 ... FAILED
test accumulate::tests::simple_4 ... FAILED
test accumulate::tests::simple_6 ... FAILED
test accumulate::tests::simple_7 ... FAILED
fuzzer segfault

Which one do you prefer?

AdamNiederer commented 6 years ago

It looks like faster isn't very well-suited to this type of computation, as you're doing a ton of bitcasting and arbitrary shuffling to get rid of the data dependency on acc and downcast the vector of f32s to u8s. I could probably provide a faster-stdsimd hybrid implementation, but it will still be pretty ugly and unsafe.

kpp commented 6 years ago

@BurntSushi Hi.

With RUST_LOG="quickcheck/info" tests do not segfault(I know that I should have written RUST_LOG="quickcheck=info" but I made a typo), but fuzzer finds a bug in 3 cases of 100 runs. And without RUST_LOG env var set or with RUST_LOG="quickcheck=info" tests segfault 19 times of 20 runs right here:

    #[test]
    fn fuzzer() {
        QuickCheck::new().quickcheck(test_accumulate as fn(Vec<f32>));
    }

Here is the link to Travis build without RUST_LOG env set: https://travis-ci.org/google/font-rs/builds/335348076?utm_source=github_status&utm_medium=notification Here is the link to Travis build with RUST_LOG="quickcheck/info": https://travis-ci.org/google/font-rs/builds/335349903?utm_source=github_status&utm_medium=notification

Would you please help us investigate what is going on?

kpp commented 6 years ago

but it will still be pretty ugly and unsafe.

@AdamNiederer that is fine if your implementation will pass tests)))

kpp commented 6 years ago

ready to merge

raphlinus commented 6 years ago

Thanks for updating this. I have some thoughts about how I might want to tweak rounding and testing, but after thinking about it and then not having the energy to dig in on this PR, I think the best thing to do is merge this and then try to do followup later.

Describing exactly what I want the rounding behavior to be is not trivial. I want 0 and 255 to be reliable (ie I don't want to see 254/255 toggling in filled areas because it's right on a rounding threshold). I also want the code (whether scalar or simd) to be (a) the fastest possible with up to 1 lsb of error, and (b) within the class of equivalently fast code, as accurate as possible. That's why the SIMD code rounded rather than truncating (changing the rounding mode is free) and the scalar code truncated (adding the additional 0.5 to round would not be free, though perhaps I'd need to take a closer look at the asm to figure out whether that's really true). Ideally the tests would accommodate both behaviors.

I expect we might revisit this now that SIMD is standardizing. The main point of this note is to document that I don't consider the current tests an ironclad contract, but something that can be adapted to accommodate higher performance.

Thanks for moving this forward, and sorry for the long delays in reviewing.