Implement adjustable test threshold

rpspringuel commented 6 years ago

Adds a new user settable parameter THRESHOLD which determines how strict the comparison has to be to pass an image comparison test. THRESHOLD takes on values of 0 to 1 with a 1 being a perfect pixel-by-pixel match. To facilitate this new setting, test of comparison result metric vs. THRESHOLD is now done using bc for floating point precision. This replaces previous coercion of value to integer (round to 3 decimal places and multiply by 1000) which enabled bash's internal integer comparison operators to do the test. The default value of 0.9985 is designed to imitate the previous test limit of 999 (it's the smallest value that would be coerced to 999).

rpspringuel commented 6 years ago

There are also a couple of tests which are updated in this PR.

gabc-output/glyphs/pitches was actually failing at 300dpi (but not at 150dpi, where I usually run my tests) before the change and thus continued to fail with the default value for THRESHOLD

gabc-output/glyphs/broken_shapes -- This test changed due to the new oriscus direction logic, but the change is only to one glyph and thus was missed before(see gregorio-project/gregorio#1412). A THRESHOLD value of 0.998632 was required to catch the difference at 300dpi. The difference could also be caught at 150dpi, but would require a THRESHOLD of 0.998747.

henryso commented 6 years ago

I think THRESHOLD is too general and should be something more specific like IMAGE_COMPARE_THRESHOLD. Have you tried comparing same version of TeX Live across different operating systems? That's the reason compare is using a fuzzy number (i.e., not a 100% match). Also realize that calling out bc is around 100 times slower than using something built-in to the scripting language (and around 10 times slower than something that uses floating point rather than infinite precision registers), which due to your machine, was the reason I didn't use bc in the first place.

rpspringuel commented 6 years ago

I think THRESHOLD is too general and should be something more specific like IMAGE_COMPARE_THRESHOLD.

Easy enough to change.

Have you tried comparing same version of TeX Live across different operating systems? That's the reason compare is using a fuzzy number (i.e., not a 100% match).

Not yet. I didn't have time for that last night. Given that the new test imitates the old test at the default setting, I don't expect a problem.

Also realize that calling out bc is around 100 times slower than using something built-in to the scripting language (and around 10 times slower than something that uses floating point rather than infinite precision registers), which due to your machine, was the reason I didn't use bc in the first place.

Didn't realize that and didn't really notice it when testing the change. It may be that the earlier fix to allow parallel processing (which was a 4 fold increase in speed for me) and the fact that I've now got a new, more powerful machine has made me less sensitive to processor time. Still, I'll do some testing on that front too.

henryso commented 6 years ago

If you didn't notice a performance problem, then that's good enough for me. Maybe one day, the harness could be rewritten in a scripting language that can access libraries to do these things in-process to save some time.

rpspringuel commented 6 years ago

Okay, I just did some basic performance testing on the use of bc vs. a pure bash solution (taken from the same question where I got the bc method) for comparing the numbers. As you indicated bc is about 100 times slower than pure bash but my new machine only takes about 30s to run 10,000 of these comparisons. We have 141 image comparison tests at the moment, so the rough processor time spent on the bc comparisons is 0.45s if all the tests are run sequentially. With multiple processors running 4 tests in parallel, we're talking about 0.11s. That strikes me as being an unnoticeable gain now to reduce that by a factor of 100.

Next up, cross platform testing.

rpspringuel commented 6 years ago

Okay, other than the test fails that came from running the 5.1.1 tests against 5.0.2, testing worked on Ubuntu.

gregorio-project / gregorio-test

Implement adjustable test threshold #325