w3c / machine-learning-workshop

Site of W3C Workshop on Web & Machine Learning
https://www.w3.org/2020/06/machine-learning-workshop/

Conformance testing of ML APIs for the Web #80

Open anssiko opened 3 years ago

anssiko commented 3 years ago

@wchao1115 mentions in his talk on DirectML the importance of interoperability across a broad variety of hardware:

The goal of DirectML is to provide the best performance by leveraging the latest hardware features in modern PCs, while providing an implementation that works reliably across different hardware platforms, old and new.

We put together a robust conformance test and driver certification process to ensure a high degree of consistency.

So a model works the same way on any Windows PC.

The web is known for its focus on interoperability, which gives browsers confidence that the software they ship is compatible with other implementations, and that later implementations will remain compatible with theirs.

Web interoperability means browsers must behave in predictable ways across different hardware, platforms, and OSes. The web is arguably the platform with the most diverse installed client base, which in turn has motivated approaches such as Progressive Enhancement / Graceful Degradation, discussed in #68.

A couple of questions:

wchao1115 commented 3 years ago

One of the challenges of conformance testing ML operations is the numerical integrity of computed floating-point results. Unlike fixed-point or integer computation, floating-point results are never exactly definitive; the best one can do is decide whether two computed values are close enough to call them semantically equal. In a problem space such as ML, where the semantic behavior of an operation is defined but the implementation technique is rarely prescribed, the factors that can degrade the integrity of computed values range from the variety of algorithms implementing the compute functions, to genuine implementation bugs in either the operating system or the drivers, and even to hardware tweaks and shortcuts intentionally put in place in the name of performance.
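
To make the "close enough" notion concrete, here is a minimal sketch of the kind of comparison involved; the helper name and the epsilon values are purely illustrative, not taken from any spec:

```ts
// Illustrative only: two floats are treated as "semantically equal" when
// they fall within an absolute epsilon (for values near zero) or a
// relative epsilon (for everything else). The thresholds are made up.
function nearlyEqual(a: number, b: number,
                     absTol = 1e-6, relTol = 1e-5): boolean {
  const diff = Math.abs(a - b);
  if (diff <= absTol) return true; // near-zero case
  return diff <= relTol * Math.max(Math.abs(a), Math.abs(b));
}

console.log(0.1 + 0.2 === 0.3);           // false: exact equality fails
console.log(nearlyEqual(0.1 + 0.2, 0.3)); // true: semantically equal
```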

Compounding this challenge is the inherent nature of computational graph execution in a deep learning network, which tends to accumulate subtle numerical differences as operations are processed one after another. An out-of-order cast, clamp, or accumulate can amplify the error, especially when data types with a smaller precision range, e.g. FP16, are used. A non-conformant hardware architecture that emulates a floating-point data type with fixed-point registers can also easily throw the result of a verification test into an inconclusive state.
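
This accumulation effect is easy to reproduce. A minimal sketch, using Math.fround to simulate a float32 accumulator against a float64 reference (the data and loop are arbitrary; a real FP16 accumulator would drift far more):

```ts
// Illustrative only: round to float32 after every add, as a narrow
// accumulator would, and compare against the float64 sum. Reordering
// the inputs changes the float32 result.
const values = Array.from({ length: 1_000_000 }, (_, i) => 1 / (i + 1));

const f64Sum = values.reduce((acc, v) => acc + v, 0);
const f32Forward = values.reduce(
  (acc, v) => Math.fround(acc + Math.fround(v)), 0);
const f32Reverse = [...values].reverse().reduce(
  (acc, v) => Math.fround(acc + Math.fround(v)), 0);

console.log(f64Sum, f32Forward, f32Reverse);
// The two float32 sums disagree with the reference and with each other.
```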

At the operation level, smaller operations with lower complexity may require much tighter tolerances (fewer ULPs) than larger and more complex operations (e.g. matmul, conv2d, and reductions). The approach we deployed in our conformance test is to carefully review the ULP tolerance for each individual operation and pick a value tight enough to keep that operation close to the ideal result computed by a double-precision reference implementation. We couple that with an ongoing evaluation of the integrity of the conformance suite with each IHV, on a regular basis, across different hardware architectures and driver implementations. It's a very tedious but crucial process with little room for error, and only the drivers that pass this rigorous test can then be certified.
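
For readers unfamiliar with ULP-based comparison, here is a sketch of what measuring a ULP distance for float32 might look like. This is not DirectML's actual implementation, just the standard bit-reinterpretation trick, with NaN and infinity handling omitted for brevity:

```ts
// Map each float32 bit pattern onto a monotonically increasing integer
// scale, so the ULP distance is just the absolute integer difference.
const buf = new ArrayBuffer(4);
const f32 = new Float32Array(buf);
const u32 = new Uint32Array(buf);

function toOrdered(x: number): number {
  f32[0] = x;          // rounds x to float32
  const bits = u32[0]; // raw unsigned bit pattern
  return bits >>> 31
    ? 0x80000000 - (bits & 0x7fffffff) // negative floats, reversed
    : bits + 0x80000000;               // positive floats, shifted up
}

function ulpDistance32(a: number, b: number): number {
  return Math.abs(toOrdered(a) - toOrdered(b));
}

// A per-operation check then reads: "computed is within N ULPs of the
// double-precision reference (rounded to float32)".
function withinUlp(computed: number, reference: number, maxUlp: number): boolean {
  return ulpDistance32(computed, reference) <= maxUlp;
}
```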

In the graphics world there is a saying that no two graphics cards produce pixel-to-pixel parity on the same screen, yet every game can still look and feel the same regardless. Compute is similar: even though no two pieces of graphics hardware produce identical computed values, a model's prediction can still be the same across two different pieces of hardware.

anssiko commented 3 years ago

@wchao1115, thank you for your in-depth explanation of the challenges in conformance testing ML operations.

Given the parallels with graphics APIs, I'm pulling in @Kangz, co-chair of the GPU for the Web Community Group, to comment on conformance testing plans and learnings from the WebGPU API effort, and to help us understand whether there are conformance testing best practices reusable in the context of ML APIs for the Web.

Kangz commented 3 years ago

In GPU land there are similar issues: the WebGPU specification specifies computations that need to happen, but those can have varying hardware implementations and thus varying precision. Right now we're focusing on testing the more "discrete" aspects of the API, since numerical precision is not something we can improve in the browser's WebGPU implementation, and so it is less of an interoperability risk.

However, once we get to testing numerical precision, I expect we'll have tests of atomic features (like the precision of just the viewport computation) with tolerances as defined by the WebGPU specification. Testing the precision of large end-to-end operations is less tractable and less useful because many operations are orthogonal. It might be different for ML, though, if the implementation collapses operations together.
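
As a sketch of what such an atomic-feature precision test could look like (the shape below is hypothetical, not actual WebGPU CTS code; `run` and the tolerance field are stand-ins):

```ts
// Hypothetical test shape: run one small, isolated computation on the
// implementation under test and compare it to an analytically known
// reference, using the per-feature tolerance the spec assigns.
interface PrecisionCase {
  name: string;
  run: () => Promise<number>; // isolated "atomic" computation under test
  reference: number;          // expected value, known in closed form
  relTolerance: number;       // per-feature tolerance from the spec
}

async function checkPrecision(c: PrecisionCase): Promise<void> {
  const actual = await c.run();
  const err = Math.abs(actual - c.reference) /
              Math.max(Math.abs(c.reference), Number.MIN_VALUE);
  if (err > c.relTolerance) {
    throw new Error(`${c.name}: relative error ${err} exceeds ${c.relTolerance}`);
  }
}
```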

So basically I don't have a good answer for testing numerical precision. Otherwise, WebGPU is pretty large, with a lot of tests and interactions between features, so we wrote guidelines for writing the test suite that you might find interesting.

wchao1115 commented 3 years ago

From our experience putting together a conformance suite for DirectML that runs across different hardware implementations, here are some considerations based on what we learned (a sketch combining several of these follows the list):

  1. Consider having two levels of conformance testing -- model-level and operator-level -- for better coverage. Operator-level testing ensures long-term stability, while model-level testing ensures a proper user experience.
  2. Consider defining test tolerances in terms of ULPs (units in the last place) instead of absolute, hardcoded distance values.
  3. Consider defining the test baseline in terms of double-precision values. Any ML operation today will be less precise than a double-precision result, which makes it a good baseline and one that is invariant over time.
  4. Consider different tolerance values for different operations. ML operations vary in complexity, e.g. an element-wise operation is inherently less complex than a reduction or a convolution. Using a single tolerance across operations of a different nature may lead to over-simplification and result in a much weaker conformance test.
  5. Consider basing a tolerance value on the numerical complexity of the operation being tested, not on the results of a specific implementation on a specific hardware platform.
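
As a sketch of how points 2 to 5 could fit together: the operation names and tolerance values below are invented for illustration, real numbers would come from the per-operation review described above, and `ulpDistance32` is the helper sketched in an earlier comment.

```ts
// Per-operation ULP tolerances, checked element-wise against the output
// of a double-precision reference implementation.
const ulpTolerances: Record<string, number> = {
  relu: 0,    // pure selection; should be bit-exact
  add: 1,     // simple element-wise arithmetic
  exp: 2,     // transcendental element-wise op
  conv2d: 16, // long accumulation chains warrant a looser bound
  matmul: 16,
};

function checkConformance(
  op: string,
  computed: Float32Array,  // output of the implementation under test
  reference: Float64Array, // output of the double-precision reference
): boolean {
  const maxUlp = ulpTolerances[op];
  if (maxUlp === undefined) throw new Error(`no tolerance reviewed for ${op}`);
  for (let i = 0; i < computed.length; i++) {
    // ulpDistance32 rounds the float64 reference to float32 internally.
    if (ulpDistance32(computed[i], reference[i]) > maxUlp) return false;
  }
  return true;
}
```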