w3ctag / design-reviews

W3C specs and API reviews
Creative Commons Zero v1.0 Universal

WICG Shape Detection API #176

Closed · yellowdoge closed this issue 6 years ago

yellowdoge commented 7 years ago

Hello TAG!

I'm requesting a TAG review of:

Further details (optional):

We'd prefer the TAG provide feedback as (please select one):

cynthia commented 7 years ago

From a quick skim, it seems like this is a wrapper spec for this API: https://developers.google.com/vision/

This seems like an interesting addition to the platform, but it also seems a bit risky in terms of implementation consistency across different UAs - and for platforms without a native API to wrap against, this would mean implementing it within the browser. For consistency reasons, having a reference native library implementation would probably make the adoption across implementations smoother. The "fast mode" bit only notes using some form of a speed-accuracy tradeoff algorithm, which I think amplifies the feature consistency risk even further.
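To make that knob concrete, here is a minimal sketch of how it surfaces to authors, assuming the FaceDetector interface and the fastMode/maxDetectedFaces options as currently drafted (illustrative only):

```js
// Minimal sketch, assuming the FaceDetector interface and the fastMode /
// maxDetectedFaces options as drafted in the WICG spec.
async function findFaces(image) {
  if (!('FaceDetector' in window)) return [];   // UA has no implementation
  const detector = new FaceDetector({ fastMode: true, maxDetectedFaces: 5 });
  // What exactly "fast" trades away in accuracy is left to the UA, which is
  // where the cross-implementation consistency risk comes in.
  const faces = await detector.detect(image);
  return faces.map(face => face.boundingBox);
}
```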

Whether this would be an issue in practice (e.g. browser X detects faces better than browser Y) is unclear - but it does seem like it could push content developers to perform UA sniffing and redirect users to proven/tested implementations if detection performance is worse on certain implementations.

One bug I noticed concerns QR codes: according to the canonical specification (the Denso Wave standard), they can contain binary data, so a string type may not be appropriate for the raw data.
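To illustrate, a sketch assuming the drafted BarcodeDetector/DetectedBarcode shapes, where rawValue is a DOMString (the ArrayBuffer-valued alternative mentioned in the comments is hypothetical, not in the current draft):

```js
// Sketch assuming the drafted BarcodeDetector / DetectedBarcode shapes,
// where rawValue is a DOMString.
async function readQRCodes(image) {
  const detector = new BarcodeDetector();
  const barcodes = await detector.detect(image);
  for (const { rawValue } of barcodes) {
    console.log(rawValue);  // fine for textual payloads
    // A QR code in byte mode can carry arbitrary binary data; forcing it
    // through a string loses or mangles bytes unless an encoding is defined.
    // Something like an ArrayBuffer-valued field (hypothetical, not in the
    // current draft) would be needed for a lossless round-trip.
  }
}
```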

It seems more natural for features like the ones defined in this spec to exist as a library rather than as a built-in feature. However, the web is missing the raw building blocks (e.g. BLAS) for scientific computing, which makes me wonder whether those are what the platform will need for such libraries to exist.

yellowdoge commented 7 years ago

@cynthia thanks for taking the time to review the spec.

Re. the concern about platform-specificity, it is true that the spec mechanics look like the GMSCore vision API, but that's just because it's an example of a broader and older family of APIs, also implemented by e.g. Android AOSP, Mac and Windows 10. AOSP in particular has an implementation based on an open-source Neven detector dating from ~2008. All these APIs work in similar ways and return remarkably similar results because they are based on the work on Haar cascade classifiers for object detection, which has been in e.g. OpenCV since at least the early 2000s; there are reasons to believe that hardware ISP manufacturers have been using similar cascade techniques, e.g. for handheld and smartphone cameras, for at least 10 years as well. In JS-land, there are several alternatives providing face detection using the same underlying classifiers; see the comparison here (spoiler: they are much slower than the OS/hardware versions).

IOW, leaving the performance argument aside, there are plenty of nearly indistinguishable implementations. This makes it possible to provide polyfills that should make it unnecessary to integrate e.g. OpenCV into the browser. As a matter of fact, I started the polyfill effort and finished the barcode/QR code part.
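As an illustration of that fallback story, a sketch of feature detection with a polyfill behind it (the module path and export name are placeholders, not the actual published polyfill):

```js
// Sketch of feature detection with a polyfill fallback; the module path and
// export name are placeholders, not the actual published polyfill.
async function getBarcodeDetector() {
  if ('BarcodeDetector' in window) {
    return new BarcodeDetector();          // native, typically OS/HW-backed
  }
  // Fall back to a script implementation of the same well-known techniques,
  // at a performance cost.
  const { BarcodeDetectorPolyfill } = await import('./barcode-detector-polyfill.js');
  return new BarcodeDetectorPolyfill();
}
```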

The same arguments apply to barcode/QR code and text detection: well-known and established methods, widely implemented in a variety of open-source libraries, and made available to native apps by the OS for performance reasons. Here we're just offering them to the Web.

Note that the three detectors currently detailed were chosen because of their widespread support (several different implementations/OSes) and clear use cases, which precludes bringing arbitrary detectors (e.g. cat detectors) wholesale to the web.

cynthia commented 7 years ago

@yellowdoge Apologies for sitting on this for so long.

I brought this up in a call quite a while ago - the follow-up took way too long. The raw minutes are here: https://pad.w3ctag.org/p/2017-05-30-minutes.md

We think this is a great addition to the platform - it is really a question of how to ship it while making it widely adoptable without requiring too much work (from the implementor's perspective), and of thinking about extensibility (probably for level 2 of the standard) so users can use these APIs as building blocks. We would be more than happy to discuss the next steps after this ships - I believe the web would welcome building blocks for machine learning and computer vision, but that is a large undertaking, so let's leave that discussion outside the scope of this review.

As for the performance argument, WebAssembly should most likely improve the situation, though probably not to native levels. The other bit is that matrix support in JS is missing, and that does not seem like something we will see shipping soon, not to mention that native implementations can even delegate the operations to dedicated hardware or DSPs/GPUs. So yes, it is unlikely that a pure JS implementation will ever beat native performance.

I understand your arguments about Viola-Jones. It is a more or less stable approach, and given that it's fed relatively similar data, it should render fairly similar results. Barcode and QR have fairly established methods too, so those shouldn't be a problem. QR codes with binary data could be a problem with the spec as it stands, as noted above. I will file a bug on this, along with some other minor editorial bits.

Text is tricky. Especially since the API defines detected text to be available as a DOMString, this could be quite a bit of work to implement. I spent some time looking at the differences between the platform APIs across OS implementations; it seems that for text, iOS/macOS is missing the actual text recognition bits, which is something the browser would need to provide. (And even if support for this gets added later, it won't be available on older OS versions.)
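For concreteness, a sketch assuming the drafted TextDetector/DetectedText shapes (illustrative, not the normative IDL), showing which half some platforms can provide natively and which half the browser would have to fill in:

```js
// Sketch assuming the drafted TextDetector / DetectedText shapes.
async function extractText(image) {
  const detector = new TextDetector();
  const regions = await detector.detect(image);
  return regions.map(region => ({
    box: region.boundingBox,   // localization: what e.g. macOS can provide natively
    text: region.rawValue      // recognition (OCR): the part some platforms lack
  }));
}
```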

Language support in text detection is another tricky topic - different implementations will most likely have different capabilities and accuracy (not only for text detected in a natural scene, but also for complex scripts [e.g. CJK] and RTL languages) - and I haven't seen open-source libraries that provide reasonable performance for multiple languages out of the box. I'm wondering whether it would be better to tackle the two easy ones first, and discuss with other implementers what they are willing to ship for the harder one (text, in this case).

yellowdoge commented 7 years ago

> We think this is a great addition to the platform - it is really a question of how to ship it while making it widely adoptable without requiring too much work (from the implementor's perspective), and of thinking about extensibility (probably for level 2 of the standard) so users can use these APIs as building blocks. We would be more than happy to discuss the next steps after this ships - I believe the web would welcome building blocks for machine learning and computer vision, but that is a large undertaking, so let's leave that discussion outside the scope of this review.

Acknowledged!

> As for the performance argument, WebAssembly should most likely improve the situation, though probably not to native levels. The other bit is that matrix support in JS is missing, and that does not seem like something we will see shipping soon, not to mention that native implementations can even delegate the operations to dedicated hardware or DSPs/GPUs. So yes, it is unlikely that a pure JS implementation will ever beat native performance.

Agree.

> I understand your arguments about Viola-Jones. It is a more or less stable approach, and given that it's fed relatively similar data, it should render fairly similar results. Barcode and QR have fairly established methods too, so those shouldn't be a problem. QR codes with binary data could be a problem with the spec as it stands, as noted above. I will file a bug on this, along with some other minor editorial bits.

Done, at least the binary vs text one: https://github.com/WICG/shape-detection-api/issues/35

> Text is tricky. Especially since the API defines detected text to be available as a DOMString, this could be quite a bit of work to implement. I spent some time looking at the differences between the platform APIs across OS implementations; it seems that for text, iOS/macOS is missing the actual text recognition bits, which is something the browser would need to provide. (And even if support for this gets added later, it won't be available on older OS versions.)

That's correct: Mac provides only the bounding boxes, but not the result of any OCR inside them. (Whereas Android and Win10 do seem to support OCR; see the links in the Example section of the spec.)

I guess in this case developers should rely on polyfills, probably using Tesseract -- but you had some concerns about its performance beyond pure document scanning use cases, right?
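As a rough idea of what that polyfill path might look like (a sketch only, assuming tesseract.js is loaded; its exact API varies across versions):

```js
// Sketch of a Tesseract.js-based fallback for the recognition step; assumes
// tesseract.js is already loaded and uses its simple recognize() entry point.
async function recognizeTextFallback(image) {
  const { data } = await Tesseract.recognize(image, 'eng');
  return data.text;
}
```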

> Language support in text detection is another tricky topic - different implementations will most likely have different capabilities and accuracy (not only for text detected in a natural scene, but also for complex scripts [e.g. CJK] and RTL languages) - and I haven't seen open-source libraries that provide reasonable performance for multiple languages out of the box. I'm wondering whether it would be better to tackle the two easy ones first, and discuss with other implementers what they are willing to ship for the harder one (text, in this case).

I've never had first-hand experience with text detection on non-Latin-based languages, but I know that the Android implementation doesn't work well with either Hanzi or Katakana. Are you proposing treating Face+Barcode and Text differently?

Aside from this last remark, I understand from this discussion that the spec looks good TAG-wise? (Notwithstanding specific issues to be filed.)

cynthia commented 7 years ago

> That's correct: Mac provides only the bounding boxes, but not the result of any OCR inside them. (Whereas Android and Win10 do seem to support OCR; see the links in the Example section of the spec.)

> I guess in this case developers should rely on polyfills, probably using Tesseract -- but you had some concerns about its performance beyond pure document scanning use cases, right?

That is correct. The raster image one would acquire from a web application is more likely to be something from a camera than from a flatbed scanner, and will most likely have perspective distortion. I am not against the API, but I would like to tread carefully - especially with a high-level, feature-specific API like this.

> Are you proposing treating Face+Barcode and Text differently?

Yes. To move the spec forward sooner, I would actually propose moving text out into a separate spec, or into the next major revision of the standard, both to reduce implementer workload for conformance and to ship a better standard.

> Aside from this last remark, I understand from this discussion that the spec looks good TAG-wise? (Notwithstanding specific issues to be filed.)

Yes, indeed. Thanks for bringing this to our attention!

cynthia commented 6 years ago

Taken up at the Nice F2F. Considering that all of our concerns have been raised on your side, I believe the review can be considered done unless there are any significant design changes. (If that does happen, please re-open this issue and scream in my general direction.)

Thank you for working with us!