w3c / apa

APA WG deliverables have been moved to individual repositories

[SAUR] Claims about ASR accuracy #229

Closed by nigelmegitt 2 years ago

nigelmegitt commented 2 years ago

In SAUR 2.2.2, Captions in Live Media, there's a section about subtitle latency in meetings that references ASR as a technique and repeats some claims about its accuracy.

ASR accuracy is highly dependent on many factors, and we should be very careful about how accuracy claims are presented. For general use, where there may be non-speech background sounds, ASR accuracy is very far below human accuracy, though the latency may be better. In the specific case of meetings, better results may be more likely. It is also important for non-hearing participants that they see transcribed non-speech sounds, and few ASR systems generate captions for those. Finally, it should be noted that few ASR systems adjust the positional placement of subtitles or captions over the video, which can result in important parts of the video image being obscured.

My request is to be even clearer about the caveats when using ASR: consider even removing the accuracy text altogether and instead focusing only on latency, since the SAUR is only about synchronisation.

As an editorial matter, "Artificial Intelligence" is a loaded term: in the industry, few people consider that it exists; rather, "Machine Learning" is probably a more useful term.

RealJoshue108 commented 2 years ago

Thanks for that, @nigelmegitt. We will discuss in Research Questions - note we meet on Weds at 2pm (UK time) if you ever want to join.

Also +1 from me on AI as a term, I personally prefer 'automated machine reasoning' as more accurate nomenclature.

RealJoshue108 commented 2 years ago

Note - regarding the AI comment - I'm not offering this as a suggestion for the document, though I'm happy to discuss. This is more my own personal preference. I don't like the term artificial intelligence anyway, as I generally think it is inaccurate.

RealJoshue108 commented 2 years ago

In the recent RQTF call, @raja supported removing the discussion around accuracy.

jasonjgw commented 2 years ago

The view which appears to have emerged from the meeting today is that, as suggested in this issue, we should carefully articulate the trade-offs between latency and accuracy in connection with the use of ASR. It was also noted that the apparent latency advantage of ASR principally arises in live rather than prerecorded media - exactly the circumstances in which background sounds and inadequate audio quality limit its accuracy.

RealJoshue108 commented 2 years ago

@jasonjgw do you want to have a crack at that in the document?

jasonjgw commented 2 years ago

I have added text describing the factors that can influence ASR accuracy (with a reference to a recent paper). Also, an editors' note has been inserted indicating that we are continuing to work on the issue.

I don't think it can be resolved in time for publication of the first public working draft. For instance, in the paper that I found, the authors suggest that word error rate is not necessarily the appropriate measure of ASR accuracy outside of dictation applications. In the live captioning case, there is no opportunity for errors to be corrected (except after the fact), a quite different scenario from typical dictation systems.

For now, I think we should make any further adjustments that Task Force participants request at the meeting this week, then revisit the issue following publication of the draft.
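For readers unfamiliar with the metric mentioned above, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the SAUR or the referenced paper, and real evaluations usually normalise case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration only; whitespace tokenisation
    with no case or punctuation normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


reference = "the meeting is not cancelled"
dropped_negation = "the meeting is cancelled"      # meaning reversed
wrong_article = "a meeting is not cancelled"       # meaning preserved

print(word_error_rate(reference, dropped_negation))  # 0.2
print(word_error_rate(reference, wrong_article))     # 0.2
```

Both hypothetical outputs score the same, although only one of them reverses the meaning of the sentence. This is one way to see why a single word error rate figure may say little about how usable live captions are.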

RealJoshue108 commented 2 years ago

@Steve-Noble thinks this can be closed - already addressed

Steve-Noble commented 2 years ago

Confirmed. Revised language to address these comments was added to the 28 September 2021 version of the SAUR working draft. This Issue can now be closed.