sul-dlss / content_search

IIIF Content Search API implementation for OCR in DOR

Analysis around ALTO tools for QA and confidence assurance #8

Closed jkeck closed 6 years ago

jkeck commented 6 years ago

What's the minimum OCR confidence we're willing to index into our search service? What analysis tools exist (e.g. an ALTO overlay tool)?

cbeer commented 6 years ago

See WC and CC properties of strings: https://www.loc.gov/standards/alto/techcenter/layout.html.

anarchivist commented 6 years ago

There are also PC and ACCURACY attributes on page elements.

However, I'm not sure that we're able to determine this easily unless we know the project specs -- this feels more like a service management concern. For example, for VT, we have a high level of accuracy in the project specs, but the ALTO itself doesn't directly specify the accuracy levels for this project.
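
As a rough illustration of what the ALTO itself can tell us, here is a minimal sketch that reads the `PC` attribute on `Page` and the `WC` attribute on `String` elements and summarizes them. The sample document, the `confidence_summary` helper, and the `min_wc=0.6` cutoff are all assumptions made up for this example, not project values, and many ALTO files omit these attributes entirely, so missing scores are counted rather than treated as zero.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical ALTO fragment for illustration only; real files vary by vendor.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" PC="0.90">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Stanford" WC="0.98"/>
            <String CONTENT="Univcrsity" WC="0.52"/>
            <String CONTENT="Libraries"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def confidence_summary(alto_xml, min_wc=0.6):
    """Summarize page- and word-level confidence found in an ALTO document.

    min_wc is an assumed threshold, not a recommended value.
    """
    root = ET.fromstring(alto_xml)
    page_pc = None
    word_confidences = []
    unscored = 0

    for element in root.iter():
        name = local_name(element.tag)
        if name == "Page" and page_pc is None and element.get("PC") is not None:
            page_pc = float(element.get("PC"))
        elif name == "String":
            wc = element.get("WC")
            if wc is None:
                unscored += 1  # no word confidence recorded for this string
            else:
                word_confidences.append(float(wc))

    return {
        "page_pc": page_pc,
        "scored_words": len(word_confidences),
        "unscored_words": unscored,
        "mean_wc": mean(word_confidences) if word_confidences else None,
        "low_confidence_words": sum(1 for wc in word_confidences if wc < min_wc),
    }


if __name__ == "__main__":
    print(confidence_summary(SAMPLE_ALTO))
```

A report like this could feed the OCR inventory, but it only reflects what the OCR engine reported about itself; it does not replace the accuracy targets in the project specs.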

anarchivist commented 6 years ago

For ALTO overlay tools:

anarchivist commented 6 years ago

@caaster and I met today to talk through this, and we agree that for the most part this is a service management concern.

Specific actions:

  1. Ticket and facilitate a conversation about when the content search box should or should not appear in UV within sul-embed. - sul-dlss/universalviewer#47
  2. Proceed with an investigation of what types of OCR we have in SDR.
  3. Make a concrete set of recommendations regarding service management: the types of OCR we want to support for autoindexing, confirmation that qualitative metrics are not a blocker to autoindexing, and potential curators who would be good partners for testing. (relies on 2)
  4. Define how we do a soft rollout of this feature with existing OCR. (relies on 1 and 3)
  5. Discuss content search as a new service to be added to our service portfolio, including identifying a service manager, a PO, etc., and identify a way to manage/field questions around legacy OCR. (relies on all the above)