Add new fields to cocina models for OCR

peetucket commented 6 months ago

Need to add new fields to cocina-models

See https://docs.google.com/document/d/1ScvlgCI-YhyV2LaaDLTVOWTvuGe4bM1tmgzVbAHTDw8/edit#heading=h.r2urdb2h3frx

and

https://docs.google.com/document/d/1ADOY6Mr9pwVf2EUr2wb-dt8_KVUBxMSksrrcY5Qqi5o/edit#heading=h.8u9choka43a

Here is an example object with OCR files attached with the "transcription" role: https://argo-stage.stanford.edu/view/druid:qv402bt5465

We will also need to know where the OCR came from (manually created or auto-generated) and whether it was corrected by a human after being auto-generated (so that we can prevent overwriting manually corrected OCR).

Task is to:

decide where in cocina this goes: are these additional top-level attributes on the File struct, or do they belong in a group (like access)?
add to cocina-models
release new cocina models
update OpenAPI

justinlittman commented 6 months ago

Can you say more?

peetucket commented 6 months ago

Can you say more?

Yes, adding more detail to ticket description now

peetucket commented 6 months ago

@andrewjbtw

We are thinking of adding these new cocina file attributes as described in document/d/1ScvlgCI-YhyV2LaaDLTVOWTvuGe4bM1tmgzVbAHTDw8/edit#heading=h.og29aoqammz0 (e.g. sdrGenerated and manuallyCorrected) at the level of the File in cocina, e.g.

It may be more helpful to further clarify the names of the attributes, e.g.

sdrGeneratedText or something like that? maybe a term that applies equally well to both OCR and transcription?

justinlittman commented 6 months ago

I'd suggest putting it in its own section within file rather than at file level.

peetucket commented 6 months ago

How about, at the file level:

textExtraction:
   sdrGenerated: true/false (default: false)
   manuallyCorrected: true/false (default: false)

The entire block is optional.

We will know if this refers to OCR or transcription based on the use value, which will be "transcription" or "caption". This use value is displayed as Role in Argo in the Contents section.
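A minimal sketch of the proposed shape, written as Python dataclasses purely for illustration. The class names `TextExtraction` and `File` and the helper `has_extracted_text` are assumptions made for this sketch and are not the cocina-models API; only the attribute names, defaults, and the two use values come from the proposal above.

```python
from dataclasses import dataclass
from typing import Optional

# Use values that mark a file as holding extracted text (from the proposal above).
TEXT_USES = {"transcription", "caption"}


@dataclass
class TextExtraction:
    # Both flags default to False, as proposed.
    sdrGenerated: bool = False
    manuallyCorrected: bool = False


@dataclass
class File:
    # Hypothetical, heavily trimmed stand-in for a cocina File.
    filename: str
    use: Optional[str] = None
    # The entire block is optional, so it defaults to None.
    textExtraction: Optional[TextExtraction] = None


def has_extracted_text(file: File) -> bool:
    """Return True when the use value marks the file as a transcription or caption."""
    return file.use in TEXT_USES


ocr_file = File(
    filename="page1.xml",
    use="transcription",
    textExtraction=TextExtraction(sdrGenerated=True),
)
image_file = File(filename="page1.jp2")
```

Here `ocr_file` carries the block with `manuallyCorrected` left at its default, while `image_file` omits the block entirely.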

andrewjbtw commented 6 months ago

What if an item has:

one captioned video and one OCR'd PDF
two caption files, in different languages, and only one of the languages has been corrected

peetucket commented 6 months ago

The use attribute and this new block go with the File, so each file can have different values. Should we also add a "language" attribute to the textExtraction block while we are at it?

The fact that it goes with the file should allow for those situations you outlined above, since each new language will be in its own file?

We currently don't have any object-level text extraction attributes, so in order to know whether text has been extracted for anything within the object, you would need to go digging into all of the files. This is probably still OK to keep this way unless we have a reason to also indicate this at the object level (i.e. at least one file has a use of caption or transcription).
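A rough sketch of what "digging into all of the files" could look like, assuming the object is available as a plain dict in which file sets and their files are nested under `structural.contains`. That nesting, the function names, and the sample object are assumptions made for this illustration.

```python
from typing import Iterator

# Use values that indicate extracted text, per the discussion above.
TEXT_USES = {"transcription", "caption"}


def iter_files(cocina_object: dict) -> Iterator[dict]:
    """Yield every file dict nested under the object's file sets (assumed layout)."""
    for file_set in cocina_object.get("structural", {}).get("contains", []):
        for file in file_set.get("structural", {}).get("contains", []):
            yield file


def object_has_extracted_text(cocina_object: dict) -> bool:
    """Return True if at least one file has a use of caption or transcription."""
    return any(f.get("use") in TEXT_USES for f in iter_files(cocina_object))


# Hypothetical sample: one file set holding a video and its caption file.
sample = {
    "structural": {
        "contains": [
            {
                "structural": {
                    "contains": [
                        {"filename": "qf378nj5000.mp4"},
                        {"filename": "qf378nj5000_cap.vtt", "use": "caption"},
                    ]
                }
            }
        ]
    }
}
```

An object-level flag would only be worth adding if this kind of scan turns out to be too slow or too awkward for the systems that need the answer.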

andrewjbtw commented 6 months ago

Sorry - I misread above about file vs object level.

We added language to the model last fall to support captions. I'm not sure if that should get pulled into this structure or stay where it is. This druid is an example where we have captions in two different languages: https://argo.stanford.edu/view/qf378nj5000

jcoyne commented 6 months ago

How about putting these as a subclass of https://cocina.sul.stanford.edu/models/file and using the existing type attribute? Do we need to make these properties separate? That is, could we have only sdrGenerated and userProvided as the properties? If so, would the administrative schema be a place for them? Would we want to track the version of software they were generated with, so we could query only certain values for re-generation?

How do we handle multiple languages? Do we need to consider translations?

peetucket commented 6 months ago

Can you say more about what you mean by subclass of https://cocina.sul.stanford.edu/models/file and using the existing type attribute? Right now each file's type attribute is always https://cocina.sul.stanford.edu/models/file, I think. These new properties tell us additional information about the file itself, in addition to the existing attributes (like size, mimetype, etc.).

I see the languageTag attribute at the file level for https://argo.stanford.edu/view/qf378nj5000 for the qf378nj5000_cap.vtt and qf378nj5000_spa_cap.vtt files. That is working as I'd expect (each language in its own file with its own languageTag).

We could add these new sdrGenerated and manuallyCorrected attributes at the same level instead of creating a new group. This would keep them in the same place as the existing languageTag attribute.

jcoyne commented 6 months ago

Can you say more about what you mean by subclass of https://cocina.sul.stanford.edu/models/file and using the existing type attribute? Right now each file's type attribute is always https://cocina.sul.stanford.edu/models/file, I think.

Since we control the model, there is nothing stopping us from defining a URL like https://cocina.sul.stanford.edu/models/generatedTranscription which we define as a subclass of https://cocina.sul.stanford.edu/models/file.

These new properties tell us additional information about the file itself, in addition to the existing attributes (like size, mimetype, etc.).

I don't think these properties have much use outside of our management systems. Thus administrative would be a place to put them.

I see the languageTag attribute at the file level for https://argo.stanford.edu/view/qf378nj5000 for the qf378nj5000_cap.vtt and qf378nj5000_spa_cap.vtt files. That is working as I'd expect (each language in its own file with its own languageTag).

That is used for captions, but I don't think we did the modeling very well there. We can't tell if this is a transcript or a translation using that string. This drives the video viewer, which only needs a very simple model. I think this is broken when multiple languages are being spoken.

What happens if you OCR this?

peetucket commented 6 months ago

Since we control the model, there is nothing stopping us from defining a URL like https://cocina.sul.stanford.edu/models/generatedTranscription which we define as a subclass of https://cocina.sul.stanford.edu/models/file.

OK, so this is a new value for the "type" attribute. What about the manuallyCorrected?

These new properties tell us additional information about the file itself, in addition to the existing attributes (like size, mimetype, etc.).

I don't think these properties have much use outside of our management systems. Thus administrative would be a place to put them.

These attributes go with each file though, while administrative is at the object level? They may be used by future systems (like an editing interface)?

I see the languageTag attribute at the file level for https://argo.stanford.edu/view/qf378nj5000 for the qf378nj5000_cap.vtt and qf378nj5000_spa_cap.vtt files. That is working as I'd expect (each language in its own file with its own languageTag).

That is used for captions, but I don't think we did the modeling very well there. We can't tell if this is a transcript or a translation using that string. This drives the video viewer, which only needs a very simple model. I think this is broken when multiple languages are being spoken.

What happens if you OCR this?

Don't know but I think each OCR text file will be for a single language?

jcoyne commented 6 months ago

OK, so this is a new value for the "type" attribute. What about the manuallyCorrected?

Make another like: https://cocina.sul.stanford.edu/models/manuallyCreatedTranscript perhaps? Do we need to differentiate between manuallyCorrected and user supplied? I thought we would treat both of those cases identically.

These attributes go with each file though, while administrative is at the object level? They may be used by future systems (like an editing interface)?

There is a file level administrative schema too.

Don't know but I think each OCR text file will be for a single language?

👍

andrewjbtw commented 6 months ago

I'm not sure how we should record multi-lingual docs, to be honest. We're just getting one XML document from ABBYY.

peetucket commented 6 months ago

OK, so this is a new value for the "type" attribute. What about the manuallyCorrected?

Make another like: https://cocina.sul.stanford.edu/models/manuallyCreatedTranscript perhaps? Do we need to differentiate between manuallyCorrected and user supplied? I thought we would treat both of those cases identically.

These attributes go with each file though, while administrative is at the object level? They may be used by future systems (like an editing interface)?

There is a file level administrative schema too.

I would prefer adding some extra file-level administrative metadata (e.g. in the same place as the preserve/publish/shelve attributes) over changing the file's type. I mean, it's ultimately still a file that will live on stacks; it just happens to be a file that has OCR of some kind, right?

justinlittman commented 6 months ago

We don't make use of subclasses for anything else, right? It just seems like the sort of thing that we would forget / overlook.

jcoyne commented 6 months ago

Yes, I'm not totally convinced that is the right way, but certainly an option. That way we can validate that only certain types of files ought to have these properties, but as @justinlittman says, it might be a challenge.
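To make the validation idea concrete, here is a small sketch under stated assumptions: files are plain dicts, the flags sit directly on the file, and the `generatedTranscription` type URL is the hypothetical one floated earlier in this thread rather than an existing cocina type. The function name `validation_errors` is likewise made up for this example.

```python
from typing import List

FILE_TYPE = "https://cocina.sul.stanford.edu/models/file"
# Hypothetical subclass URL proposed in this thread; it does not exist today.
GENERATED_TYPE = "https://cocina.sul.stanford.edu/models/generatedTranscription"

# Only these types may carry the text extraction flags (assumption for this sketch).
TYPES_ALLOWING_FLAGS = {GENERATED_TYPE}
FLAG_NAMES = ("sdrGenerated", "manuallyCorrected")


def validation_errors(file: dict) -> List[str]:
    """Return one error per flag found on a file whose type does not allow it."""
    errors = []
    file_type = file.get("type")
    if file_type not in TYPES_ALLOWING_FLAGS:
        for name in FLAG_NAMES:
            if name in file:
                errors.append(f"{name} is not allowed on type {file_type}")
    return errors


plain_file = {"type": FILE_TYPE, "sdrGenerated": True}
generated_file = {"type": GENERATED_TYPE, "manuallyCorrected": True}
```

With this rule `plain_file` produces one error and `generated_file` produces none. The cost is the one raised above: every consumer that matches on the file type would need to know about the new subclass.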

dnoneill commented 6 months ago

If ABBYY creates a single PDF for all files how are we supposed to know when one page is corrected? Is the ALTO for each page going to be preserved and that is what gets corrected? Or is it going to be the ALTO for the druid? I don't feel like I have a good grasp on how this will work.

peetucket commented 6 months ago

If ABBYY creates a single PDF for all files how are we supposed to know when one page is corrected? Is the ALTO for each page going to be preserved and that is what gets corrected? Or is it going to be the ALTO for the druid? I don't feel like I have a good grasp on how this will work.

I'm not sure we have a good idea of how manually corrected OCR will work right now. There will likely need to be an editor that allows the user to access all of the OCR for all of the pages and that, under the hood, will change individual ALTO XML files. How this gets reflected back into the single PDF with all of the OCR is not known right now.

Which raises a good question: should we consider treating attributes like "sdrGenerated" or "manuallyCorrected" at the object level instead of the file level, since this is how we will be running OCR? In other words, if OCR is run through our automated workflow, it will either run OCR for the whole object or not at all; it won't do only some individual pages (because this precludes a full PDF coming out).

And likewise, if a user manually corrects some of the OCR, that implies the entire object should not be automatically OCRed in the future.
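The all-or-nothing rule described here could be sketched as a single check before the workflow runs. This assumes file-level flags stored in an optional `textExtraction` block on plain dicts, as proposed earlier in the thread; the function name `may_auto_ocr` is hypothetical.

```python
from typing import List


def may_auto_ocr(files: List[dict]) -> bool:
    """Return False if any file in the object records a manual correction.

    OCR is assumed to run for the whole object or not at all, so a single
    corrected file blocks automated OCR for every file in the object.
    """
    for file in files:
        # The block is optional, so treat a missing block as all defaults.
        block = file.get("textExtraction") or {}
        if block.get("manuallyCorrected", False):
            return False
    return True


untouched = [
    {"filename": "page1.xml", "textExtraction": {"sdrGenerated": True}},
    {"filename": "page2.xml", "textExtraction": {"sdrGenerated": True}},
]
corrected = [
    {"filename": "page1.xml", "textExtraction": {"sdrGenerated": True}},
    {
        "filename": "page2.xml",
        "textExtraction": {"sdrGenerated": True, "manuallyCorrected": True},
    },
]
```

If the flags moved to the object level instead, the same decision would reduce to reading one value, which is part of the appeal of that option.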

andrewjbtw commented 6 months ago

We don't have current plans to correct OCR. But if we did correct it, we'd need to reaccession the set of OCR files so that they all match. If you just correct one page, you still need to regenerate the whole PDF to match that correction.

sul-dlss / cocina-models

Add new fields to cocina models for OCR #705