Closed peetucket closed 5 months ago
Can you say more?
Yes, adding more detail to ticket description now
@andrewjbtw
We are thinking of adding these new cocina file attributes as described in document/d/1ScvlgCI-YhyV2LaaDLTVOWTvuGe4bM1tmgzVbAHTDw8/edit#heading=h.og29aoqammz0 (e.g. sdrGenerated and manuallyCorrected) at the level of the File in cocina, e.g.
It may be more helpful to further clarify the names of the attributes, e.g. sdrGeneratedText or something like that? Maybe a term that applies equally well to both OCR and transcription?
I'd suggest putting it in its own section within file rather than at file level.
How about, at the file level:
textExtraction:
  sdrGenerated: true/false (default: false)
  manuallyCorrected: true/false (default: false)
The entire block is optional.
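To make the proposed shape concrete, here is a minimal sketch of the block as Python dataclasses. The class and field names (TextExtraction, File, text_extraction) are hypothetical and only mirror the proposal above; this is not the actual cocina-models API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextExtraction:
    # Both flags default to False, as proposed above.
    sdr_generated: bool = False
    manually_corrected: bool = False


@dataclass
class File:
    # Hypothetical stand-in for a cocina File; only the relevant fields.
    filename: str
    use: Optional[str] = None  # e.g. "transcription" or "caption"
    # The entire block is optional, so None means "not recorded".
    text_extraction: Optional[TextExtraction] = None


# A file produced by automated OCR, not yet corrected by anyone.
ocr = File("page1.xml", use="transcription",
           text_extraction=TextExtraction(sdr_generated=True))

# An ordinary file with no text extraction block at all.
plain = File("page1.tif")
```

Keeping the block optional means existing files need no migration; only files that carry OCR or transcription output would populate it.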
We will know if this refers to OCR or transcription based on the use value, which will be "transcription" or "caption". This use value is displayed as Role in Argo in the Contents section.
What if an item has:
The use attribute and this new block go with the File, so each file can have different values. Should we also add a "language" attribute to the textExtraction block while we are at it?
The fact that it goes with the file should allow for those situations you outlined above, since each new language will be in its own file?
We currently don't have any object-level text extraction attributes, so in order to know that text has been extracted for anything within the object, you would need to go digging into all of the files. This is probably still OK to keep this way unless we have a reason to also indicate this at the object level (i.e. at least one file has a use of caption or transcription).
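As a rough illustration of that "digging into all of the files", here is a sketch that walks an object's files and checks the use value. It assumes the object is a plain dict shaped like cocina JSON, with file sets under structural.contains and files nested under each file set's structural.contains; the function name has_extracted_text is made up for this example.

```python
# Use values that indicate a file carries extracted text.
TEXT_USES = {"transcription", "caption"}


def has_extracted_text(cocina_object: dict) -> bool:
    """Return True if at least one file has a use of caption or transcription."""
    file_sets = cocina_object.get("structural", {}).get("contains", [])
    for file_set in file_sets:
        for file in file_set.get("structural", {}).get("contains", []):
            if file.get("use") in TEXT_USES:
                return True
    return False


# Illustrative data only; the filenames are invented.
with_ocr = {
    "structural": {
        "contains": [
            {"structural": {"contains": [
                {"filename": "page1.tif"},
                {"filename": "page1.xml", "use": "transcription"},
            ]}}
        ]
    }
}
without_ocr = {
    "structural": {
        "contains": [
            {"structural": {"contains": [{"filename": "page1.tif"}]}}
        ]
    }
}
```

If this check turns out to be needed often, that might be the reason to record something at the object level instead of computing it each time.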
Sorry - I misread above about file vs object level.
We added language to the model last fall to support captions. I'm not sure if that should get pulled into this structure or stay where it is. This druid is an example where we have captions in two different languages: https://argo.stanford.edu/view/qf378nj5000
How about putting these as a subclass of https://cocina.sul.stanford.edu/models/file and using the existing type attribute? Do we need to make these properties separate? That is, could we have only sdrGenerated and userProvided as the properties? If so, would the administrative schema be a place for them? Would we want to track the version of software they were generated with, so we could query only certain values for re-generation?
How do we handle multiple languages? Do we need to consider translations?
Can you say more about what you mean by subclass of https://cocina.sul.stanford.edu/models/file and using the existing type attribute?
Right now each file's type attribute is always https://cocina.sul.stanford.edu/models/file, I think. These new properties tell us additional information about the file itself, in addition to the existing attributes (like size, mimetype, etc).
I see the languageTag attribute at the file level for https://argo.stanford.edu/view/qf378nj5000 for the qf378nj5000_cap.vtt and qf378nj5000_spa_cap.vtt files. That is working as I'd expect (each language in its own file with its own languageTag).
We could add these new sdrGenerated and manuallyCorrected attributes at the same level instead of creating a new group. This would keep them in the same place as the existing languageTag attribute.
Can you say more about what you mean by subclass of https://cocina.sul.stanford.edu/models/file and using the existing type attribute? Right now each file's type attribute is always https://cocina.sul.stanford.edu/models/file, I think.
Since we control the model, there is nothing stopping us from defining a URL like https://cocina.sul.stanford.edu/models/generatedTranscription which we define as a subclass of https://cocina.sul.stanford.edu/models/file.
These new properties tell us additional information about the file itself, in addition to the existing attributes (like size, mimetype, etc).
I don't think these properties have much use outside of our management systems. Thus administrative would be a place to put them.
I see the languageTag attribute at the file level for https://argo.stanford.edu/view/qf378nj5000 for the qf378nj5000_cap.vtt and qf378nj5000_spa_cap.vtt files. That is working as I'd expect (each language in its own file with its own languageTag).
That is used for captions, but I don't think we did the modeling very well there. We can't tell if this is a transcript or a translation using that string. This drives the video viewer, which only needs a very simple model. I think this is broken when multiple languages are being spoken.
What happens if you OCR this?
Since we control the model, there is nothing stopping us from defining a URL like https://cocina.sul.stanford.edu/models/generatedTranscription which we define as a subclass of https://cocina.sul.stanford.edu/models/file.
OK, so this is a new value for the "type" attribute. What about the manuallyCorrected?
I don't think these properties have much use outside of our management systems. Thus administrative would be a place to put them.
These attributes go with each file though, while administrative is at the object level? They may be used by future systems (like an editing interface)?
What happens if you OCR this?
Don't know but I think each OCR text file will be for a single language?
OK, so this is a new value for the "type" attribute. What about the manuallyCorrected?
Make another like: https://cocina.sul.stanford.edu/models/manuallyCreatedTranscript perhaps? Do we need to differentiate between manuallyCorrected and user supplied? I thought we would treat both of those cases identically.
These attributes go with each file though, while administrative is at the object level? They may be used by future systems (like an editing interface)?
There is a file level administrative schema too.
Don't know but I think each OCR text file will be for a single language?
👍
I'm not sure how we should record multi-lingual docs, to be honest. We're just getting one XML document from ABBYY.
I would prefer adding some extra file level administrative metadata (e.g. in the same place as the preserve/publish/shelve attributes) over changing the type of the file that it is. I mean, it's ultimately still a file that will live on stacks, it just happens to be a file that has OCR of some kind, right?
We don't make use of subclasses for anything else, right? It just seems like the sort of thing that we would forget / overlook.
Yes, I'm not totally convinced that is the right way, but certainly an option. That way we can validate that only certain types of files ought to have these properties, but as @justinlittman says, it might be a challenge.
If ABBYY creates a single PDF for all files how are we supposed to know when one page is corrected? Is the ALTO for each page going to be preserved and that is what gets corrected? Or is it going to be the ALTO for the druid? I don't feel like I have a good grasp on how this will work.
I'm not sure we have a good idea of how manually corrected OCR will work right now. There will likely need to be an editor that allows the user to access all of the OCR for all of the pages and that, under the hood, changes individual ALTO XML files. How this gets reflected back into the single PDF with all of the OCR is not known right now.
Which raises a good question: should we consider treating attributes like "sdrGenerated" or "manuallyCorrected" at an object level instead of a file level, since this is how we will be running OCR? In other words, if OCR is run through our automated workflow, it will either run OCR for the whole object or not at all; it won't do some individual pages only (because this precludes a full PDF coming out).
And likewise, if a user manually corrects some of the OCR, that implies the entire object should not be automatically OCRed in the future.
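A sketch of how that all-or-nothing rule could be derived from file-level flags, assuming each file is a dict that may carry the proposed textExtraction block. The function name should_run_auto_ocr is hypothetical, and this is not an existing workflow step.

```python
from typing import Iterable


def should_run_auto_ocr(files: Iterable[dict]) -> bool:
    """Decide at the object level whether automated OCR may (re)run.

    If any file has been manually corrected, skip the whole object so the
    automated workflow cannot overwrite a human's corrections.
    """
    for file in files:
        # The block is optional, so treat a missing block as all defaults.
        block = file.get("textExtraction") or {}
        if block.get("manuallyCorrected", False):
            return False
    return True


# Illustrative data only; the filenames are invented.
untouched = [
    {"filename": "page1.xml",
     "textExtraction": {"sdrGenerated": True, "manuallyCorrected": False}},
    {"filename": "page1.tif"},
]
corrected = untouched + [
    {"filename": "page2.xml",
     "textExtraction": {"sdrGenerated": True, "manuallyCorrected": True}},
]
```

With this approach the flags can stay at the file level while the decision is still made once per object; a single object-level flag would be the simpler alternative if per-file detail is never needed.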
We don't have current plans to correct OCR. But if we did correct it, we'd need to reaccession the set of OCR files so that they all match. If you just correct one page, you still need to regenerate the whole PDF to match that correction.
Need to add new fields to cocina-models
See https://docs.google.com/document/d/1ScvlgCI-YhyV2LaaDLTVOWTvuGe4bM1tmgzVbAHTDw8/edit#heading=h.r2urdb2h3frx and https://docs.google.com/document/d/1ADOY6Mr9pwVf2EUr2wb-dt8_KVUBxMSksrrcY5Qqi5o/edit#heading=h.8u9choka43a
Here is an example object with OCR files attached with the "transcription" role: https://argo-stage.stanford.edu/view/druid:qv402bt5465
We will also need to know where the OCR came from (manual or auto-generated) and whether it was corrected by a human after being auto-generated (so that we can prevent overwriting of manually corrected OCR).
Task is to:
access
)?