sul-dlss / vt-arclight

An Arclight-based discovery application for materials from the Virtual Tribunals project

Explore feasibility of offering OCR .txt files for download #278

Closed by marlo-longley 1 year ago

marlo-longley commented 1 year ago

Text documents use a different viewer (Mirador), and I don't think it automatically offers the transcripts for download the way the media players do.

@laurensorensen asked about the possibility of doing this today.

We have the .txt files on the servers in vt/shared/data/fulltext, but not in a public folder. I'm not sure whether we want to take this on late in the workcycle, or what the UI implications might be, but we do have the files. If we decide to proceed, we should be mindful of the impact on the harvesting pipeline and keep track of any changed paths.

jcoyne commented 1 year ago

Why would we want to do this? If someone wants the text, surely they would prefer to get the structured OCR from stacks, where they can do things like exclude introduction pages, etc.

marlo-longley commented 1 year ago

That is a question for @laurensorensen -- apparently it was listed as a requirement for the product. She specifically asked me about plain text OCR.

ggeisler commented 1 year ago

We heard from potential users that they would really like the full text of all items in the collection so they can do text analysis, etc. It isn't clear that we currently have the necessary legal permission to offer a convenient single download of all collection data, but we were told we can offer the full text at the item level.

> If someone wants the text, surely they would prefer to get the structured OCR from stacks, where they can do things like exclude introduction pages, etc.

@jcoyne Do you mean by going to, for example, https://purl.stanford.edu/ct549bj5123.xml?

If so, I think the structured data could be useful to the end-user, for sure. On the other hand, the average user is not going to have any idea that they can do that. So we would need to create a user-friendly affordance (a button or whatever) on the component show page that links to that, and/or to the unstructured .txt file we store in the application.

I think we're just trying to understand what our potential options are so we know whether we should consider adding an affordance for the user to get the full text for an item.

jcoyne commented 1 year ago

@ggeisler I'm thinking more of these Alto files: https://stacks.stanford.edu/file/druid:ct549bj5123/ct549bj5123_0002.xml
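For anyone weighing the structured files against the plain .txt files: a minimal sketch of flattening one ALTO page into plain text, assuming the standard ALTO layout of `TextLine` elements containing `String` elements with a `CONTENT` attribute. The `alto_to_text` helper and the inline `SAMPLE_ALTO` page are hypothetical illustrations, not part of vt-arclight or stacks, and the sketch ignores hyphenation (`HYP`) and reading-order issues.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the XML namespace so the code works across ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml: str) -> str:
    """Return one line of text per ALTO TextLine, words joined by single spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Each recognized word is a String element; its text is in CONTENT.
        words = [
            child.attrib.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(word for word in words if word))
    return "\n".join(lines)


# Hypothetical two-line page, for illustration only.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="I"/><SP/><String CONTENT="certify"/><SP/><String CONTENT="that"/>
    </TextLine>
    <TextLine>
      <String CONTENT="none"/><SP/><String CONTENT="of"/><SP/><String CONTENT="the"/><SP/><String CONTENT="documents"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
"""

print(alto_to_text(SAMPLE_ALTO))
```

Working per page like this is what would let a user skip introduction pages, at the cost of handling XML.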

laurensorensen commented 1 year ago

In my experience working with researchers and academics who aren't devs but who write simple bash or Python scripts, they'd prefer to work with CSV or plain text rather than touch XML or learn how to write XSLT. It would be great to hear about other experiences if you have examples.

This is an actual request we got from the Spotlight exhibit in October 2021:

Dear Sir or Madam, thanks for publishing the documents of the International Court of Justice Archives of the Nuremberg International Military online. But i'm looking for the full/plain text of the digitalized images? i' assistant researcher at the university of wuppertal and i'm looking for material, which i can use for my lecture (python and the humanisits) . Can you provide me a description / How To / Tutorial to get the full/plain/orced text of the digitized documents? Thanks im Advantage, Malte Windrath

laurensorensen commented 1 year ago

FWIW, at this stage we don't have permission from ICJ to compile a full corpus of OCR plain text. But Tom OK'd the one-at-a-time download of plain text OCR.

jcoyne commented 1 year ago

@laurensorensen the files we have created are the full document corpus compiled from distinct pages. They are not very useful, because we don't have good language detection, or even good script detection for Cyrillic (or Western diacritics). Given the preponderance of German in the documents, I believe it would be an embarrassment to release this data and suggest that people use it. Even the typed English is pretty unusable:

I cer+ify that none of the documents included herein have been denied bv the tribtnc1 en‘ thet this ocument book has been exnincd vith the prosecution in cccordence vith the ruling cf the tribunal dated 4 April 1945.

The Russian is much worse:

He Iy6JMKOBaTb IO Toro, kak npeCTaBJeHO Ha Iy6JMHOM 3aceqaHMV TpyHaua M TOJIbKO Ty vacTb, Koropas npecTaBeHa B KaHecTBe ORa3aTebCTBa
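One rough way to spot this kind of script failure, sketched under the assumption that a page expected to be Russian should consist mostly of Cyrillic letters. The `cyrillic_fraction` helper is hypothetical, and the second test phrase is a generic Russian example rather than text from the collection.

```python
import unicodedata


def cyrillic_fraction(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name marks them as Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("CYRILLIC")
    )
    return cyrillic / len(letters)


# Start of the OCR output quoted above, as transcribed here in Latin characters.
ocr_output = "He Iy6JMKOBaTb IO Toro"
# Generic Russian phrase used only for comparison.
real_russian = "Не публиковать"

print(cyrillic_fraction(ocr_output))
print(cyrillic_fraction(real_russian))
```

A value near zero on a page catalogued as Russian would suggest the OCR engine read Cyrillic glyphs as Latin look-alikes.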

On some pages where the OCR didn't detect the landscape orientation (for example, https://stacks.stanford.edu/file/druid:pc484zt3053/pc484zt3053_0072.xml), the output is just junk:

a 9 0 co E1 6 D CD CO © a 0 R 8 2 CD N # CD S O C to + 5* Q © CD cO CD • ch CD • CD CD cd o CD # # CD S cO 60 m IO 3 CD cd 0 CD 0 C cd CD CD (D td CD CD CD rd .8 8 o CD (D CD (D CO # CD m CD CD CD CD I c CO E O CO CD CD id CO O Q 3 S cD CD CD CO CD CD CD cd g CO cd al E # CD CD c CD cd CO A CD A CO # CO CD 3 CD N # CD CD E cd 8 A C to 1 CD CD 3 # 8 O H CD <D CD to # S CD co cd CD CD 0) a) co CD o CD CD CO O N C cd H a £ 3 § CD © co (D CD CD Cd G O S co CD CO 8 CD a cd H 3 •6 5 3 co £ CD CD C H CD 9 CD 8 a cd H a g3 # cd CD cd CD E « 9 CD C C CD H S CD 8 CD cd co oo CD Td CD 3 8 CO 2 CD H O P A C CD CO CD to CD 50 8 8 N a 8 c G (D O G g G CD TO CD cd CD 3 G g G CD CD M 02 3 a G co # CD IU • a CD to CD to CD to 5 H CD U CD N • to 8 N CD 8 to co 8 F G to H CD F to 3 3 Ch CD H CD tJ G © co 3 8 G CO 3 co o CD 5 5 g OT <D Eo CD 0 cd Ch $ CD G co co to o Eh G co o a CD q g # + 6 G cd (D C E (D P $ CO a CD 8 CD E CD 3 N + CD 5 CD 5 s 8 cd CD to co G co CD O 2 (0 CD 0 G G CD G (D • O G CD 5 G 0 ,Q cd CD H G g CD to CD 2 CD E 1 co F E 8 3 cd g a CD CD 8 CD G CD 6 a 8 8 5 8 G G CD CD co •r
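Pages like this could in principle be flagged before any text is offered for download. A minimal sketch of one heuristic: measure the share of very short tokens, which dominates output like the sample above. The function names and the 0.6 threshold are assumptions chosen for illustration and have not been tuned against the collection.

```python
def short_token_ratio(text: str, max_len: int = 2) -> float:
    """Share of whitespace-separated tokens that are at most max_len characters."""
    tokens = text.split()
    if not tokens:
        return 0.0
    short = sum(1 for token in tokens if len(token) <= max_len)
    return short / len(tokens)


def looks_like_junk(text: str, threshold: float = 0.6) -> bool:
    """Flag a page when short tokens dominate (threshold is an untuned assumption)."""
    return short_token_ratio(text) >= threshold


# Excerpts taken from the examples quoted in this thread.
noisy_english = (
    "I cer+ify that none of the documents included herein "
    "have been denied bv the tribtnc1"
)
rotated_page = "a 9 0 co E1 6 D CD CO a 0 R 8 2 CD N # CD S O C to + 5* Q CD cO CD"

print(looks_like_junk(noisy_english), looks_like_junk(rotated_page))
```

Note that this only catches the rotated-page case; the noisy but word-shaped English and Russian output would pass, so it is not a general quality measure.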

laurensorensen commented 1 year ago

Good to know, thanks for the examples! Sounds like we'll put it on the long-term "must have" list instead of the short-term one.

marlo-longley commented 1 year ago

@laurensorensen @ggeisler @jcoyne Thank you for the discussion on this. I am thinking we can close this with the conclusion that we will not implement it this workcycle. If there's another path you'd like to take, let me know.