plazi / treatmentBank

Repository devoted to housekeeping of treatmentBank

DCL JATS import to TB using Pensoft pathway? #103

Open myrmoteras opened 11 months ago

myrmoteras commented 11 months ago

issue

We have many old scanned publications with few illustrations that take a lot of time to process via GGI because of the OCR errors we get.

A solution is to send those articles/journals to DCL and get them converted to JATS, maybe a simple, minimal version of TaxPub.

To import them, we could use the TaxPub import pathway we have for the Pensoft journals, which would allow us to use GGe to annotate names and, mostly manually, material citations.

feasibility

@gsautter how feasible is this from the TB import point of view?

gsautter commented 11 months ago

In general, this should be quite feasible ... however, most likely not via the Pensoft pathway, as that is specifically tailored to their use of TaxPub (and variations therein over time), so more generic JATS would most likely require a somewhat different approach, as well as a good few additional taggers to run after the fact.

Another aspect is that while DCL transcripts are good, they might not be perfect, so it'd be a shame to disconnect the XML from the page images ... an alternative approach would be to take the XML for structure and text, still run OCR, create an IMF, and then structure and OCR correct the latter by means of the JATS ... which in essence will give us the best of both worlds.

myrmoteras commented 11 months ago

Yes, we can combine and make the best out of it, but this takes a lot of time.

We assume that DCL JATS is good enough for our purpose, and the few OCR issues we can fix with GGe. They are the OCR specialists.

Why not make a simple version first so we can start processing, and then later add the more complex version that depends on a few major changes you need to make (core split, XMF)?

We could use the Pensoft XML to develop a minimal XML (https://github.com/plazi/ggxml2taxpub/issues/60)?

What is the effort to get a straight JATS import as described above?

gsautter commented 11 months ago

> Yes, we can combine and make the best out of it, but this takes a lot of time.
>
> We assume that DCL JATS is good enough for our purpose, and the few OCR issues we can fix with GGe. They are the OCR specialists.

Sure, and I am not questioning that ... the idea is to OCR the page images, and not correct them at all, but simply use the OCR result (or whichever parts of it are correct) for matching to the XML, so basically use our own OCR as anchor points for matching and pinning the DCL XML to the page images, thereby adding positional information, etc., and then all but discard the original OCR result altogether.
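The anchoring idea can be illustrated with a minimal sketch. It is not the TreatmentBank implementation: the function names, the word-level granularity, and the `(left, top, right, bottom)` box format are all assumptions made for illustration. The transcript words are aligned against the raw OCR words, and only the OCR positions are kept.

```python
from difflib import SequenceMatcher


def normalize(token):
    """Lower-case and keep only alphanumerics, so punctuation noise in the OCR does not block a match."""
    return "".join(ch for ch in token.lower() if ch.isalnum())


def pin_words(xml_words, ocr_words):
    """Pair each transcript word with the bounding box of its matching OCR word.

    xml_words: list of str taken from the (trusted) XML transcript.
    ocr_words: list of (text, box) tuples from uncorrected OCR; the box
               format (left, top, right, bottom) is an assumption.
    Returns one (word, box) pair per transcript word; box is None where
    the OCR offered no usable anchor.
    """
    xml_keys = [normalize(w) for w in xml_words]
    ocr_keys = [normalize(text) for text, _ in ocr_words]
    matcher = SequenceMatcher(None, xml_keys, ocr_keys, autojunk=False)
    pinned = [(word, None) for word in xml_words]
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            xml_index = block.a + offset
            ocr_index = block.b + offset
            # Keep the transcript text, borrow only the OCR position.
            pinned[xml_index] = (xml_words[xml_index], ocr_words[ocr_index][1])
    return pinned


# Hypothetical example: two of the four OCR words are misrecognized.
xml_words = ["Formica", "rufa", "Linnaeus,", "1761"]
ocr_words = [
    ("Forrnica", (10, 10, 80, 30)),
    ("rufa", (85, 10, 120, 30)),
    ("Linnaeus.", (125, 10, 200, 30)),
    ("l761", (205, 10, 240, 30)),
]
for word, box in pin_words(xml_words, ocr_words):
    print(word, box)
```

Words left without a box (here the two misrecognized ones) would still need a position, for example by interpolating between the neighbouring anchors; that step is left out of the sketch.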

> Why not make a simple version first so we can start processing, and then later add the more complex version that depends on a few major changes you need to make (core split, XMF)?

Frankly, by the time we insert all the detail markup into the JATS, most of the work is done either way ... and by not fitting to the page images, we'll basically simply create more legacy XML that we have to get back to at some point, synchronize UUIDs, etc.

> We could use the Pensoft XML to develop a minimal XML (https://github.com/plazi/ggxml2taxpub/issues/60)?

Sounds like a sensible plan to me ... in order of importance, I'd want to have at least the following elements: paragraph, italics, bold, heading, section, pageBreak, caption. These elements should give us enough structure to base the semantic enhancement on.
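As a rough illustration of how a delivered file could be checked against that element list, here is a minimal sketch using only the Python standard library. The mapping from the names above to JATS tags is an assumption, not an agreed profile. In particular, using `title` for headings and `milestone-start` for page breaks is a guess that would have to be settled with DCL.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the element names in the list above to JATS tags.
# Nothing here is an agreed profile; "milestone-start" as a page-break
# marker is a hypothetical choice.
MINIMAL_TAGS = {
    "paragraph": "p",
    "italics": "italic",
    "bold": "bold",
    "heading": "title",
    "section": "sec",
    "pageBreak": "milestone-start",
    "caption": "caption",
}


def count_minimal_elements(jats_xml):
    """Count how often each element of the minimal set occurs in a JATS string."""
    root = ET.fromstring(jats_xml)
    tag_to_name = {tag: name for name, tag in MINIMAL_TAGS.items()}
    counts = {name: 0 for name in MINIMAL_TAGS}
    for element in root.iter():
        name = tag_to_name.get(element.tag)
        if name is not None:
            counts[name] += 1
    return counts


def missing_elements(jats_xml):
    """List the minimal-set elements that never occur in the document."""
    return [name for name, n in count_minimal_elements(jats_xml).items() if n == 0]


# Hypothetical sample document, not taken from a real DCL delivery.
sample = """<article><body><sec><title>Formica</title>
<p><italic>Formica rufa</italic> is <bold>common</bold>.</p>
<fig><caption><p>Habitus.</p></caption></fig></sec></body></article>"""

print(missing_elements(sample))  # -> ['pageBreak']
```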

> What is the effort to get a straight JATS import as described above?

That depends on looking at a bunch of examples, what layout peculiarities they have, and what level of detail we want to take the semantic enhancement to ... "Order out of Chaos" took a sweet while, mainly because the target granularity wasn't clear, and also because there were a good few OCR errors in quite crucial places, namely in the punctuation around in-line treatment citations, which foiled automated tagging of the latter and required a ton of manual correction (including character guesswork, which would have been straightforward in the presence of the page images).

lyubomirpenev commented 11 months ago

Given that Plazi is working on extracting material citations from Pensoft TaxPub full-text XMLs and that we have already aligned the identifiers with Plazi, I wonder if it wouldn't be more feasible to convert the journals' InDesign files to rich TaxPub JATS XMLs through Pensoft services? This would minimise the effort and, most importantly, streamline the XML harvesting workflow at Plazi from various sources, with minimal errors and fewer quality checks needed.



--

Prof. Dr. Lyubomir Penev
Managing Director, Pensoft Publishers
https://pensoft.net
Phone: +359-2-8704281
12 Prof. Georgi Zlatarski Street, 1700 Sofia, Bulgaria
https://twitter.com/Pensoft
https://www.facebook.com/Pensoft/
https://www.linkedin.com/company/pensoft-publishers/
Blog: https://blog.pensoft.net