plazi / Biodiversity-Literature-Repository

Covers creating, maintaining, and uploading to the BLR

taxonomic treatments; xhtml version of treatment with italics, etc #79

Open myrmoteras opened 4 years ago

myrmoteras commented 4 years ago

In this case: [image]

Where we have OCR input, we run into the problem of wrong italics and bold (plus other OCR artifacts).

Do we have to show emphasis?

myrmoteras commented 4 years ago

@gsautter @slint is there a rule or reason why we get this sort of zebra formatting?

https://doi.org/10.5281/zenodo.3818928

FFCFFFB2FFE57F0FFFCBDD55FF8FF517

[image]

gsautter commented 4 years ago

Could be that the bold face of the in-line headings ("DIAGNOSIS" and "REDESCRIPTION") extends to the whole paragraph for some reason ... need to check the generated HTML.

gsautter commented 4 years ago

BTW, that's the wrong treatment UUID up there ... leads to "Caulotops platensis (Berg)" ... the right link is http://tb.plazi.org/GgServer/html/03F687CAFFF77F1BFF5CDDCCFB9EF4C5

gsautter commented 4 years ago

The HTML proper looks OK to me ... it does contain <b> tags nested in one another for some reason, but they are all closed properly in the right place ... only "DIAGNOSIS" should be bold, not the whole paragraph. That's for the HTML our outgoing XSLT produces, however ... I just looked at the HTML of the Zenodo page, and one of the two closing </b> tags is missing there ... maybe some regex-based cleanup going on? Hard to tell ...
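For anyone who wants to reproduce the comparison, here is a minimal sketch of one way to count nested and unclosed `<b>` tags in an HTML fragment. It is not the check used by Plazi or Zenodo; the names `TagBalanceChecker` and `check_balance` and both sample strings are made up for illustration, and it relies only on Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks open/close balance for a single tag name."""

    def __init__(self, tag="b"):
        super().__init__()
        self.tag = tag
        self.depth = 0          # currently open, not yet closed
        self.max_depth = 0      # deepest nesting seen
        self.stray_closes = 0   # closing tags with no matching open

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == self.tag:
            if self.depth == 0:
                self.stray_closes += 1
            else:
                self.depth -= 1


def check_balance(html, tag="b"):
    """Return counts of unclosed tags, maximum nesting, and stray closing tags."""
    checker = TagBalanceChecker(tag)
    checker.feed(html)
    checker.close()
    return {
        "unclosed": checker.depth,
        "max_nesting": checker.max_depth,
        "stray_closes": checker.stray_closes,
    }


# Made-up fragments mimicking the two situations described above:
# nested but balanced <b> tags, and the same markup with one </b> dropped.
balanced = "<p><b><b>DIAGNOSIS</b></b> Body text.</p>"
one_close_missing = "<p><b><b>DIAGNOSIS</b> Body text.</p>"

print(check_balance(balanced))
print(check_balance(one_close_missing))
```

A non-zero `unclosed` count on the Zenodo HTML but not on the XSLT output would be consistent with a closing tag being dropped somewhere after the export.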

gsautter commented 4 years ago

I've been checking the IMF, and from what it looks like, the small-caps in-line headings are kind of messing up the emphases around them a bit ... looking into ways of cleaning this up ...

gsautter commented 4 years ago

Fixed for this one ... using some extended cleanup functionality I just integrated in GGI's "Check Annotation Nesting" tool, which is also part of the batch (both server and desktop) ... comes with next update.

gsautter commented 4 years ago

Sorry, overlooked the other question to @slint about the italics in OCR output ... only fixed the "zebra formatting" ...