plazi / GoldenGATE-Imagine

A GUI Tool For Freeing Text and Data from PDF Documents

table issue 8407FFA8D92EF026B72A212DFF8AFF96 vertebrate zoology #11

Open myrmoteras opened 3 years ago

myrmoteras commented 3 years ago

in this article, I can't get the table to mark. Is there anything we can do about it? [image]

I get an error message that it can't create proper HTML. [image]

is there anything that could be done with these consecutive images? [image]

is there a reason that almost all hyphenated words had to be corrected in the QC?

gsautter commented 3 years ago

is there a reason that almost all hyphenated words had to be corrected in the QC?

"Tools > Check Text Flow Breaks" might offer some help here, especially if there were problems in the page structure right after decoding. Otherwise hard to tell ... there may well have been such errors in earlier documents as well, but we've only been able to check for this type of error since the generalization of the QC infrastructure this summer/fall.

gsautter commented 3 years ago

in this article, I can't get the table to mark. Is there anything we can do about it?

Yes, we've had the (rotated) sub window editing since September ... just click in the page edge, select "Edit Page in Sub Window", and select "90° Clockwise Rotation". Then, you can mark the table in the rotated sub window, and it writes back to the main window after closing the rotated dialog with "OK" ... see also https://github.com/plazi/ggi/issues/102

gsautter commented 3 years ago

In the upright page orientation, the columns (actually the rows) are just too dense for the "Mark Table" macro to find any viable column splits.
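To illustrate why dense content leaves no column splits: a minimal sketch, assuming column splits are found as vertical whitespace gaps between the horizontal extents of words. This is not the actual "Mark Table" implementation; the function name `find_column_gaps` and the `min_gap` threshold are hypothetical.

```python
def find_column_gaps(word_spans, min_gap):
    """Return (left, right) whitespace gaps at least min_gap wide.

    word_spans: iterable of (left, right) horizontal extents of words,
    collected over all lines of the table area (assumed input format).
    """
    gaps = []
    covered_to = None  # right edge of the horizontal range covered so far
    for left, right in sorted(word_spans):
        if covered_to is not None and left - covered_to >= min_gap:
            gaps.append((covered_to, left))
        covered_to = right if covered_to is None else max(covered_to, right)
    return gaps


# Well separated columns yield one split candidate.
spaced = find_column_gaps([(0, 10), (5, 12), (30, 40), (42, 50)], min_gap=5)

# Densely packed content yields no split candidate at all.
dense = find_column_gaps([(0, 10), (11, 20), (21, 30)], min_gap=5)
```

With the same threshold, the spaced input produces a gap between 12 and 30, while the dense input produces none, which is the situation described for the upright orientation.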

gsautter commented 3 years ago

In general, Table 2 is a bit of a nightmare, though ... the portion on Page 6 (number 28) actually has two parts, the bottom one continuing on Page 7 (number 29), and those two parts together continue on the right of the top part on Page 6 ... This constitutes a pretty tricky layout, as tiling tables together right now only works if the tiles exhibit some regularity, in particular one tile on top of exactly one tile, and/or one tile to the left of exactly one tile, as otherwise the logic for overall assembly becomes prohibitively complex ... and here we have two tiles in top-bottom arrangement to the right of one tile, which we cannot resolve into an overall table at this point ...

Not sure whether or not more complex logic capable of handling the above makes sense to pursue, as such somewhat asymmetric cases seem to be extremely rare at this point, and you can still copy the individual tables and patch them together in Excel.
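The regularity condition can be sketched as a simple check, assuming the tile arrangement is given as explicit "above" and "left of" pairs. The function `irregular_tiles` and the tile names A, B, C are hypothetical and only illustrate the rule stated above; this is not the actual assembly logic.

```python
from collections import defaultdict


def irregular_tiles(above, left_of):
    """Return tiles that have more than one neighbor in a single direction.

    above:   list of (tile, tile_below) pairs
    left_of: list of (tile, tile_to_the_right) pairs
    """
    problems = set()
    for relation in (above, left_of):
        neighbors = defaultdict(set)
        for tile, neighbor in relation:
            neighbors[tile].add(neighbor)
        problems.update(t for t, ns in neighbors.items() if len(ns) > 1)
    return sorted(problems)


# Regular 2x2 grid: every tile has at most one neighbor per direction.
grid = irregular_tiles(
    above=[("a", "c"), ("b", "d")],
    left_of=[("a", "b"), ("c", "d")],
)

# Layout as described for Table 2: B on top of C, both to the right of A.
table2 = irregular_tiles(
    above=[("B", "C")],
    left_of=[("A", "B"), ("A", "C")],
)
```

In the second layout, tile A sits to the left of two tiles at once, which is exactly the kind of case the current tiling does not resolve.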

gsautter commented 3 years ago

is there anything that could be done with these consecutive images?

Multipart images are in the works ... still have to figure out cases where some of the tiles are rotated while others are not, but in general the next build should be able to connect multipart images and handle them as one.

myrmoteras commented 3 years ago

is there a reason that almost all hyphenated words had to be corrected in the QC?

"Tools > Check Text Flow Breaks" might offer some help here, especially if there were problems in the page structure right after decoding. Otherwise hard to tell ... there may well have been such errors in earlier documents as well, but we've only been able to check for this type of error since the generalization of the QC infrastructure this summer/fall.

this I processed today, so it should not have anything to do with the old QC infrastructure. Will try out the tool.

Could it have something to do with decoding, i.e. that the "-" has been decoded to a different symbol that is not considered a hyphen?

gsautter commented 3 years ago

this I processed today, so it should not have anything to do with the old QC infrastructure. Will try out the tool.

Sure, it does not have anything to do with the old infrastructure ... was merely explaining why we've been seeing hyphenation errors only rather recently, namely because the old infrastructure did not report them.

Could it have something to do with decoding, i.e. that the "-" has been decoded to a different symbol that is not considered a hyphen?

That's very likely, yes ... what was the dash/hyphen decoded to? There is a good bit of normalization going on in this regard already, but there can always be some Unicode point that's still lacking on the list of possible hyphens ...
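To find out what the dash was decoded to, something like the following could help. It is a minimal sketch, assuming the affected line can be copied out as plain text; the `HYPHEN_LIKE` set is an illustrative selection and not the actual normalization list, and both function names are hypothetical.

```python
import unicodedata

# Illustrative set of code points that may stand in for a line-end hyphen.
HYPHEN_LIKE = {
    "\u002D",  # HYPHEN-MINUS
    "\u00AD",  # SOFT HYPHEN
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
    "\u2012",  # FIGURE DASH
    "\u2013",  # EN DASH
    "\u2212",  # MINUS SIGN
}


def describe_last_char(line):
    """Report code point and Unicode name of the last non-space character."""
    stripped = line.rstrip()
    if not stripped:
        return "(empty line)"
    ch = stripped[-1]
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


def ends_with_hyphen(line):
    """True if the line ends in a letter followed by a hyphen-like character."""
    stripped = line.rstrip()
    return (
        len(stripped) >= 2
        and stripped[-1] in HYPHEN_LIKE
        and stripped[-2].isalpha()
    )


report = describe_last_char("hyphen\u2010 ")
```

Running `describe_last_char` on a line that failed QC shows the exact code point, which can then be compared against the list of hyphens the normalization already covers.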