mitko / readable_climate_reports

Make climate reports machine readable, so they can be rendered in various inclusive ways
MIT License

PMR ami3 PDF reader clips last line of text #10

Open petermr opened 2 years ago

petermr commented 2 years ago

Initial inspection of text from the ami3 PDF reader suggests that the last line of text on a page has been clipped. This may be an off-by-one error, or the reader may be using the wrong media box. In practice it mainly clips the footer and does not affect the running text.

Will need to assemble all ami3 errors and debug so as to create a better release.

mitko commented 2 years ago

Can you check if the last line is actually in an earlier position? When I was looking at PDF.js, the footer tended to be the second entry for each page, and would appear in position number 2. I had to sort lines by their Y coordinate to be able to detect paragraphs.
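A minimal sketch of that sorting step, in Python rather than PDF.js, might look like the following. The `TextLine` type, the `gap_factor` threshold, and the convention that `y` is measured downward from the top of the page are all assumptions made for illustration. If an extractor reports native PDF coordinates (origin at the bottom left), the `y` values would need to be flipped first.

```python
from dataclasses import dataclass
from statistics import median_low


@dataclass
class TextLine:
    text: str
    y: float  # assumption: distance from the top of the page, growing downward


def sort_top_to_bottom(lines):
    """Return the lines ordered by vertical position, ignoring extraction order."""
    return sorted(lines, key=lambda line: line.y)


def split_paragraphs(lines, gap_factor=1.5):
    """Group Y-sorted lines into paragraphs wherever the vertical gap is
    noticeably larger than the typical line spacing (a heuristic, not a rule)."""
    ordered = sort_top_to_bottom(lines)
    if not ordered:
        return []
    gaps = [b.y - a.y for a, b in zip(ordered, ordered[1:])]
    typical = median_low(gaps) if gaps else 0.0
    paragraphs = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if typical > 0 and (cur.y - prev.y) > gap_factor * typical:
            paragraphs.append([cur])  # large gap: start a new paragraph
        else:
            paragraphs[-1].append(cur)
    return paragraphs


# The footer arrives first in extraction order but ends up last after sorting.
page = [
    TextLine("footer", 700),
    TextLine("a", 100),
    TextLine("b", 112),
    TextLine("c", 140),
    TextLine("d", 152),
]
paragraphs = split_paragraphs(page)
```

Using the lower median of the gaps keeps one very large gap (such as the jump down to a footer) from distorting the estimate of normal line spacing.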

petermr commented 2 years ago

I will sort by Y.

The problem with spaces is that they have several meanings:

This sentence has a lot of whitespace to pad out

These are headings: Name Place Date

There is no deterministic algorithm to decide between them. Any decision has to use content and context.



petermr commented 2 years ago

My original issue may indeed be a sorting artefact (and not a bug). I have written an x-then-y sorter and will investigate further.
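One way to tell a sorting artefact from genuine clipping is to check whether the "missing" text is present anywhere in the extraction order, and then to sort on two coordinates. The sketch below is not the ami3 sorter. `Fragment`, `find_text`, and `sort_two_keys` are hypothetical names, and `y` is again assumed to grow downward from the top of the page.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    text: str
    x: float
    y: float  # assumption: measured downward from the top of the page


def find_text(fragments, needle):
    """Indices (in extraction order) of fragments containing the needle.
    An empty result suggests real clipping; a hit at an unexpected index
    suggests the text was extracted but out of reading order."""
    return [i for i, f in enumerate(fragments) if needle in f.text]


def sort_two_keys(fragments, primary="y"):
    """Sort on both coordinates; `primary` selects which one dominates."""
    if primary == "y":
        return sorted(fragments, key=lambda f: (f.y, f.x))
    return sorted(fragments, key=lambda f: (f.x, f.y))


frags = [
    Fragment("Title", 50, 40),
    Fragment("Page 3", 50, 780),  # footer appears second in extraction order
    Fragment("right", 300, 100),
    Fragment("left", 50, 100),
]
footer_positions = find_text(frags, "Page 3")
by_row = [f.text for f in sort_two_keys(frags, primary="y")]
by_column = [f.text for f in sort_two_keys(frags, primary="x")]
```

With `y` as the primary key the footer moves to the end of the page; with `x` as the primary key the fragments are grouped by column instead, which may suit multi-column layouts better.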

There is no guarantee of reading order. I think these pages may be in the order