Open petermr opened 2 years ago
Can you check if the last line is actually in an earlier position? When I was looking at PDF.js, the footer tended to be the second entry for each page, and would appear in position number 2. I had to sort lines by their Y coordinate to be able to detect paragraphs.
I will sort by Y.
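A minimal sketch of what sorting by Y could look like, assuming each extracted text item carries an `x`, `y` and `text` (the `TextItem` class, `sort_reading_order` and the tolerance value are hypothetical names for illustration, not the PDF.js or ami3 API). It assumes PDF-style coordinates where Y increases upward, so the top of the page has the largest Y; flip `y_increases_downward` if the extractor reports screen coordinates.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical container for one extracted text fragment."""
    text: str
    x: float
    y: float


def sort_reading_order(items, y_tolerance=2.0, y_increases_downward=False):
    """Group items into lines by Y, then order each line left to right by X.

    Items whose Y differs from the first item of the current line by at most
    y_tolerance are treated as being on the same line.
    """
    # Top of page first: largest Y first unless Y grows downward.
    ordered = sorted(items, key=lambda it: it.y, reverse=not y_increases_downward)
    lines = []
    for item in ordered:
        if lines and abs(lines[-1][0].y - item.y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [sorted(line, key=lambda it: it.x) for line in lines]


# The footer arrives second in stream order but ends up last after sorting.
items = [
    TextItem("body", 72, 700),
    TextItem("footer", 72, 40),
    TextItem("more", 72, 686),
    TextItem("right", 200, 700.5),
]
sorted_lines = [[it.text for it in line] for line in sort_reading_order(items)]
```

This only fixes the order within a single column; multi-column pages would still need something extra.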
The problem with spaces is that they have several meanings:
This sentence has a lot of whitespace to pad out
These are headings: Name Place Date
There is no deterministic algorithm to decide which meaning applies. It has to be worked out from content and context.
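To illustrate the ambiguity, here is a small sketch (the function name `split_on_wide_gaps` and the two-space threshold are assumptions for illustration only). Splitting on runs of spaces treats a padded sentence and a row of headings in exactly the same way, which is why whitespace alone cannot settle it.

```python
import re


def split_on_wide_gaps(line, min_spaces=2):
    """Split a line wherever at least min_spaces consecutive spaces occur."""
    pattern = r" {%d,}" % min_spaces
    return [cell for cell in re.split(pattern, line.strip()) if cell]


# A heading row and a padded sentence both come back as a list of "cells".
headings = split_on_wide_gaps("Name    Place    Date")
padded = split_on_wide_gaps("This   sentence   has   padding")
plain = split_on_wide_gaps("ordinary single spaced text")
```

The heading row and the padded sentence are indistinguishable at this level, so any real decision would need to look at neighbouring lines or the words themselves.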
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
My original issue may indeed be a sorting artefact (and not a bug). I have written an x-then-y sorter and will investigate further.
There is no guarantee of reading order. I think these pages may be in the order
Initial inspection of text from the ami3 PDF reader suggests that the last line of text on a page has been clipped. This may be an off-by-one error, or the reader might be using the wrong media box. In practice it mainly clips the footer and does not affect the running text. Will need to assemble all ami3 errors and debug so as to create a better release.
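One way to test the media-box hypothesis, sketched under the assumption that line bounding boxes and page boxes are available as `(x0, y0, x1, y1)` tuples in the same coordinate system (`lines_outside_box` is a hypothetical helper, not part of ami3). If the footer falls inside the media box but outside whatever box the reader actually clips to, that would point at the box rather than at an off-by-one in the line loop.

```python
def lines_outside_box(line_boxes, box):
    """Return the line boxes that are not fully contained in box.

    All boxes are (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
    """
    bx0, by0, bx1, by1 = box
    return [
        lb for lb in line_boxes
        if not (bx0 <= lb[0] and by0 <= lb[1] and lb[2] <= bx1 and lb[3] <= by1)
    ]


# Illustrative numbers only: an A4-sized media box and a smaller clip box.
media_box = (0, 0, 595, 842)
clip_box = (0, 50, 595, 842)
body_line = (72, 686, 500, 698)
footer_line = (72, 30, 300, 42)

clipped_by_media = lines_outside_box([body_line, footer_line], media_box)
clipped_by_clip = lines_outside_box([body_line, footer_line], clip_box)
```

Here the footer survives the media box but is dropped by the smaller box, which is the pattern to look for in the real pages.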