internetarchive / iari

Import workflows for the Wikipedia Citations Database
GNU General Public License v3.0

as a patron I want the pdf endpoint to extract all urls from Global Connectivity Report so I can check their status #844

Open mojomonger opened 1 year ago

mojomonger commented 1 year ago

IARE:

https://internetarchive.github.io/iare/?url=https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf

produces only 1 URL link.

There are hundreds in the document, as you can see by looking at the document directly:

https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf

dpriskorn commented 1 year ago

Would it be useful to have a debug=true parameter that dumps all the text and annotations?

mojomonger commented 1 year ago

if that is the best way to dump the text, then yes!

dpriskorn commented 1 year ago

Investigation: the PDF has an unknown document producer (screenshot not preserved). Cause of not finding any links:

You cannot see this when the document is rendered, but it is a broken/non-standard PDF, an edge case.

see https://archive.org/services/context/iari/v2/statistics/pdf?url=https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf&debug=true&refresh=true
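For anyone reproducing this, here is a minimal sketch of building the debug request URL. The endpoint and the `url`, `debug`, and `refresh` parameters are taken from the link above; the helper name `build_pdf_stats_url` is made up for illustration. Note that `urlencode` percent-encodes the inner PDF URL, which the hand-written link above does not do.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as it appears in the link above.
ENDPOINT = "https://archive.org/services/context/iari/v2/statistics/pdf"


def build_pdf_stats_url(pdf_url: str, debug: bool = False, refresh: bool = False) -> str:
    """Build the statistics URL for a PDF (hypothetical helper, not part of IARI)."""
    params = {"url": pdf_url}
    if debug:
        params["debug"] = "true"      # ask for the text/annotation dump
    if refresh:
        params["refresh"] = "true"    # request a fresh result, as in the link above
    return f"{ENDPOINT}?{urlencode(params)}"


pdf = "https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf"
request_url = build_pdf_stats_url(pdf, debug=True, refresh=True)

# Round-trip check: the query string decodes back to the original values.
query = parse_qs(urlsplit(request_url).query)
print(query["url"][0] == pdf, query["debug"], query["refresh"])
```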

This works (no spaces): [image not preserved]. The spaces here cause the regex to not find the links: [image not preserved].

dpriskorn commented 1 year ago

Possible solution: https://github.com/internetarchive/iari/issues/852

dpriskorn commented 1 year ago

In this edge case it would work to keep the line breaks and remove all spaces instead.
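A rough sketch of that idea, under assumptions: the regex below is a simple stand-in (the pattern IARI actually uses is not shown in this thread), `extract_urls` is a hypothetical helper, and the sample text is invented to mimic extracted text with stray spaces inside a URL.

```python
import re

# Hypothetical stand-in pattern; not the regex IARI uses.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text: str, strip_spaces: bool = False) -> list[str]:
    """Find URLs line by line, optionally deleting all spaces first."""
    urls = []
    # Line breaks are kept as separators, so URLs on different lines never merge.
    for line in text.splitlines():
        if strip_spaces:
            line = line.replace(" ", "")
        urls.extend(URL_PATTERN.findall(line))
    return urls


# Invented sample: the first line has spaces between every character.
sample = "See h t t p s : / / w w w . i t u . i n t / e n\nSource: https://example.org/a"

plain = extract_urls(sample)
stripped = extract_urls(sample, strip_spaces=True)
print(plain)     # only the URL that had no stray spaces
print(stripped)  # both URLs
```

The trade-off is that removing every space also glues any text that follows a URL on the same line onto it, so this probably only suits documents like this one, and it may be safer as a fallback when the normal pass finds suspiciously few links.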