As a patron, I want the PDF endpoint to extract all URLs from the Global Connectivity Report so I can check their status

internetarchive / iari

Import workflows for the Wikipedia Citations Database

GNU General Public License v3.0

Open mojomonger opened 1 year ago

mojomonger commented 1 year ago

IARE:

produces only 1 URL link.

There are hundreds in the document, as you can see by looking at the document directly:

dpriskorn commented 1 year ago

Would it be useful to have a debug=true parameter that dumps all the text and annotations?
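A minimal sketch of what such a flag could look like, assuming the endpoint already has the per-page text, the annotations, and the extracted links in memory. The function name `build_pdf_response` and the response keys are hypothetical, not IARI's actual API:

```python
from typing import Any


def build_pdf_response(
    text_pages: list[str],
    annotations: list[dict[str, Any]],
    links: list[str],
    debug: bool = False,
) -> dict[str, Any]:
    """Build the JSON payload; raw text and annotations are included only when debug is set."""
    response: dict[str, Any] = {"links_total": len(links), "links": links}
    if debug:
        # Hypothetical keys: dump everything the extractor saw so a patron
        # can compare it with the rendered document.
        response["debug_text"] = text_pages
        response["debug_annotations"] = annotations
    return response
```

Keeping the dump behind the flag leaves the default response small, while still making the raw extraction available when a document yields unexpectedly few links.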

mojomonger commented 1 year ago

If that is the best way to dump the text, then yes!

dpriskorn commented 1 year ago

Investigation: an unknown document producer appears to be the cause of not finding any links:

You cannot see this when the document is rendered, but it is a broken/non-standard PDF, an edge case.

This works (no spaces):

The spaces here cause the regex to not find the links:

dpriskorn commented 1 year ago

In this edge case, it would work to keep the line breaks and remove all spaces instead.
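A rough sketch of that idea, under stated assumptions: `URL_PATTERN` is a hypothetical stand-in (the regex IARI actually uses is not shown in this thread), and the sample text is invented to mimic a PDF whose extracted text has stray spaces inside URLs.

```python
import re

# Hypothetical pattern; the real regex is not shown in this thread.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")


def extract_urls(page_text: str) -> list[str]:
    """Remove spaces and tabs but keep line breaks, then match URLs line by line."""
    urls: list[str] = []
    for line in page_text.splitlines():
        compact = re.sub(r"[ \t]+", "", line)  # drop spaces within the line only
        urls.extend(URL_PATTERN.findall(compact))
    return urls


# Invented sample: the URL is split by spaces, as in the broken PDF text.
sample = "See h t t p s : / / e x a m p l e . o r g / r e p o r t\nnext line of text"
print(URL_PATTERN.findall(sample))  # the spaced-out URL is not matched
print(extract_urls(sample))
```

One trade-off to keep in mind: because every space on a line is removed, any prose that follows a URL on the same line is glued onto it, so this would probably only be safe as a fallback for documents where the normal extraction finds almost nothing.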