Open mojomonger opened 1 year ago
Would it be useful to have a debug=true parameter that dumps all the text and annotations?
If that is the best way to dump the text, then yes!
Investigation: the PDF has an unknown document producer, which may be why no links are found:
You cannot see this when the document is rendered, but it is a broken/non-standard PDF. An edge case.
This works (no spaces):
The spaces here cause the regex to not find the links:
Possible solution: https://github.com/internetarchive/iari/issues/852
In this edge case it would work to keep the line breaks and instead remove all spaces before running the link regex.
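A rough sketch of that idea, in case it helps the discussion: strip spaces and tabs within each line, keep the line breaks, then run a URL regex per line. The function name `extract_urls_keep_linebreaks` and the pattern `URL_PATTERN` are made up for illustration; the regex IARI actually uses is not shown in this thread, so treat this as an assumption, not the real implementation.

```python
import re

# Hypothetical URL pattern, for illustration only (not IARI's actual regex).
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls_keep_linebreaks(text: str) -> list[str]:
    """Remove spaces/tabs inside each line, keep line breaks, then match URLs."""
    urls = []
    for line in text.splitlines():
        # Collapse the stray spaces that the broken PDF puts inside the URL.
        compact = re.sub(r"[ \t]+", "", line)
        urls.extend(URL_PATTERN.findall(compact))
    return urls


# Made-up sample text imitating the spaced-out extraction output.
sample = "see h t t p s : / / www. itu. int / pub\nnext line of text"
print(extract_urls_keep_linebreaks(sample))
```

Note the trade-off: with all spaces removed, any text that follows a URL on the same line gets glued onto it, so this probably only makes sense as a fallback for documents where the normal extraction finds almost nothing.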
IARE:
https://internetarchive.github.io/iare/?url=https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf
produces only 1 URL link.
There are hundreds in the document, as you can see by looking at the document directly:
https://www.itu.int/dms_pub/itu-d/opb/ind/d-ind-global.01-2022-pdf-e.pdf