metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0
1.03k stars 113 forks

Cuts off links that span two lines #40

Open marshalmiller opened 4 years ago

marshalmiller commented 4 years ago

Links that spill over onto a second line are cut off when being recognized and are thus reported as dead.

almereyda commented 3 years ago

Some links also span more than two lines and are likewise not recognised.

metachris commented 3 years ago

Thanks for the report/request. But it's hard to detect... Maybe someone is interested and can come up with a regex that works?

maximiliancw commented 1 year ago

Replacing all line breaks (e.g. \n) in the text before passing it to the regex should work? Specifically, we could do so in the extract_url function, I believe. Will try this out and submit a PR, if it works.
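
A minimal sketch of that idea, for illustration only. The pattern `URL_RE` and the helper `extract_urls_joined` are hypothetical stand-ins, not pdfx's actual regex or its extraction function, and the sample text is made up:

```python
import re

# Hypothetical, simplified URL pattern; NOT the regex pdfx actually uses.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def extract_urls_joined(text):
    """Remove line breaks, then search for URLs (hypothetical helper)."""
    joined = re.sub(r"\r?\n", "", text)
    return URL_RE.findall(joined)


# A link that wraps onto a second line, as described in this issue.
sample = "See https://example.com/a/very/long/\npath/to/file.pdf for details."

print(URL_RE.findall(sample))       # match stops at the line break
print(extract_urls_joined(sample))  # match continues across the line break
```

One caveat with this approach: stripping every line break also glues a URL that ends a line onto the first word of the next line, so `"https://example.com/doc\nNext"` would come out as `https://example.com/docNext`. It trades one kind of wrong link for another unless the join is limited to cases that look like a wrapped URL.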