metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0
1.03k stars, 113 forks

Way to check only real hyperlinks #18

Open capncodewash opened 8 years ago

capncodewash commented 8 years ago

Hi there, I've been using pdfx with the '-c' option to check links in PDFs.

I was wondering if there's a way to restrict the list of links that it checks to actual PDF hyperlinks - because it seems to also pull out any non-hyperlinked body text that contains a URL.

My PDFs sometimes contain example URLs as plain text that shouldn't validate (e.g. http://your-subdomain.example.com), so I want to avoid checking these.

Thanks,

Graeme

metachris commented 8 years ago

Currently, pdfx -c checks every link it extracts; there is no option to restrict the check to actual PDF hyperlinks. I will add this to the feature backlog.
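
Until such an option exists, one possible workaround is to filter the extracted URLs before checking them. The following is a minimal sketch and not part of pdfx: the helper names `is_placeholder_url` and `urls_worth_checking` and the `extracted` list are hypothetical. It only screens out the domains reserved for documentation and testing (RFC 2606), so it would not catch other plain-text URLs and cannot tell a real hyperlink from body text.

```python
from urllib.parse import urlsplit

# Domains and TLDs reserved for documentation and testing (RFC 2606).
RESERVED_DOMAINS = ("example.com", "example.net", "example.org")
RESERVED_TLDS = ("example", "test", "invalid", "localhost")


def is_placeholder_url(url):
    """Return True if the URL points at a reserved example/test domain."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    # Match the reserved second-level domains and any of their subdomains.
    if any(host == d or host.endswith("." + d) for d in RESERVED_DOMAINS):
        return True
    # Match the reserved top-level domains.
    return host.rsplit(".", 1)[-1] in RESERVED_TLDS


def urls_worth_checking(urls):
    """Drop placeholder URLs, keeping the original order."""
    return [u for u in urls if not is_placeholder_url(u)]


# Hypothetical input: URLs as they might be listed for a document.
extracted = [
    "http://your-subdomain.example.com",
    "http://www.metachris.com/pdfx",
]
print(urls_worth_checking(extracted))
```

With the sample list above, only http://www.metachris.com/pdfx would remain to be checked.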

(Sent by email in reply to Graeme West's message of Tue, Jul 26, 2016, 9:20 AM.)

capncodewash commented 8 years ago

Thank you Chris!