internetarchive / iari

Import workflows for the Wikipedia Citations Database
GNU General Public License v3.0

URLs from PDF are inaccurate and do not reflect link in original document #753

Status: Closed (dpriskorn closed 1 year ago)

dpriskorn commented 1 year ago

From:

https://www.foundationforfreedomonline.com/wp-content/uploads/2023/03/FFO-FLASH-REPORT-REV.pdf

I see this URL:

https://www.cisa.gov/topics/election-security/foreign-influence-operations-and-disinformation

But the parser for iare reported it as:

https://www.cisa.gov/topics/election-security/foreign-influence-operations-and-

This results in a false 404.

dpriskorn commented 1 year ago

Unfortunately this is related to a line break that PyPDF2 does not seem to detect correctly. I tried the command-line pdftotext utility and got output (posted as an image) that is also not a valid URL, because a slash got eaten somewhere in the PDF-to-text conversion.
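A minimal sketch of how the truncation can arise, assuming the URLs are pulled out of the extracted text with a whitespace-delimited regex (the actual IARI extraction code is not shown in this thread, and the sample text is hypothetical). Because the line break inside the URL counts as whitespace, the match stops at the trailing hyphen:

```python
import re

# Assumption: a simple pattern that treats any whitespace as the end of a URL.
URL_RE = re.compile(r"https?://\S+")

# Hypothetical extracted text; the PDF wraps the URL after the hyphen.
extracted = (
    "See https://www.cisa.gov/topics/election-security/"
    "foreign-influence-operations-and-\n"
    "disinformation for details."
)


def find_urls(text):
    """Return every whitespace-delimited URL found in the text."""
    return URL_RE.findall(text)


urls = find_urls(extracted)
print(urls)
```

The only match ends in "operations-and-", which is the truncated URL reported above.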

dpriskorn commented 1 year ago

I created https://github.com/py-pdf/pypdf/issues/1810

dpriskorn commented 1 year ago

The issue persists with pypdf. See https://archive.org/services/context/wari/v2/statistics/pdf?url=https://www.foundationforfreedomonline.com/wp-content/uploads/2023/03/FFO-FLASH-REPORT-REV.pdf

mojomonger commented 1 year ago

For the Covid-19 test PDF, there are fewer false broken links (254 this time vs. 453 last time), but there are still some “broken” URLs. One in particular:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610519/

In the PDF document, page 299, the reference can be seen as:

Rambaut, A., Holmes, E.C., O’Toole, Á., et al. (2020). A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology. Published online July 15, 2020. https://doi.org/10.1038/s41564-020-0770-5 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610519/; Bedford Cohen Interview Science

Note the trailing semicolon after the link.
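One possible mitigation for this case, sketched under the assumption that URLs are matched with a whitespace-delimited regex: strip common trailing punctuation from each match. The `TRAILING` set and the `extract_urls` name are illustrative, not taken from IARI.

```python
import re

URL_RE = re.compile(r"https?://\S+")

# Assumption: characters that rarely end a real URL but often follow one in prose.
TRAILING = ".,;:!?)]}'\""


def extract_urls(text):
    """Find whitespace-delimited URLs and drop trailing punctuation."""
    return [match.rstrip(TRAILING) for match in URL_RE.findall(text)]


reference = (
    "Published online July 15, 2020. "
    "https://doi.org/10.1038/s41564-020-0770-5 "
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610519/; "
    "Bedford Cohen Interview Science"
)
urls = extract_urls(reference)
print(urls)
```

This is a heuristic: it would also damage a URL that legitimately ends in one of these characters, such as a closing parenthesis.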

mojomonger commented 1 year ago

For a PDF example that contains a single link that spans two lines, you can use:

https://www.foundationforfreedomonline.com/wp-content/uploads/2023/03/FFO-FLASH-REPORT-REV.pdf

Running the PDF endpoint on this currently returns a broken link: an extra "All" is appended to the end of the URL:

https://www.cisa.gov/topics/election-security/foreign-influence-operations-and-disinformationAll

dpriskorn commented 1 year ago

I'm sorry to say that beyond https://github.com/internetarchive/iari/issues/776 this is not fixable without ML such as GPT.

The main problem is that we are trying to structure unstructured information: we don't know where the link ends and the unrelated surrounding text starts, and there is no sure way to determine that using a regex.
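To illustrate the ambiguity, here is a small sketch that lists the URL candidates you get by gluing zero, one, or two tokens from the next line onto a URL that runs to the end of a line. The sample lines and the `candidate_urls` helper are hypothetical. All three candidates look like valid URLs, so a pattern alone cannot tell which one the document meant.

```python
import re

# Matches a URL that runs to the very end of a line.
URL_AT_LINE_END = re.compile(r"https?://\S+$")


def candidate_urls(lines, index, max_join=2):
    """List plausible URLs starting on lines[index].

    The first candidate stops at the line break; each later candidate
    appends one more whitespace-separated token from the following line.
    """
    match = URL_AT_LINE_END.search(lines[index].rstrip())
    if match is None:
        return []
    joined = match.group(0)
    candidates = [joined]
    next_tokens = lines[index + 1].split() if index + 1 < len(lines) else []
    for token in next_tokens[:max_join]:
        joined += token
        candidates.append(joined)
    return candidates


# Hypothetical text lines as they might come out of a PDF-to-text step.
lines = [
    "threats: https://www.cisa.gov/topics/election-security/foreign-influence-operations-and-",
    "disinformation All rights reserved.",
]
for url in candidate_urls(lines, 0):
    print(url)
```

The three printed candidates correspond to the truncated URL, the correct URL, and the "disinformationAll" variant seen in this thread. Picking the right one needs outside knowledge, for example checking which candidate resolves or reading the link target from the PDF itself.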

See https://github.com/internetarchive/iari/issues/777

dpriskorn commented 1 year ago

HuggingChat failed to extract the links correctly on the first try (screenshot: Screenshot_20230501-063122_Firefox.png).

dpriskorn commented 1 year ago

This is now fixed for PDFs that contain annotations. Links in documents without annotations cannot be reliably extracted without the use of advanced ML.
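For context, a link annotation stores its target URI separately from the page text, so line breaks in the rendered text do not affect it. The sketch below shows one way to read those targets with pypdf. It is an assumption about the general approach, not the actual IARI fix, and both function names are hypothetical.

```python
def uri_from_annotation(annotation):
    """Return the target URI of a /Link annotation dictionary, or None."""
    if "/Subtype" not in annotation or annotation["/Subtype"] != "/Link":
        return None
    if "/A" not in annotation:
        return None  # e.g. an internal link that uses /Dest instead
    action = annotation["/A"]
    if "/URI" not in action:
        return None
    return str(action["/URI"])


def uris_from_pdf(path):
    """Collect link-annotation URIs from every page of a PDF file."""
    from pypdf import PdfReader  # third-party dependency, imported lazily

    uris = []
    for page in PdfReader(path).pages:
        if "/Annots" not in page:
            continue
        for ref in page["/Annots"]:
            uri = uri_from_annotation(ref.get_object())
            if uri is not None:
                uris.append(uri)
    return uris
```

A call such as `uris_from_pdf("report.pdf")` (file name hypothetical) returns an empty list for a PDF without link annotations, which is the case described above as not reliably extractable.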

dpriskorn commented 1 year ago

Stalled, so closing.