Closed: dpriskorn closed this 1 year ago
Unfortunately, this is caused by a line break that pypdf2 does not seem to detect correctly. I tried the command-line pdftotext utility and got a result that is also not a valid URL, because a slash got eaten somewhere in the PDF-to-text conversion.
For the Covid-19 test PDF, it has fewer false broken links (254 this time vs. 453 last time), but there are still some “broken” URLs. One in particular:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610519/
In the PDF document, page 299, the reference can be seen as:
Rambaut, A., Holmes, E.C., O’Toole, Á., et al. (2020). A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology. Published online July 15, 2020. https://doi.org/10.1038/s41564-020-0770-5 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610519/; Bedford Cohen Interview Science
Note the trailing semicolon after the link.
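A minimal sketch of one way to handle this case, assuming the extractor matches URLs with a simple regex and then strips punctuation that commonly trails a link in running text. The names `URL_RE`, `TRAILING`, and `extract_urls` are hypothetical and are not taken from the IARI code base.

```python
import re

# Characters that often follow a URL in prose but are rarely its last character.
# This set is an assumption; tune it for the documents being processed.
TRAILING = ".,;:!?)]}'\""

# Deliberately naive pattern: everything from the scheme to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def extract_urls(text):
    """Find http(s) URLs in text and strip trailing punctuation from each."""
    return [match.group(0).rstrip(TRAILING) for match in URL_RE.finditer(text)]


reference = (
    "Published online July 15, 2020. https://doi.org/10.1038/s41564-020-0770-5 "
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610519/; Bedford Cohen Interview Science"
)
print(extract_urls(reference))
```

The trade-off is that a URL which legitimately ends in one of these characters, such as a closing parenthesis, would be shortened by this heuristic.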
For a PDF example that contains a single link that spans two lines, you can use:
https://www.foundationforfreedomonline.com/wp-content/uploads/2023/03/FFO-FLASH-REPORT-REV.pdf
Running the PDF endpoint on this currently returns a broken link, with an extra "All" appended to the end of the URL:
https://www.cisa.gov/topics/election-security/foreign-influence-operations-and-disinformationAll
I'm sorry to say that, beyond https://github.com/internetarchive/iari/issues/776, this is not fixable without ML such as GPT.
The main problem is that we are trying to structure unstructured information: we don't know where the link ends and the unrelated surrounding text starts, and there is no sure way to determine that with a regex.
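To illustrate why a regex alone cannot settle this, here is a small sketch. The `extracted` string is a hypothetical reconstruction of what a text extractor might produce for a link that wraps onto a second line; it is not actual output from the PDF. Both function names are made up for the example.

```python
import re

URL_RE = re.compile(r"https?://\S+")

# Hypothetical extracted text: the URL wraps after "and-", and the following
# text ("All ...") runs into the end of the URL with no whitespace in between.
extracted = (
    "https://www.cisa.gov/topics/election-security/foreign-influence-operations-and-\n"
    "disinformationAll other text"
)


def stop_at_line_break(text):
    """Treat the line break as the end of the URL."""
    return URL_RE.search(text).group(0)


def join_lines_first(text):
    """Remove line breaks before matching, so wrapped URLs are rejoined."""
    return URL_RE.search(text.replace("\n", "")).group(0)


print(stop_at_line_break(extracted))  # truncated: ends in "...operations-and-"
print(join_lines_first(extracted))    # overshoots: ends in "...disinformationAll"
```

Neither strategy recovers the real URL: stopping at the line break truncates it, and joining lines swallows the text that follows. Nothing in the character stream itself marks where the link ends.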
HuggingChat failed to extract the links correctly on the first try.
This is now fixed for PDFs that contain annotations. Documents without annotations cannot be reliably extracted without the use of advanced ML.
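For readers wondering what annotation-based extraction can look like, here is a rough sketch. It assumes "annotations" means PDF link annotations (`/Subtype /Link` with a `/URI` action) and uses pypdf, the successor to pypdf2, to open the file. The function names are hypothetical and this is not the actual fix that was shipped.

```python
def uris_from_annotations(pages):
    """Collect /URI targets from link annotations on page-like mappings."""
    uris = []
    for page in pages:
        if "/Annots" not in page:
            continue
        for ref in page["/Annots"]:
            # pypdf hands back indirect references; plain dicts are used as-is.
            annot = ref.get_object() if hasattr(ref, "get_object") else ref
            if annot.get("/Subtype") != "/Link" or "/A" not in annot:
                continue
            action = annot["/A"]
            if "/URI" in action:
                uris.append(str(action["/URI"]))
    return uris


def uris_from_pdf(path):
    """Open a PDF with pypdf (third-party, imported lazily) and list its link URIs."""
    from pypdf import PdfReader

    return uris_from_annotations(PdfReader(path).pages)
```

Because the URI is stored as a single string in the annotation, it is not affected by how the visible text wraps across lines.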
Stalled, so closing.
From:
https://www.foundationforfreedomonline.com/wp-content/uploads/2023/03/FFO-FLASH-REPORT-REV.pdf
I see this URL:
https://www.cisa.gov/topics/election-security/foreign-influence-operations-and-disinformation
But the IARE parser represents it as:
https://www.cisa.gov/topics/election-security/foreign-influence-operations-and-
This results in a false 404.