internetarchive / iari

Import workflows for the Wikipedia Citations Database
GNU General Public License v3.0

As a patron I want to analyze any URL to a PDF with IARI and get a list of all the valid links #708

Closed dpriskorn closed 1 year ago

dpriskorn commented 1 year ago

https://www.cnn.com/2023/03/29/opinions/russia-putin-nuclear-blackmail-belarus-giles/index.html

dpriskorn commented 1 year ago

This is a broadening of scope.

dpriskorn commented 1 year ago

I could make new webpage/ and pdf/ endpoints for this, or alternatively one endpoint for both.

Sequence for the webpage/ endpoint:

1. Download the page using requests. If that is not possible, return a simple error, e.g. 400.
2. Analyze the HTML using Beautiful Soup.
3. Find links/references.
4. Return JSON with all links found.

For pdf/:

1. Get the size of the PDF using a HEAD request.
2. Download the PDF if it is under our current size threshold, e.g. 100 MB (see the sketch below).
3. Extract the text of the PDF using a library.
4. Return JSON with all the links found.
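A minimal sketch of the HEAD-based size check, assuming requests and a 100 MB threshold; relying on the Content-Length header is an assumption, since some servers omit it, and the function name download_pdf_if_small_enough is hypothetical, not part of IARI:

import requests

MAX_PDF_SIZE = 100 * 1024 * 1024  # assumed 100 MB threshold

def download_pdf_if_small_enough(url: str):
    # ask for the size first with a HEAD request
    head = requests.head(url, allow_redirects=True, timeout=10)
    size = int(head.headers.get("Content-Length", 0))
    if size > MAX_PDF_SIZE:
        # too big: skip the download entirely
        return None
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        # mirror the webpage/ behaviour: signal a simple error upstream
        return None
    return response.content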

dpriskorn commented 1 year ago

ChatGPT code for PDFs:

import requests
import io
import PyPDF2

# download the PDF from a URL
url = "https://www.example.com/example.pdf"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
pdf_file = io.BytesIO(response.content)

# extract all the link annotations from the PDF
pdf = PyPDF2.PdfReader(pdf_file)  # PdfFileReader is deprecated in PyPDF2 3.x
for page in pdf.pages:
    # not every page has annotations, so fall back to an empty list
    for annotation in page.get("/Annots") or []:
        annotation = annotation.get_object()
        if annotation.get("/Subtype") == "/Link" and "/A" in annotation:
            action = annotation["/A"]
            if "/URI" in action:
                print(action["/URI"])
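The annotation loop above only finds links stored as link annotations. Since the plan also mentions extracting the text of the PDF, here is a minimal sketch that scans the extracted text for URLs with a regex; the pattern is an assumption and will miss or over-match some URLs:

import re

# reuse the reader from above and look for URLs in the page text
url_pattern = re.compile(r"https?://\S+")  # rough pattern for illustration only
for page in pdf.pages:
    text = page.extract_text() or ""
    for match in url_pattern.findall(text):
        print(match)
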
dpriskorn commented 1 year ago

Here is the corresponding code for webpages:

import requests
from bs4 import BeautifulSoup

def extract_links(url):
    # download the webpage
    response = requests.get(url)
    response.raise_for_status()  # surface a simple error (e.g. 400) if the download fails
    # extract all the links from the HTML content
    soup = BeautifulSoup(response.content, "html.parser")
    links = []
    for link in soup.find_all("a"):
        href = link.get("href")
        if href is not None:
            links.append(href)

    return links
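As a usage sketch, the returned list could be wrapped into the JSON response of the proposed webpage/ endpoint; the output shape here is just an assumption:

import json

# example: analyze the CNN article from the first comment and emit JSON
links = extract_links(
    "https://www.cnn.com/2023/03/29/opinions/russia-putin-nuclear-blackmail-belarus-giles/index.html"
)
print(json.dumps({"links": links, "count": len(links)}, indent=2))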