Closed dpriskorn closed 1 year ago
This is a broadening of scope.
I could make new endpoints `webpage/` and `pdf/` for this, or alternatively one endpoint for both.

Sequence for the `webpage/` endpoint:

1. Download the page using requests.
2. If not possible, give back a simple error, e.g. 400.
3. Analyze the HTML using Beautiful Soup.
4. Find links/references.
5. Return JSON with all links found.

For `pdf/`:

1. Get the size of the PDF using a HEAD request.
2. Download the PDF if it is under our current size threshold (e.g. 100 MB).
3. Extract the text of the PDF using a library.
4. Return JSON with all the links found.
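The HEAD-request size check in the `pdf/` sequence could be sketched as below. This is only a sketch: `MAX_PDF_BYTES`, `within_threshold`, and `pdf_is_small_enough` are made-up names, and refusing downloads when the server omits `Content-Length` is one possible policy, not something decided in this issue.

```python
from typing import Optional

import requests

# hypothetical threshold matching the ~100 MB figure above
MAX_PDF_BYTES = 100 * 1024 * 1024


def within_threshold(content_length: Optional[str], max_bytes: int = MAX_PDF_BYTES) -> bool:
    """Decide from a Content-Length header value whether a download is allowed.

    Servers that omit the header are refused, to stay conservative.
    """
    if content_length is None:
        return False
    return int(content_length) <= max_bytes


def pdf_is_small_enough(url: str) -> bool:
    """Issue a HEAD request and check the reported size against the threshold."""
    response = requests.head(url, allow_redirects=True, timeout=10)
    return within_threshold(response.headers.get("Content-Length"))
```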
ChatGPT code for PDFs:
```python
import io

import requests
from PyPDF2 import PdfReader  # PdfReader replaces the deprecated PdfFileReader

# download the PDF from a URL
url = "https://www.example.com/example.pdf"
response = requests.get(url)
response.raise_for_status()
pdf_file = io.BytesIO(response.content)

# extract all the link annotations from the PDF
pdf = PdfReader(pdf_file)
for page in pdf.pages:
    # not every page carries annotations
    for annotation in page.get("/Annots", []):
        annotation = annotation.get_object()
        if annotation.get("/Subtype") == "/Link" and "/A" in annotation:
            uri = annotation["/A"].get("/URI")
            if uri is not None:
                print(uri)
```
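Annotations only cover clickable links; the "extract the text of the PDF" step above suggests also scanning the extracted text (e.g. what `page.extract_text()` returns in PyPDF2) for bare URLs. A rough sketch, where `links_in_text` and the regex are assumptions rather than a full URL grammar:

```python
import re

# rough URL pattern; stops at whitespace and common closing brackets
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def links_in_text(text: str) -> list:
    """Find bare URLs in a blob of extracted PDF text,
    trimming trailing sentence punctuation."""
    return [url.rstrip(".,;") for url in URL_PATTERN.findall(text)]
```

Feeding it each page's `extract_text()` output would catch URLs printed as plain text rather than embedded as annotations.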
Here is the code for webpages:
```python
import requests
from bs4 import BeautifulSoup


def extract_links(url):
    """Download a webpage and return all href values from its <a> tags."""
    response = requests.get(url)
    response.raise_for_status()
    # extract all the links from the HTML content
    soup = BeautifulSoup(response.content, "html.parser")
    links = []
    for link in soup.find_all("a"):
        href = link.get("href")
        if href is not None:
            links.append(href)
    return links
```
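Both endpoints are supposed to return JSON with all links found; a minimal sketch of that response body, assuming a plain dict serialized with the standard json module (`links_response` and the field names are made up here, the issue does not fix a schema):

```python
import json


def links_response(url, links):
    """Build the JSON body an endpoint could return:
    the requested URL plus every link found."""
    return json.dumps({"url": url, "links": links, "count": len(links)})
```

A web framework would normally handle the serialization itself, but the dict shape is the part worth agreeing on.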
Example URL to test with: https://www.cnn.com/2023/03/29/opinions/russia-putin-nuclear-blackmail-belarus-giles/index.html