hack4impact-calpoly / ecologistics-web-scraper

https://ecologistics-web-scraper.vercel.app

Research Web Scraping #29

Closed by oli-lane 6 months ago

oli-lane commented 7 months ago

There are many different Python web scraping libraries we could use for this project, each with advantages and disadvantages in terms of flexibility, usability, and scraping capability. It's important that we choose the right tools now so that we can avoid as many implementation issues as possible down the line. One of the sites we'll (probably) be scraping is the SLO public meetings calendar. For this task, you will research some of the available Python libraries and add your recommendation as a comment on this task, with your reasoning.

A couple of notes about the SLO public meetings calendar site that might complicate web scraping:

Here are some potential libraries to get you started:

Requirements:

O4FDev commented 7 months ago

Hi, just a quick disclaimer: I am not in any way affiliated with hack4impact (frankly, I haven't looked at what the org does or its politics, so for all I know it could be Darth Vader's subsidiary) or Cal Poly (though if anyone at the uni wishes to refer me, my email is luke.lucas@ou.ac.uk 🤷🏻‍♂️). I am just a web scraping researcher and student at the Open University of Ireland who happened to find this repo, so I have zero context and am just trying to figure out how I can help a little based on the comment above.

I had a quick look at the site you're potentially trying to scrape. It turns out all of the useful data you want is contained in this page, https://slocounty.granicus.com/ViewPublisher.php?view_id=48, which has all of the necessary HTML at request time.

You can find this by opening the browser's network tab, waiting until that content renders on the page, and checking which requests were made at that time. It's a webpage inside of a webpage, which is hella cool; props to the developer who made that.

That makes it possible to get all the needed HTML with a single GET request. Here's an example: curl "https://slocounty.granicus.com/ViewPublisher.php?view_id=48" -o scrape.html. The HTML output looks oddly formatted, but if you scroll down you'll find everything you need. (Maybe delete the GTag stuff before running anything against it.)
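
The same fetch in Python is one Requests call. A minimal sketch; the URL is the one above, everything else is plain requests usage:

```python
import requests

# Same request as the curl example above: the calendar iframe
# returns fully rendered HTML, no JavaScript execution needed.
URL = "https://slocounty.granicus.com/ViewPublisher.php?view_id=48"

response = requests.get(URL, timeout=30)
response.raise_for_status()  # fail loudly on 4xx/5xx

with open("scrape.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```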

From that point, things like agendas are just <a> tags, so you only need to grab the href, e.g. <a href="//slocounty.granicus.com/AgendaViewer.php?view_id=48&amp;event_id=4118" target="_blank">Agenda</a>, and throw it into some sort of queue system (or, even better, skip the queue entirely; you likely won't need one if the number of pages is under your processing limits). See the sketch below.
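
Here's a minimal BeautifulSoup sketch of that href extraction. It assumes the link text is literally "Agenda", as in the example above; the selector may need adjusting against the live page:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://slocounty.granicus.com/ViewPublisher.php?view_id=48"
html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Grab every <a> whose visible text is "Agenda"; the hrefs are
# protocol-relative ("//slocounty..."), so prefix a scheme.
agenda_urls = [
    "https:" + a["href"] if a["href"].startswith("//") else a["href"]
    for a in soup.find_all("a", string="Agenda")
    if a.has_attr("href")
]
print(agenda_urls)
```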

You're on your own for PDF scraping, though, as I've never touched that area.

The advantage of this is that it's incredibly cheap to run a GET request on a cron job. You could probably just do this on railway.app with a Python/Node cron script that runs every day, fetches the HTML with Requests, parses it with something like BeautifulSoup, and then does whatever you need with the data from that point. You'd be well under the free-tier limits.
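
As a rough sketch of what that daily job could look like (the crontab line, file path, and the row-handling placeholder are all hypothetical, not anything this repo has decided on):

```python
#!/usr/bin/env python3
# Hypothetical daily scrape; wire it up with a crontab entry like:
#   0 6 * * * /usr/bin/python3 /opt/scraper/daily_scrape.py
import requests
from bs4 import BeautifulSoup

URL = "https://slocounty.granicus.com/ViewPublisher.php?view_id=48"

def main():
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        # Placeholder: extract whatever fields the project needs
        # (meeting name, date, agenda link) and store/notify here.
        print(row.get_text(" ", strip=True))

if __name__ == "__main__":
    main()
```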

If you run into any issues like IP blocking, or the state plays a cat-and-mouse game over the structure of the data, send me an email and I can share my research into how I did this at scale (for free, across millions of URLs).

sanjanachecker commented 7 months ago

Just some research on the provided frameworks, if we were to use them: h4i web scraping research.pdf

ishavarrier commented 7 months ago

Research on a few libraries and the overall process: Web Scraping Research

wesleytam88 commented 7 months ago

Researched Selenium and MechanicalSoup, as well as a couple of PDF parsers: Web Scraping Research.pdf
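
For reference, pulling raw text out of a downloaded agenda PDF is only a few lines with a parser like pypdf. A minimal sketch; the filename is hypothetical, standing in for an agenda PDF fetched via one of the scraped hrefs:

```python
from pypdf import PdfReader

# Hypothetical filename; in practice this would be an agenda PDF
# downloaded from one of the links scraped off the calendar page.
reader = PdfReader("agenda.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```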

Robert303V commented 7 months ago

Looked into a few libraries and whether or not they had the functionality our notes called for, and looked further into the Pyppeteer and Playwright documentation (sketch below).

Web Scraping Research.pdf
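
If the calendar ever did require JavaScript rendering (which the Granicus iframe apparently does not), a headless-browser fetch with Playwright would be the fallback. A minimal sketch, assuming Playwright and its Chromium build are installed:

```python
from playwright.sync_api import sync_playwright

URL = "https://slocounty.granicus.com/ViewPublisher.php?view_id=48"

# Launch headless Chromium, let the page render, and grab the
# final DOM; requires `pip install playwright` followed by
# `playwright install chromium` beforehand.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL)
    html = page.content()
    browser.close()

print(len(html))
```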

jam-kt commented 7 months ago

A few general thoughts on scraping libraries/frameworks:

Python Webscraping.pdf