hack4impact-calpoly / ecologistics-web-scraper

https://ecologistics-web-scraper.vercel.app

Research Web Scraping #29

Closed by oli-lane 6 months ago

oli-lane commented 7 months ago

There are many different Python web scraping libraries we could use for this project, each with advantages and disadvantages in terms of flexibility, usability, and scraping capability. It's important that we choose the right tools now so that we can avoid as many implementation issues as possible down the line. One of the sites we'll (probably) be scraping is the SLO public meetings calendar. For this task, you will research some of the available Python libraries and add your recommendation as a comment on this task, with your reasoning.

A couple of notes about the SLO public meetings calendar site that might complicate web scraping:

Here are some potential libraries to get you started:

Requirements:

O4FDev commented 7 months ago

Hi, just a quick disclaimer: I am not in any way affiliated with hack4impact (frankly, I haven't looked at what the org does or its politics, so for all I know it could be Darth Vader's subsidiary) or Cal Poly (though if anyone at the uni wishes to refer me, my email is luke.lucas@ou.ac.uk 🤷🏻‍♂️). I am just a web scraping researcher and student at the Open University of Ireland who happened to find this repo, so I have zero context and am just trying to figure out how I can help a little based on the comment above.

I had a quick look at the site you're potentially trying to scrape. It turns out all of the useful data you want is contained in this page, https://slocounty.granicus.com/ViewPublisher.php?view_id=48, which has all of the necessary HTML at request time.

You can find this by opening the browser's network tab, waiting until that content renders on the page, and checking which requests were made at that time. It's a webpage inside of a webpage, which is hella cool; props to the developer who made that.

That makes it possible to get all the needed HTML with a single GET request. Here's an example: curl "https://slocounty.granicus.com/ViewPublisher.php?view_id=48" -o scrape.html. The HTML output looks oddly formatted, but if you scroll down you'll find everything you need. (Maybe delete the GTag stuff before running anything against it.)
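
The same fetch in Python is one Requests call. A minimal sketch; the URL is the one above, everything else is plain requests usage:

```python
import requests

# Same request as the curl example above: the calendar iframe
# returns fully rendered HTML, no JavaScript execution needed.
URL = "https://slocounty.granicus.com/ViewPublisher.php?view_id=48"

response = requests.get(URL, timeout=30)
response.raise_for_status()  # fail loudly on 4xx/5xx

with open("scrape.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```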

From that point, things like agendas are just <a> tags, so you only need to grab the href, e.g. <a href="//slocounty.granicus.com/AgendaViewer.php?view_id=48&amp;event_id=4118" target="_blank">Agenda</a>, and throw it into some sort of queue system (or, even better, skip the queue entirely; you likely won't need one if the number of pages is under your processing limits). See the sketch below.
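
Here's a minimal BeautifulSoup sketch of that href extraction. It assumes the link text is literally "Agenda", as in the example above; the selector may need adjusting against the live page:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://slocounty.granicus.com/ViewPublisher.php?view_id=48"
html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Grab every <a> whose visible text is "Agenda"; the hrefs are
# protocol-relative ("//slocounty..."), so prefix a scheme.
agenda_urls = [
    "https:" + a["href"] if a["href"].startswith("//") else a["href"]
    for a in soup.find_all("a", string="Agenda")
    if a.has_attr("href")
]
print(agenda_urls)
```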

You're on your own for PDF scraping, though, as I've never touched that area.

The advantage of this is that it's incredibly cheap to run a GET request on a cron job. You could probably just do this on railway.app with a Python/Node cron script that runs every day, fetches the HTML with Requests, parses it with something like BeautifulSoup, and then does whatever you need with the data from that point. You'd be well under the free-tier limits.
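
As a rough sketch of what that daily job could look like (the crontab line, file path, and the row-handling placeholder are all hypothetical, not anything this repo has decided on):

```python
#!/usr/bin/env python3
# Hypothetical daily scrape; wire it up with a crontab entry like:
#   0 6 * * * /usr/bin/python3 /opt/scraper/daily_scrape.py
import requests
from bs4 import BeautifulSoup

URL = "https://slocounty.granicus.com/ViewPublisher.php?view_id=48"

def main():
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        # Placeholder: extract whatever fields the project needs
        # (meeting name, date, agenda link) and store/notify here.
        print(row.get_text(" ", strip=True))

if __name__ == "__main__":
    main()
```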

If you run into any issues like IP blocking, or the state plays a cat-and-mouse game over the structure of the data, send me an email and I can share my research into how I did this at scale (for free, across millions of URLs).

sanjanachecker commented 7 months ago

Just some research on the provided frameworks, if we were to use them: h4i web scraping research.pdf

ishavarrier commented 7 months ago

Research on a few libraries and the overall process: Web Scraping Research

wesleytam88 commented 7 months ago

Researched Selenium and MechanicalSoup, as well as a couple of PDF parsers: Web Scraping Research.pdf
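
For reference, pulling raw text out of a downloaded agenda PDF is only a few lines with a parser like pypdf. A minimal sketch; the filename is hypothetical, standing in for an agenda PDF fetched via one of the scraped hrefs:

```python
from pypdf import PdfReader

# Hypothetical filename; in practice this would be an agenda PDF
# downloaded from one of the links scraped off the calendar page.
reader = PdfReader("agenda.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```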

Robert303V commented 7 months ago

Looked into a few libraries and whether or not they had the functionality our notes called for, and looked further into the Pyppeteer and Playwright documentation (sketch below).

Web Scraping Research.pdf
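
If the calendar ever did require JavaScript rendering (which the Granicus iframe apparently does not), a headless-browser fetch with Playwright would be the fallback. A minimal sketch, assuming Playwright and its Chromium build are installed:

```python
from playwright.sync_api import sync_playwright

URL = "https://slocounty.granicus.com/ViewPublisher.php?view_id=48"

# Launch headless Chromium, let the page render, and grab the
# final DOM; requires `pip install playwright` followed by
# `playwright install chromium` beforehand.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL)
    html = page.content()
    browser.close()

print(len(html))
```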

jam-kt commented 7 months ago

A few general thoughts on scraping libraries/frameworks:

Python Webscraping.pdf