russelldj / military-grant-funding-exploration

Tools for scraping public grant funding data, with a special focus on assessing the impacts of military industrial complex funding

Scrape UC grant sponsor data #1

Open distraughteagle opened 4 months ago

distraughteagle commented 4 months ago

Goal is to scrape this site for sponsor name, project type, dollar amount, fiscal year, and campus. This requires us to select year and campus from the page.

The problem is that the buttons for making these selections are rendered by JavaScript, so they do not appear in the static page source and cannot be identified the way ordinary HTML elements can. Following the attached tutorial, we could find the JS script links and render them as HTML using Selenium.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)

# Now we use the driver to render the JavaScript webpage.
driver.get("https://www.universityofcalifornia.edu/sites/default/files/js/js_m-z7wfJFr8kC-DgZaM56y9hW43AEyQOcocz_6mfhuzo.js")
# page_source holds the HTML markup of whatever the browser loaded and rendered for that URL.
page_source = driver.page_source

# Shut down the headless browser once the markup has been captured.
driver.quit()

However, I don't know how to locate the desired elements within the mess of HTML I get as a result.

Attachment: How to scrape JavaScript webpages using Selenium in Python by Lynn G. Kwong Medium.pdf
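One possible way to get a foothold in the rendered HTML is to list only the tags that usually back interactive controls, together with their attributes, and then look for a usable `id` or `class`. The sketch below uses only the standard library. The sample markup, the `CONTROL_TAGS` set, and the `find_controls` helper are all hypothetical stand-ins; the real page's structure is not known here.

```python
from html.parser import HTMLParser

# Tags that commonly back dropdowns, buttons, and embedded widgets.
# This set is an assumption; adjust it to whatever the real page uses.
CONTROL_TAGS = {"select", "option", "button", "input", "iframe"}


class ControlFinder(HTMLParser):
    """Collect (tag, attributes) pairs for interactive elements."""

    def __init__(self):
        super().__init__()
        self.controls = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples.
        if tag in CONTROL_TAGS:
            self.controls.append((tag, dict(attrs)))


def find_controls(page_source):
    """Return every control-like element found in the HTML string."""
    finder = ControlFinder()
    finder.feed(page_source)
    finder.close()
    return finder.controls


# Hypothetical markup standing in for driver.page_source.
SAMPLE_HTML = """
<div class="toolbar">
  <select id="fiscal-year"><option value="2020">2020</option></select>
  <select id="campus"><option value="UCD">Davis</option></select>
  <button class="apply">Apply</button>
</div>
"""

if __name__ == "__main__":
    for tag, attrs in find_controls(SAMPLE_HTML):
        print(tag, attrs)
```

If the listing comes back empty for the real page, that would suggest the controls live in an embedded frame or are drawn without standard form elements, which is worth knowing before writing any selectors.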

russelldj commented 4 months ago

I'm also struggling to make any progress on this. For the DTIC scraping, I relied heavily on html data_bt fields in the elements, which seem to be entirely missing here.

As a bigger-picture question, am I correct that this portal only gives us aggregate funding and per-project level descriptions?

russelldj commented 4 months ago

I made a little bit of progress on the Award Explorer by going directly to the embedded webpage that has the charts and toolbars here. Sorry it's not really documented or explained yet; I'll improve it in the future.
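For anyone repeating this step, a minimal sketch of how an embedded page's address might be found: collect the `src` of every `iframe` in the outer page and resolve it against the outer page's URL. The sample markup, both URLs, and the `embedded_urls` helper are hypothetical; this is not necessarily how the embedded Award Explorer page was actually located.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every iframe in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def embedded_urls(page_source, base_url):
    """Return absolute URLs for all iframes found in page_source."""
    finder = IframeFinder()
    finder.feed(page_source)
    finder.close()
    # urljoin resolves relative paths and leaves absolute URLs unchanged.
    return [urljoin(base_url, src) for src in finder.sources]


# Hypothetical outer page and base URL, for illustration only.
SAMPLE_PAGE = (
    "<html><body><h1>Awards</h1>"
    '<iframe src="/embed/charts?view=1"></iframe>'
    '<iframe src="https://example.org/viz"></iframe>'
    "</body></html>"
)
BASE_URL = "https://example.com/awards/index.html"

if __name__ == "__main__":
    for url in embedded_urls(SAMPLE_PAGE, BASE_URL):
        print(url)
```

Loading one of the resulting URLs directly in the browser or in Selenium avoids having to switch into the frame from the outer page.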

distraughteagle commented 4 months ago

Got a working script here: b51c021. Not the most elegant, but it works! It would be nice to include "NSF Discipline" in the aggregate sponsor sheet, since it is available on the same page; I just skipped it for now.
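As a rough illustration of what carrying "NSF Discipline" through to the aggregate sheet could look like, the sketch below groups scraped rows by sponsor, discipline, fiscal year, and campus and sums the dollar amounts. It is not taken from b51c021: the field names, the `aggregate_by_sponsor` and `to_csv` helpers, and every row of sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["sponsor", "nsf_discipline", "fiscal_year", "campus", "total_amount"]

# Hypothetical scraped rows; real values would come from the scraper.
ROWS = [
    {"sponsor": "Sponsor A", "nsf_discipline": "Engineering",
     "fiscal_year": 2020, "campus": "Campus X", "amount": 1000.0},
    {"sponsor": "Sponsor A", "nsf_discipline": "Engineering",
     "fiscal_year": 2020, "campus": "Campus X", "amount": 250.0},
    {"sponsor": "Sponsor B", "nsf_discipline": "Life Sciences",
     "fiscal_year": 2020, "campus": "Campus X", "amount": 500.0},
]


def aggregate_by_sponsor(rows):
    """Sum amounts per (sponsor, discipline, fiscal year, campus)."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["sponsor"], row["nsf_discipline"],
               row["fiscal_year"], row["campus"])
        totals[key] += row["amount"]
    # Sort keys so the output order is stable between runs.
    return [dict(zip(FIELDS, key + (total,)))
            for key, total in sorted(totals.items())]


def to_csv(records):
    """Serialize aggregated records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(aggregate_by_sponsor(ROWS)))
```

Keeping the discipline in the grouping key means a sponsor that funds several disciplines gets one row per discipline rather than a single merged total; whether that is the desired shape for the sheet is an open design choice.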