mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0
75 stars, 53 forks

Jupyter Notebook for scraping the JS files #8

Closed (willougr closed 5 years ago)

willougr commented 6 years ago

Jupyter Notebook for scraping the JS files from the URLs listed in the data set, along with the URLs listed in the Princeton survey. Some potential optimizations for the code are noted within the file.

Note: at the moment this only works with a sample of the dataset small enough for pandas to hold in memory. Handling the entire dataset will require reworking the code to use Spark.
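
For readers without the notebook open, the general shape of the scraping step might look something like the sketch below. This is not the notebook's code: `fetch_url`, `scrape_js_files`, and the hashed file-naming scheme are hypothetical names and choices made for illustration. It takes any iterable of URL strings (for example, a column pulled from a pandas sample), skips duplicates, and saves each response body to disk.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Dict, Iterable


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw bytes at `url` (hypothetical default fetcher)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape_js_files(
    urls: Iterable[str],
    out_dir: Path,
    fetch: Callable[[str], bytes] = fetch_url,
) -> Dict[str, Path]:
    """Save each distinct URL's content under out_dir; return url -> saved path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: Dict[str, Path] = {}
    for url in dict.fromkeys(urls):  # de-duplicate, keeping first-seen order
        try:
            body = fetch(url)
        except Exception as exc:  # a dead link should not stop the whole run
            print(f"skipped {url}: {exc}")
            continue
        # Assumption: hash the URL to get a filesystem-safe file name.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".js"
        path = out_dir / name
        path.write_bytes(body)
        saved[url] = path
    return saved
```

Passing `fetch` as a parameter keeps the download step swappable, which may help if the loop is later moved to Spark or needs to be tested without network access.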

willougr commented 6 years ago

@birdsarah The JS file scraping code we chatted about.