This repository hosts the course website of Tilburg University's open education class on "Online Data Collection and Management" (oDCM) - learn how to collect web data for your empirical research projects!
Here are the highlights of a podcast about web scraping by a bootcamp instructor. It touches on a variety of topics related to oDCM (navigating the DOM, cleaning text data, timers, Selenium vs. other tools).
Kimberly Fessel (PhD) - Metis
Requests and BeautifulSoup
Two strategies:
Look for unique attributes (ids / classes)
Navigate the Document Object Model (DOM), a tree-like structure (children, siblings)
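The two strategies above can be sketched with BeautifulSoup as follows. This is a minimal illustration, not the approach shown in the podcast: the HTML snippet, the `books` id, and the `book` class are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only.
html = """
<div id="books">
  <h2 class="title">Catalogue</h2>
  <ul>
    <li class="book">First book</li>
    <li class="book">Second book</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Strategy 1: look for unique attributes (ids / classes).
container = soup.find(id="books")
titles = [li.get_text(strip=True) for li in soup.find_all("li", class_="book")]

# Strategy 2: navigate the DOM tree (children, siblings).
heading = container.find("h2")
book_list = heading.find_next_sibling("ul")        # sibling of the heading
first_item = book_list.find("li")                  # first <li> inside the list
second_item = first_item.find_next_sibling("li")   # its next sibling

print(titles)
print(second_item.get_text(strip=True))
```

Attribute lookups tend to be shorter and more robust; DOM navigation helps when the element of interest has no distinguishing id or class of its own.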
Selenium is the solution when Requests alone does not return the data; it launches a Google Chrome driver. Sometimes it is as simple as launching the site with Selenium and then processing the data with Requests and BeautifulSoup.
Other advantages: clicking on things and filling out fields
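A rough sketch of the "launch with Selenium, then parse with BeautifulSoup" workflow might look like this. It assumes Chrome and a matching driver are available on the machine; the function names `fetch_rendered_html` and `extract_links` and the `search` element id in the comment are hypothetical.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Open `url` in Chrome via Selenium and return the rendered page source."""
    # Imported lazily so the parsing helper below works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        # Clicking or filling out fields would go here, for example
        # (hypothetical element id, requires selenium.webdriver.common.by.By):
        #   driver.find_element(By.ID, "search").send_keys("query")
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def extract_links(html):
    """Parse HTML with BeautifulSoup and return all link targets."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

A call could then look like `extract_links(fetch_rendered_html("https://example.com"))`, where the URL is only a placeholder. Keeping the browser step and the parsing step separate means the parsing code can be developed on saved HTML without launching a browser each time.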
Scrapy - cloud deployment and building a "spider" (a scraper that keeps going and looks for new links)
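The "keeps going and looks for new links" idea can be illustrated without Scrapy itself. The sketch below is plain Python, not Scrapy's API: it crawls a made-up in-memory site (`SITE`) breadth-first and visits each page once. The `crawl` function and its `max_pages` limit are assumptions for the example.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first, following each newly discovered link once."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen:  # only queue links we have not met before
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/", lambda page: SITE.get(page, []))
print(order)
```

In a real spider, `get_links` would download the page and extract its links; the `seen` set is what keeps the crawler from looping forever on pages that link back to each other.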
Importance of visualising your results dynamically/interactively (D3, Plotly, Tableau)
Data widgets getting more mainstream (e.g., NYT) - people getting more data literate
Legality of Web Scraping
A video that answers the question of whether web scraping is legal. The presenters share your view and illustrate it with recent court cases:
A clear distinction has to be made between types of data:
Publicly available data (e.g., public LinkedIn profile)
User has made the data public
No account required for access
Not blocked by robots.txt
hiQ Labs case (scraped public LinkedIn profiles for workplace analytics)
Craigslist case (start-ups used its location data)
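The robots.txt criterion in the list above can be checked programmatically. A minimal sketch using Python's standard-library `urllib.robotparser` follows; the robots.txt rules, the user agent name `my-course-scraper`, and the URLs are made up for illustration. It covers only the robots.txt criterion, not the other two.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules from a list of lines instead of a URL

# Hypothetical user agent and URLs.
allowed_public = parser.can_fetch("my-course-scraper", "https://example.com/profiles/jane")
allowed_private = parser.can_fetch("my-course-scraper", "https://example.com/private/data")

print(allowed_public, allowed_private)
```

For a live site, one would typically call `set_url()` with the site's robots.txt address followed by `read()` instead of passing the lines by hand.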
Originally posted by @RoyKlaasseBos in https://github.com/hannesdatta/course-odcm/issues/14#issuecomment-733671043