hannesdatta/course-odcm
This repository hosts the course website of Tilburg University's open education class on "Online Data Collection and Management" (oDCM) - learn how to collect web data for your empirical research projects!
https://odcm.hannesdatta.com
add tips / tricks to project page
#84
Closed
hannesdatta
closed
2 years ago
hannesdatta
commented
2 years ago
Using separate lists vs. lists of dictionaries
Don’t break the structure of what belongs to what!
Looping: while loops are also an option!
Read the paper and align / update the code accordingly (e.g., metadata enrichment)
Clean up code (e.g., comments)
Modularise code (so that it works on multiple categories, pages, etc.)
Make “class names” flexible so that you don’t have to repeat yourself over and over again
Try & except: wrap one operation per block, not many things at the same time (otherwise you can't tell which step failed)
Store raw data as JSON - parse in a second step
Separate “seeding” from “collecting information” stage
Write the data as soon as you can to a file (e.g., JSON) - not only at the end of a long (1.5 days!) scraping session (minimise data loss)
Store all of the raw JSON first, and only then preprocess
Use selenium for dynamic websites (+ use new code snippet on Tilburg Science Hub to open up selenium)
For anonymisation: use a hash function (salted!):
https://nitratine.net/blog/post/how-to-hash-passwords-in-python/
For extended data collections - consider saving the raw HTML files first, and only then parse!
How to find max. page numbers? You can do some calculations with information from the site (e.g., for AH.nl: 1077 items / 36 items per page = 29.9, so round up to 30 pages)
Break up code into smaller modules (e.g., first seeds, then getting the data)
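
A few of the tips above, put together as a minimal sketch (not the course's reference code). Assumptions: the records in `scraped` are made up for illustration, all function names (`max_pages`, `pseudonymise`, `append_record`, `read_records`) are hypothetical, and the hash is a plain salted SHA-256 from `hashlib` rather than the PBKDF2 approach in the linked post, which is aimed at passwords.

```python
import hashlib
import json
import math
import os
import tempfile


def max_pages(total_items: int, items_per_page: int) -> int:
    """Round up so that a partially filled last page is still visited."""
    return math.ceil(total_items / items_per_page)


def pseudonymise(user_id: str, salt: bytes) -> str:
    """Salted SHA-256 digest of an identifier (same salt gives same digest)."""
    return hashlib.sha256(salt + user_id.encode("utf-8")).hexdigest()


def append_record(path: str, record: dict) -> None:
    """Append one record as a JSON line right away, so a crash loses little."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path: str) -> list:
    """Second step: load the stored raw records back for preprocessing."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Keep the salt secret and reuse it within one project,
# otherwise the same user gets a different hash each run.
salt = os.urandom(16)
out_path = os.path.join(tempfile.mkdtemp(), "raw.jsonl")

# Made-up records. A list of dictionaries keeps "what belongs to what"
# together, unlike separate lists for products, reviewers, and ratings.
scraped = [
    {"product": "example-a", "reviewer": "user123", "rating": 4},
    {"product": "example-b", "reviewer": "user456", "rating": 5},
]

for item in scraped:
    record = dict(item, reviewer=pseudonymise(item["reviewer"], salt))
    append_record(out_path, record)  # written immediately, not at the end

records = read_records(out_path)
print(max_pages(1077, 36))  # 30
print(len(records))  # 2
```

Writing one JSON object per line means the file stays readable even if the scraper stops halfway: every line written so far is a complete record.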