add tips / tricks to project page

Using separate lists vs. lists of dictionaries
- Don’t break the structure of what belongs to what!
Looping: while loops are also an option!
Read paper and align code / update code (e.g., metadata enrichment)
Clean up code (e.g., comments)
Modularising code (so that it works on multiple categories, pages, etc.)
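Make “class names” flexible (e.g., keep them in variables or pass them as function arguments) so that you don’t have to repeat yourself over and over again
Try & except: wrap one step per block, not many things at the same time (otherwise one failure hides which step broke and skips the rest)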
Store raw data as JSON - parse in a second step
Separate “seeding” from “collecting information” stage
Write the data as soon as you can to a file (e.g., JSON) - not only at the end of a long (1.5 days!) scraping session (minimise data loss)
Store all of the raw JSON first, and only preprocess afterwards
Use selenium for dynamic websites (+ use new code snippet on Tilburg Science Hub to open up selenium)
For anonymisation: use a hash function (salted!): https://nitratine.net/blog/post/how-to-hash-passwords-in-python/
For extended data collections - consider saving the raw html files first, and parse only afterwards!
How to find max. page numbers? You can do some calculations with information from the site (e.g., for AH.nl: 1077 items / 36 items per page = 29.9, so round up to 30 pages)
Break up code into smaller modules (e.g., first seeds, then getting the data)
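
Sketch for the "separate lists vs. lists of dictionaries" tip: a minimal illustration of how separate lists can lose track of what belongs to what. The product names and prices are made up for the example.

```python
# Hypothetical scraped values; in practice these come from the parsed page.
names = ["Milk", "Bread", "Cheese"]
prices = [1.09, 2.49]  # the price for "Bread" was missing on the page

# Separate lists: zip() pairs by position, so "Bread" wrongly gets the
# price of "Cheese", and "Cheese" is silently dropped.
pairs = list(zip(names, prices))
print(pairs)

# List of dictionaries: each product keeps its own fields together.
products = [
    {"name": "Milk", "price": 1.09},
    {"name": "Bread", "price": None},  # the missing price stays with its product
    {"name": "Cheese", "price": 2.49},
]

for product in products:
    print(product["name"], product["price"])
```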
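
Sketch for the "try & except" tip: one way to keep each try/except around a single step, so that a failure in one field does not throw away the others. The function name `parse_product` and the field names are hypothetical.

```python
def parse_product(raw):
    """Parse one product; `raw` is a hypothetical dict of scraped strings."""
    product = {"name": raw.get("name")}

    # One try/except per step: a bad price must not wipe out the rating.
    try:
        product["price"] = float(raw["price"])
    except (KeyError, ValueError):
        product["price"] = None

    try:
        product["rating"] = int(raw["rating"])
    except (KeyError, ValueError):
        product["rating"] = None

    return product


result = parse_product({"name": "Milk", "price": "n/a", "rating": "4"})
print(result)
```

Catching only the exceptions you expect (here `KeyError` and `ValueError`) also keeps unrelated bugs visible instead of hiding them.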
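
Sketch for the "write the data as soon as you can" tip: a minimal example that appends every record to a file right after it is collected, one JSON object per line. The helper names, the file name and the records are assumptions for illustration; the demo writes to a temporary folder.

```python
import json
import os
import tempfile


def append_record(record, path):
    """Append one record as a single JSON line, so earlier records survive a crash."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Read the file back: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with hypothetical records, written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "raw_data.json")
    for record in [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]:
        append_record(record, path)  # written immediately, not at the end
    saved = read_records(path)

print(saved)
```

Parsing and cleaning can then happen in a second script that only reads this file.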
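
Sketch for the anonymisation tip: a minimal salted hash using SHA-256 from Python's `hashlib`. This is one possible approach and not necessarily the one from the linked post; the function name `anonymise` and the user names are hypothetical.

```python
import hashlib
import secrets

# Generate the salt once and keep it private. Reuse the same salt for the
# whole data collection so the same user always maps to the same pseudonym.
SALT = secrets.token_bytes(16)


def anonymise(user_id, salt=SALT):
    """Return a salted SHA-256 hash of a user name as a hex string."""
    return hashlib.sha256(salt + user_id.encode("utf-8")).hexdigest()


print(anonymise("alice"))
print(anonymise("alice") == anonymise("alice"))  # same input, same pseudonym
```

Without the salt, anyone could hash a list of known user names and match them against the data set.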
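
Sketch for the "max. page numbers" tip: the calculation from the AH.nl example, rounded up so that the last, partly filled page is included. The URL pattern is a placeholder and not the real site's.

```python
import math

total_items = 1077    # number of items reported by the site (AH.nl example)
items_per_page = 36   # items shown on one page

# Round up: 1077 / 36 is a bit under 30, and the last page is only partly filled.
n_pages = math.ceil(total_items / items_per_page)
print(n_pages)  # 30

# Hypothetical URL pattern; the real pagination parameter may differ.
urls = [f"https://example.com/products?page={page}" for page in range(1, n_pages + 1)]
print(urls[0], urls[-1])
```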
