hannesdatta / course-odcm

This repository hosts the course website of Tilburg University's open education class on "Online Data Collection and Management" (oDCM) - learn how to collect web data for your empirical research projects!
https://odcm.hannesdatta.com

Develop sitemap for building blocks #1

Closed hannesdatta closed 3 years ago

hannesdatta commented 3 years ago

@RoyKlaasseBos, please develop a menu hierarchy for our building blocks and code snippets. The dev branch already has a couple of examples.

There are several example documents to start from, which I am posting below. There's some "housekeeping"/"integration" to be done.

hannesdatta commented 3 years ago

NOTES 1

  1. Retrieving data
    a. Making a connection
      i. Overview/principle
        1. Visible browser
        2. Headless browser
        3. API
      ii. Tools
        1. Python
        2. R
        3. Commercial tools
    b. Locating content
      i. Overview/principles
        1. From source code to elements of interest
        2. Stability of selectors
      ii. CSS
      iii. XPath
      iv. Regular expressions
    c. Saving data
      i. Textual data
        1. Parsing to CSV
        2. Cleaning on-the-fly (e.g., also anonymization)
      ii. Image data
        1. Save as JPG/PNG
      iii. Other data types
    d. Navigation and interaction with site
      i. Principles/overview
        1. Navigate to sample units
        2. Navigate within sample units (pagination)
        3. Retrieval limits
      ii. Navigation tech
        1. Navigate by URL
          a. Base URLs
          b. Parameters in URLs (increasing &limit)
        2. Navigate by clicking
        3. Navigate by scrolling
        4. Interacting with forms
        5. Waiting
        6. Changing IP addresses
      iii. Looping
  2. Storing data
    a. Principles/overview
      i. Prototyping; during vs. after the scrape
      ii. Storage of raw data: yes vs. no
      iii. Parsing on-the-fly vs. after scrape
      iv. Storage technology
        1. Considerations: stability, scalability, backup, assessing space requirements
        2. Storage location: Local vs. Remote
        3. Storage type: Files vs. Databases
        4. Storage format: JSON/tabular, CSV, …
    b. File-based storage
      i. Local
        1. Directory structure
        2. Automated zipping and wiping
      ii. Remote
        1. Mirroring local directory structure to cloud
    c. Database storage
      i. Local
        1. Structured databases (tabular data)
        2. Unstructured databases (JSON data)
      ii. Remote
        1. Structured databases (tabular data)
        2. Unstructured databases (JSON data)
    d. Meta data enrichment
      i. Time stamps
        1. Time of job initiation
        2. Time of actual scrape (unixtime)
      ii. IP address
      iii. Experimental conditions
      iv. Screenshots
  3. Deployment of data collection
    a. Execution of data collection
      i. Infrastructure
        1. Local
        2. Remotely (EC2, …)
      ii. Scheduling
        1. Frequency of execution
        2. Triggers
        3. Restarts after errors
    b. Storage considerations
      i. Assessing space requirements
      ii. Storage location
        1. Locally
        2. Remotely
      iii. Storage type
        1. File-based approach
        2. Database approach
          a. Structured database
          b. Unstructured database
      iv. After scraping
        • Typically, in flattened CSV or JSON files ("text"); files; images
    c. Monitoring and Handling Errors
      i. Execution / triggers
      ii. Collected raw data
      iii. Parsed data
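
To give a feel for what a single building block under this outline could look like, here is a minimal Python sketch that combines three of the items above: navigating by URL parameters, enriching records with time stamps, and parsing to CSV. It is an illustration only, not code from the repository. The endpoint `https://example.com/api/items`, the `limit`/`offset` parameter names, and the helper names `build_page_urls`, `enrich`, and `to_csv` are all assumptions made for the example, and no request is actually sent.

```python
import csv
import io
import time
from urllib.parse import urlencode


def build_page_urls(base_url, n_pages, limit=50):
    """Return one URL per page by varying hypothetical limit/offset parameters."""
    return [
        f"{base_url}?{urlencode({'limit': limit, 'offset': page * limit})}"
        for page in range(n_pages)
    ]


def enrich(record, job_started_at):
    """Attach two time stamps: job initiation and actual scrape (Unix time)."""
    return {**record, "job_started_at": job_started_at, "scraped_at": int(time.time())}


def to_csv(rows):
    """Serialize a list of dicts to CSV text; columns follow the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


job_started_at = int(time.time())
# Hypothetical endpoint; a real scraper would fetch each URL and parse the response.
urls = build_page_urls("https://example.com/api/items", n_pages=3, limit=50)
# Placeholder records stand in for parsed responses.
rows = [enrich({"url": url}, job_started_at) for url in urls]
csv_text = to_csv(rows)
print(csv_text)
```

Keeping URL construction, enrichment, and serialization in separate functions mirrors the split between "Navigation", "Meta data enrichment", and "Saving data" in the outline, so each piece could live in its own building block.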

hannesdatta commented 3 years ago

NOTES 2

Content

Web scraping

Data management

Ethics

Topics for the guide

Web scraping

Saving and writing locally and remotely (databases, file-based systems)

: Data Management and Deployment in Production