Develop sitemap for building blocks

hannesdatta commented 3 years ago

@RoyKlaasseBos, please develop a menu hierarchy for our building blocks and code snippets. The dev branch already has a couple of examples.

There are several example documents to start from, which I am posting below. There's some "housekeeping"/"integration" to be done.

hannesdatta commented 3 years ago

NOTES 1

1. Retrieving data
    a. Making a connection
        i. Overview/principle
            - Visible browser
            - Headless browser
            - API
        ii. Tools
            - Python
            - R
            - Commercial tools
    b. Locating content
        i. Overview/principles
            - From source code to elements of interest
            - Stability of selectors
        ii. CSS
        iii. XPATH
        iv. Regular expressions
    c. Saving data
        i. Textual data
            - Parsing to CSV
            - Cleaning on-the-fly (e.g., also anonymization)
        ii. Image data
            - Save as JPG/PNG
        iii. Other data types
    d. Navigation and interaction with site
        i. Principles/overview
            - Navigate to sample units
            - Navigate within sample units (pagination)
            - Retrieval limits
        ii. Navigation tech
            - Navigate by URL
                a. Base URLs
                b. Parameters in URLs (increasing &limit)
            - Navigate by clicking
            - Navigate by scrolling
            - Interacting with forms
            - Waiting
            - Changing IP addresses
        iii. Looping
            - …
2. Storing data
    a. Principles/overview
        i. Prototyping; during vs. after the scrape
        ii. Storage of raw data: yes vs. no
        iii. Parsing on-the-fly vs. after scrape
        iv. Storage technology
            - Considerations: stability, scalability, backup, assessing space requirements
            - Storage location: Local vs. Remote
            - Storage type: Files vs. Databases
            - Storage format: JSON/tabular, CSV, …
    b. File-based storage
        i. Local
            - Directory structure
            - Automated zipping and wiping
        ii. Remote
            - Mirroring local directory structure to cloud
    c. Database storage
        i. Local
            - Structured databases (tabular data)
            - Unstructured databases (JSON data)
        ii. Remote
            - Structured databases (tabular data)
            - Unstructured databases (JSON data)
    d. Meta data enrichment
        i. Time stamps
            - Time of job initiation
            - Time of actual scrape (unixtime)
        ii. IP address
        iii. Experimental conditions
        iv. Screenshots
3. Deployment of data collection
    a. Execution of data collection
        i. Infrastructure
            - Local
            - Remotely (EC2, …)
        ii. Scheduling
            - Frequency of execution
            - Triggers
            - Restarts after errors
    b. Storage considerations
        i. Assessing space requirements
        ii. Storage location
            - Locally
            - Remotely
        iii. Storage type
            - File-based approach
            - Database approach
                a. Structured database
                b. Unstructured database
        iv. After scraping
            - Typically, in flattened CSV or JSON files (“text”); files; images
    c. Monitoring and Handling Errors
        i. Execution / triggers
        ii. Collected raw data
        iii. Parsed data
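
To make the outline more concrete, here is a minimal sketch of what a building block for "Navigate by URL" (base URL plus parameters) combined with "Meta data enrichment" (time stamps) could look like. Everything in it is an assumption for illustration: the base URL `https://example.com/api/items`, the `limit`/`offset` parameter names, and the helper names `build_page_urls` and `enrich` are hypothetical and not taken from the dev branch. It makes no network requests.

```python
import time
from urllib.parse import urlencode


def build_page_urls(base_url, limit, n_pages):
    """Return one URL per page by stepping a (hypothetical) offset parameter."""
    urls = []
    for page in range(n_pages):
        params = {"limit": limit, "offset": page * limit}
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


def enrich(record, job_started_at):
    """Return a copy of the record with two time stamps attached.

    job_started_at: unix time at which the whole job was initiated
    scraped_at:     unix time at which this particular record was collected
    """
    return {
        **record,
        "job_started_at": job_started_at,
        "scraped_at": int(time.time()),
    }


# Hypothetical usage: three pages of 50 items each.
job_started_at = int(time.time())
urls = build_page_urls("https://example.com/api/items", limit=50, n_pages=3)
rows = [enrich({"url": url}, job_started_at) for url in urls]
```

Keeping the two time stamps separate means a long-running job can still be grouped by its start time while each record keeps the moment it was actually collected.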

hannesdatta commented 3 years ago

NOTES 2

Content

- Database technology
    - Structured: MySQL, Google BigQuery
    - Unstructured: Amazon DynamoDB, MongoDB
- File-based systems
    - Local: SSD, HDD
    - Server/cloud: S3, FTP, Google Drive
- Per database
    - Schema/design
    - Extracting and writing data
    - Indexing
    - Maintenance
- ShinyApps / Interactive dashboards
- Scraper 1: Static scraper, single-machine, no database (two versions: either with or without browser); backward-looking versus forward-looking
- Scraper 2: Dynamic scraper, with database connection + monitoring
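
As a rough idea of the smallest version of "Scraper 1" (static, single machine, no database, no browser), the sketch below locates elements in HTML and parses them to CSV using only the Python standard library. The sample HTML, the `h2.title` selector, and the names `TitleParser`, `parse_titles`, and `to_csv` are assumptions made up for this example; a real building block would fetch the page first and would likely use a dedicated parsing library.

```python
import csv
import io
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


def parse_titles(html):
    """Return the list of title strings found in the HTML source."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


def to_csv(titles):
    """Serialize titles to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["position", "title"])
    for position, title in enumerate(titles, start=1):
        writer.writerow([position, title])
    return buffer.getvalue()


# Hypothetical page source standing in for a downloaded page.
SAMPLE_HTML = (
    '<div><h2 class="title">First item</h2>'
    '<h2 class="title">Second item</h2><h2>Ignored</h2></div>'
)
titles = parse_titles(SAMPLE_HTML)
csv_text = to_csv(titles)
```

Writing to an in-memory buffer keeps the parsing step separate from the storage decision (local file, cloud, or database), which matches the split between retrieving and storing data in the notes.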

: Data Management and Deployment in Production