hannesdatta closed 3 years ago
NOTES 1
b. File-based storage
   i. Local
      Directory structure
      Automated zipping and wiping
   ii. Remote
      Mirroring local directory structure to cloud
c. Database storage
   i. Local
      Structured databases (tabular data)
      Unstructured databases (JSON data)
   ii. Remote
      Structured databases (tabular data)
      Unstructured databases (JSON data)
d. Meta data enrichment
   i. Time stamps
      Time of job initiation
      Time of actual scrape (unixtime)
   ii. IP address
   iii. Experimental conditions
   iv. Screenshots
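To make the file-based storage and meta data items above concrete, here is a minimal sketch of what such a building block could look like. The function names (`enrich`, `append_record`), the field names, and the `root/YYYY/MM/DD/records.jsonl` layout are illustrative assumptions, not something already agreed on in this issue.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


def enrich(payload, job_started_at, condition=None):
    """Wrap a scraped payload with meta data (hypothetical field names)."""
    return {
        "job_started_at": job_started_at,     # time of job initiation
        "scraped_at_unix": int(time.time()),  # time of actual scrape (unixtime)
        "condition": condition,               # experimental condition, if any
        "data": payload,
    }


def append_record(record, root):
    """Append one record as a JSON line under root/YYYY/MM/DD/records.jsonl."""
    day = datetime.fromtimestamp(record["scraped_at_unix"], tz=timezone.utc)
    target_dir = Path(root) / day.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "records.jsonl"
    with target.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return target
```

One JSON line per observation keeps the raw data append-only, and the dated folders can later be zipped, wiped, or mirrored to a remote location one day at a time.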
Deployment of data collection
a. Execution of data collection
   i. Infrastructure
      Local
      Remotely (EC2, …)
   ii. Scheduling
      Frequency of execution
      Triggers
      Restarts after errors
b. Storage considerations
   i. Assessing space requirements
   ii. Storage location
      Locally
      Remotely
   iii. Storage type
      File-based approach
      Database approach
         a. Structured database
         b. Unstructured database
   iv. After scraping
      • Typically, in flattened CSV or JSON files (“text”); files; images
c. Monitoring and Handling Errors
   i. Execution / triggers
   ii. Collected raw data
   iii. Parsed data
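For the "Restarts after errors" item under scheduling, a possible snippet is sketched below. It assumes a simple retry with exponential backoff; `run_with_restarts` and its parameters are hypothetical names, and a production setup might rely on the scheduler itself (cron, a process supervisor) instead.

```python
import time


def run_with_restarts(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run job(); on an exception, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up so that monitoring can see the failure
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            sleep(delay)
```

Passing `sleep` as a parameter is only there to make the function easy to test without actually waiting.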
NOTES 2
Database technology
   Structured: MySQL, Google BigQuery
   Unstructured: Amazon DynamoDB, MongoDB
File-based systems
   Local: SSD, HDD
   Server/cloud: S3, FTP, Google Drive
Per database
   Schema/design
   Extracting and writing data
   Indexing
   Maintenance
ShinyApps / Interactive dashboards
Scraper 1: Static scraper, single-machine, no database (two versions: either with or without browser); backward-looking versus forward-looking
Scraper 2: Dynamic scraper, with database connection + monitoring
Writing to file
Writing to S3
SQL - write
SQL - read
Technologies to watch
: Data Management and Deployment in Production
Software Stack
Computing Infrastructure
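As a starting point for the "SQL - write" and "SQL - read" snippets listed above, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in so that it runs without a server; the notes mention MySQL and BigQuery, which would need their own connectors. The table and column names (`observations`, `scraped_at_unix`, `url`, `price`) are made up for illustration.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS observations (
               id INTEGER PRIMARY KEY,
               scraped_at_unix INTEGER NOT NULL,
               url TEXT NOT NULL,
               price REAL
           )"""
    )
    return conn


def write_rows(conn, rows):
    """SQL - write: insert (scraped_at_unix, url, price) tuples in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO observations (scraped_at_unix, url, price) VALUES (?, ?, ?)",
            rows,
        )


def read_since(conn, unix_ts):
    """SQL - read: return (url, price) for rows scraped at or after unix_ts."""
    cur = conn.execute(
        "SELECT url, price FROM observations WHERE scraped_at_unix >= ? ORDER BY id",
        (unix_ts,),
    )
    return cur.fetchall()
```

Parameterized queries (the `?` placeholders) keep scraped strings from being interpreted as SQL, which matters because the input comes straight from the web.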
@RoyKlaasseBos, please develop a menu hierarchy for our building blocks and code snippets. The dev branch already has a couple of examples.
There are several example documents to start from, which I am posting below. There's some "housekeeping"/"integration" to be done.