Closed by kaljuvee 1 year ago
Scraper ⇒ S3 ⇒ Transform ⇒ S3 ⇒ Ingestion ⇒ Database
Raw data S3 path structure:
s3://{raw_bucket_name}/{scrape-date-YYmmdd}/...
Transformed data S3 path structure:
s3://{transformed_bucket_name}/{scrape-date-YYmmdd}/{year}/{month}/{day}/{country}/...
Where year/month/day are the date of the data, not the date of scraping.
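The path conventions above can be sketched as small helpers. The bucket layout and the `YYmmdd` scrape-date prefix follow the structure described; the function names and the example buckets/country are illustrative, not part of the spec.

```python
from datetime import date

def raw_prefix(raw_bucket: str, scrape_date: date) -> str:
    """S3 prefix for raw scraper output, partitioned by scrape date (YYmmdd)."""
    return f"s3://{raw_bucket}/{scrape_date:%y%m%d}/"

def transformed_prefix(bucket: str, scrape_date: date,
                       data_date: date, country: str) -> str:
    """S3 prefix for transformed data: scrape date first, then the date
    the data refers to (not the scrape date), then the country."""
    return (f"s3://{bucket}/{scrape_date:%y%m%d}/"
            f"{data_date:%Y}/{data_date:%m}/{data_date:%d}/{country}/")

print(raw_prefix("raw-bucket", date(2023, 5, 2)))
# s3://raw-bucket/230502/
print(transformed_prefix("transformed-bucket", date(2023, 5, 2),
                         date(2023, 4, 30), "EE"))
# s3://transformed-bucket/230502/2023/04/30/EE/
```

Keeping both the scrape date and the data date in the key lets a later scrape of the same data date land under a new prefix instead of overwriting the earlier one.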
De-duplication can happen during ingestion using an UPSERT approach: rows that collide on a unique key update the existing record instead of creating a duplicate.
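A minimal sketch of the UPSERT-based de-duplication, using SQLite's `INSERT ... ON CONFLICT DO UPDATE` (PostgreSQL's syntax is near-identical). The table, columns, and unique key are illustrative assumptions, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prices (
        country   TEXT NOT NULL,
        data_date TEXT NOT NULL,
        price     REAL NOT NULL,
        PRIMARY KEY (country, data_date)  -- natural key used for de-duplication
    )
""")

rows = [
    ("EE", "2023-04-30", 10.0),
    ("EE", "2023-04-30", 12.5),  # same key from a re-scrape: updates in place
    ("LV", "2023-04-30", 9.0),
]

# UPSERT: insert new rows, overwrite the existing row on a key conflict
conn.executemany("""
    INSERT INTO prices (country, data_date, price)
    VALUES (?, ?, ?)
    ON CONFLICT (country, data_date) DO UPDATE SET price = excluded.price
""", rows)

print(conn.execute("SELECT country, price FROM prices ORDER BY country").fetchall())
# [('EE', 12.5), ('LV', 9.0)]
```

Because the operation is idempotent, re-running an ingestion job over the same S3 prefix is safe.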
Initially we can run the jobs manually; as the volume of scraped data grows, we can orchestrate them with Airflow or a similar scheduler.
Define high-level architecture for the data flow