Closed by kaljuvee 1 year ago
Scraper ⇒ S3 ⇒ Transform ⇒ S3 ⇒ Ingestion ⇒ Database
Raw data S3 path structure:
s3://{raw_bucket_name}/{scrape-date-YYmmdd}/...
Transformed data S3 path structure:
s3://{transformed_bucket_name}/{scrape-date-YYmmdd}/{year}/{month}/{day}/{country}/...
Where year/month/day are the date of the data, not the date of scraping.
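The path conventions above can be sketched as small helpers. The bucket layout and the `YYmmdd` scrape-date prefix follow the structure described; the function names and the example buckets/country are illustrative, not part of the spec.

```python
from datetime import date

def raw_prefix(raw_bucket: str, scrape_date: date) -> str:
    """S3 prefix for raw scraper output, partitioned by scrape date (YYmmdd)."""
    return f"s3://{raw_bucket}/{scrape_date:%y%m%d}/"

def transformed_prefix(bucket: str, scrape_date: date,
                       data_date: date, country: str) -> str:
    """S3 prefix for transformed data: scrape date first, then the date
    the data refers to (not the scrape date), then the country."""
    return (f"s3://{bucket}/{scrape_date:%y%m%d}/"
            f"{data_date:%Y}/{data_date:%m}/{data_date:%d}/{country}/")

print(raw_prefix("raw-bucket", date(2023, 5, 2)))
# s3://raw-bucket/230502/
print(transformed_prefix("transformed-bucket", date(2023, 5, 2),
                         date(2023, 4, 30), "EE"))
# s3://transformed-bucket/230502/2023/04/30/EE/
```

Keeping both the scrape date and the data date in the key lets a later scrape of the same data date land under a new prefix instead of overwriting the earlier one.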
De-duplication can happen during ingestion using an UPSERT approach: rows that collide on a unique key update the existing record instead of creating a duplicate.
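A minimal sketch of the UPSERT-based de-duplication, using SQLite's `INSERT ... ON CONFLICT DO UPDATE` (PostgreSQL's syntax is near-identical). The table, columns, and unique key are illustrative assumptions, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prices (
        country   TEXT NOT NULL,
        data_date TEXT NOT NULL,
        price     REAL NOT NULL,
        PRIMARY KEY (country, data_date)  -- natural key used for de-duplication
    )
""")

rows = [
    ("EE", "2023-04-30", 10.0),
    ("EE", "2023-04-30", 12.5),  # same key from a re-scrape: updates in place
    ("LV", "2023-04-30", 9.0),
]

# UPSERT: insert new rows, overwrite the existing row on a key conflict
conn.executemany("""
    INSERT INTO prices (country, data_date, price)
    VALUES (?, ?, ?)
    ON CONFLICT (country, data_date) DO UPDATE SET price = excluded.price
""", rows)

print(conn.execute("SELECT country, price FROM prices ORDER BY country").fetchall())
# [('EE', 12.5), ('LV', 9.0)]
```

Because the operation is idempotent, re-running an ingestion job over the same S3 prefix is safe.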
Initially we can run the jobs manually; as the volume of scraped data grows, we can orchestrate them with Airflow or a similar scheduler.
Define high-level architecture for the data flow