kanvas-ai / artindex

Art Index
GNU General Public License v3.0

Define update logic / data flow #16

Closed kaljuvee closed 1 year ago

kaljuvee commented 1 year ago

Define the high-level architecture for the data flow.

battlesnake commented 1 year ago
Scraper ⇒ S3 ⇒ Transform ⇒ S3 ⇒ Ingestion ⇒ Database

Raw data S3 path structure:

s3://{raw_bucket_name}/{scrape-date-YYmmdd}/...

Transformed data S3 path structure:

s3://{transformed_bucket_name}/{scrape-date-YYmmdd}/{year}/{month}/{day}/{country}/...

Where year/month/day are the date of the data, not the date of scraping.
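
As a rough illustration of the two layouts above, the keys could be built like this. It is only a sketch under assumptions: `YYmmdd` is read as two-digit year, month, day; month and day in the data-date part are zero-padded; and the trailing `...` is stood in for by a hypothetical `filename` argument. The function names are made up for the example.

```python
from datetime import date


def scrape_prefix(scrape_date: date) -> str:
    # Assumption: "YYmmdd" means two-digit year, month, day, e.g. 230514.
    return scrape_date.strftime("%y%m%d")


def raw_key(scrape_date: date, filename: str) -> str:
    # Raw layout: {scrape-date-YYmmdd}/...
    # `filename` is a hypothetical stand-in for the elided tail of the path.
    return f"{scrape_prefix(scrape_date)}/{filename}"


def transformed_key(scrape_date: date, data_date: date, country: str, filename: str) -> str:
    # Transformed layout: {scrape-date-YYmmdd}/{year}/{month}/{day}/{country}/...
    # year/month/day come from the data itself, not from the scrape run.
    # Assumption: month and day are zero-padded so keys sort chronologically.
    return (
        f"{scrape_prefix(scrape_date)}/"
        f"{data_date.year:04d}/{data_date.month:02d}/{data_date.day:02d}/"
        f"{country}/{filename}"
    )


def s3_uri(bucket: str, key: str) -> str:
    # Join a bucket name and a key into an s3:// URI.
    return f"s3://{bucket}/{key}"
```

For example, `s3_uri("my-transformed-bucket", transformed_key(date(2023, 5, 14), date(2021, 11, 3), "EE", "lots.json"))` would place a sale dated 2021-11-03 under the 2023-05-14 scrape run. The bucket name, country code and file name here are placeholders.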

De-duplication can happen during ingestion using an UPSERT approach.
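
To make the UPSERT idea concrete, here is a minimal sketch using SQLite from the Python standard library (it needs SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`). The actual database, the table name `artwork_sales`, the columns, and the choice of `(source, lot_id)` as the natural key are all assumptions for illustration, not something decided in this issue.

```python
import sqlite3


def ingest(conn: sqlite3.Connection, rows: list) -> None:
    # Hypothetical schema: one row per (source, lot_id); this pair is the
    # assumed natural key that identifies a duplicate record.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS artwork_sales (
            source     TEXT NOT NULL,
            lot_id     TEXT NOT NULL,
            price      REAL,
            scraped_on TEXT NOT NULL,
            PRIMARY KEY (source, lot_id)
        )
        """
    )
    # Insert new rows; when the key already exists, overwrite it only if the
    # incoming row comes from the same or a later scrape (ISO dates compare
    # correctly as strings).
    conn.executemany(
        """
        INSERT INTO artwork_sales (source, lot_id, price, scraped_on)
        VALUES (:source, :lot_id, :price, :scraped_on)
        ON CONFLICT (source, lot_id) DO UPDATE SET
            price = excluded.price,
            scraped_on = excluded.scraped_on
        WHERE excluded.scraped_on >= artwork_sales.scraped_on
        """,
        rows,
    )
    conn.commit()
```

With this shape, re-ingesting the same transformed file is harmless: the row count stays the same, and an older scrape cannot overwrite a newer one.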

Initially we can run the jobs manually, but as the amount of scraped data grows, we can eventually orchestrate them using Airflow or similar.
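
A manual run could be as simple as calling the three stages in order for one scrape date. The sketch below uses plain dicts as stand-ins for the two S3 buckets and the database, and the stage functions and sample record are invented for illustration; real stages would read and write S3 and a real database instead.

```python
import json


def scrape_stage(raw_store: dict, scrape_date: str) -> None:
    # Placeholder scraper: writes one raw object containing a duplicated record.
    records = [{"lot": "A1", "price": "100"}, {"lot": "A1", "price": "100"}]
    raw_store[f"{scrape_date}/page-1.json"] = json.dumps(records)


def transform_stage(raw_store: dict, transformed_store: dict, scrape_date: str) -> None:
    # Read every raw object for this scrape date and normalise field types.
    for key, body in raw_store.items():
        if key.startswith(f"{scrape_date}/"):
            cleaned = [
                {"lot": r["lot"], "price": float(r["price"])}
                for r in json.loads(body)
            ]
            transformed_store[key] = json.dumps(cleaned)


def ingest_stage(transformed_store: dict, database: dict, scrape_date: str) -> None:
    # Keying the "database" by lot makes writes behave like an upsert,
    # so duplicates and re-runs collapse into a single record.
    for key, body in transformed_store.items():
        if key.startswith(f"{scrape_date}/"):
            for r in json.loads(body):
                database[r["lot"]] = r["price"]


def run_pipeline(scrape_date: str, raw_store: dict, transformed_store: dict, database: dict) -> None:
    # Scraper => S3 => Transform => S3 => Ingestion => Database, run by hand.
    scrape_stage(raw_store, scrape_date)
    transform_stage(raw_store, transformed_store, scrape_date)
    ingest_stage(transformed_store, database, scrape_date)
```

Because each stage only talks to the next through storage keyed by scrape date, the same functions could later be wrapped as separate tasks in an orchestrator without changing their logic.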