tusharchou / local-data-platform

Python library for an Iceberg lakehouse on your local machine

[!NOTE] Hi Followers, thank you for taking the time to read me. The links below should make it easier to understand the scope and progress of the project:

  1. Projects
  2. Milestones
  3. Issues
  4. Pull Requests
  5. Wiki
  6. Documentation

Plan

Milestone   Epic                   Target Date
0.0.1       Ready for Feedback     1st Oct 24
1.0.0       Ready for Production   1st Nov 24

Milestone

#6

Local Data Platform

Business information systems require fresh data every day, organised so that retrieval is cost-effective. Building a local data platform requires a setup where you can recreate production use cases and develop new pipelines.

Problem Statement

What? : a local data platform that can scale up to the cloud
Why? : save costs on cloud infrastructure and development time
When? : at the start of the product development life cycle
Where? : local first
Who? : businesses that want a product data platform that runs locally and scales up when the time comes

A Python library that uses open-source tools to orchestrate data platform operations locally for development and testing.

Components

  1. Orchestrator
    • cron
    • Airflow
  2. Source
    • APIs
    • Files
  3. Target
    • Iceberg
    • DuckDB
    • Space and Time
  4. Catalog
    • Rest
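
One way to picture how these components fit together is a small pipeline description that picks one option per role. This is only a sketch: the option strings are taken from the list above, but `COMPONENTS`, `PipelineSpec`, and its field names are hypothetical and are not the library's API.

```python
from dataclasses import dataclass

# Component options copied from the list above. The names used here are
# hypothetical and only illustrate the idea; they are not the library's API.
COMPONENTS = {
    "orchestrator": {"cron", "airflow"},
    "source": {"api", "file"},
    "target": {"iceberg", "duckdb", "space_and_time"},
    "catalog": {"rest"},
}


@dataclass(frozen=True)
class PipelineSpec:
    """One choice per component role for a local pipeline."""

    orchestrator: str
    source: str
    target: str
    catalog: str = "rest"

    def __post_init__(self):
        # Reject any choice that is not in the component list.
        for role, allowed in COMPONENTS.items():
            value = getattr(self, role)
            if value not in allowed:
                raise ValueError(f"unknown {role}: {value!r}")


# A cron-driven pipeline that reads files and writes to Iceberg.
spec = PipelineSpec(orchestrator="cron", source="file", target="iceberg")
```

Validating the choices up front means a typo in a component name fails when the pipeline is described, not halfway through a run.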

Source

Bulk data

Data can be available as a single file in the source format. For example, the New York yellow taxi trip data for January 2023 can be pulled as a Parquet file:

curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
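
The same download could be done from Python when the pull is part of a pipeline step. The sketch below uses only the standard library; `download_file` is a hypothetical helper name, not a function provided by this library.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: str) -> Path:
    """Stream the file at `url` to `dest` and return the destination path.

    Hypothetical helper for illustration; not part of the library.
    """
    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response to disk in chunks so a large file is never
    # held in memory all at once.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

Calling `download_file` with the URL from the curl command and `/tmp/yellow_tripdata_2023-01.parquet` as the destination should produce the same file as the command above.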

local-data-platform/

Target

CSV

Human-readable format that is easily pushed into accessible platforms like Google Sheets or Notion.
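
As a rough illustration of a CSV target, records can be written with Python's standard `csv` module. The helper name `write_csv` is hypothetical, and the sketch assumes every record has the same keys as the first one.

```python
import csv
from pathlib import Path


def write_csv(rows: list[dict], dest: str) -> Path:
    """Write a list of dict records to a CSV file with a header row.

    Hypothetical helper for illustration; not part of the library.
    """
    if not rows:
        raise ValueError("no rows to write")
    dest_path = Path(dest)
    # newline="" lets the csv module control line endings itself.
    with open(dest_path, "w", newline="", encoding="utf-8") as handle:
        # Column order follows the keys of the first record.
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return dest_path
```

The resulting file can then be imported into a spreadsheet tool by hand or through that tool's own import feature.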

References

Self Promotion