oss-know / airflow-jobs

Apache License 2.0

Transfer ClickHouse data when initializing GitHub profiles #189

Closed crystaldust closed 1 year ago

crystaldust commented 1 year ago

Currently the GitHub profile init DAG only stores data in OpenSearch; a separate ck_transfer DAG is then run to copy all GitHub profile data from OpenSearch to ClickHouse. So when the DAG runs again, it copies duplicate data into ClickHouse. It would be better to transfer the data to ClickHouse at the same time (within one DAG).

The design: each developer's profile has an updated_at field indicating when the profile was last modified. Selecting the documents in OpenSearch whose updated_at is greater than the latest updated_at in ClickHouse, and then inserting them into ClickHouse, is approximately enough.
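
A rough sketch of the selection step, not the actual DAG code. The function names are hypothetical, and it assumes updated_at is a top-level field stored as an ISO 8601 UTC string (e.g. 2022-05-01T12:00:00Z), so that string comparison matches time order; the real documents may nest the field differently. The "latest updated_at in clickhouse" would come from something like a `SELECT max(updated_at)` on the profile table (table name not shown here).

```python
from typing import Any, Dict, Iterable, List, Optional


def build_incremental_query(latest_updated_at: Optional[str]) -> Dict[str, Any]:
    """Build an OpenSearch query body for profiles newer than the watermark.

    latest_updated_at is the max updated_at already in ClickHouse,
    or None when the ClickHouse table is still empty.
    """
    if latest_updated_at is None:
        # Nothing transferred yet: select every profile.
        return {"query": {"match_all": {}}}
    # Strictly greater than the watermark, as described in the design.
    return {"query": {"range": {"updated_at": {"gt": latest_updated_at}}}}


def select_new_profiles(
    profiles: Iterable[Dict[str, Any]], latest_updated_at: Optional[str]
) -> List[Dict[str, Any]]:
    """In-memory equivalent of the range query, handy for testing the logic."""
    if latest_updated_at is None:
        return list(profiles)
    return [p for p in profiles if p["updated_at"] > latest_updated_at]


if __name__ == "__main__":
    sample = [
        {"login": "a", "updated_at": "2022-04-30T08:00:00Z"},
        {"login": "b", "updated_at": "2022-05-02T09:30:00Z"},
    ]
    watermark = "2022-05-01T00:00:00Z"
    print(build_incremental_query(watermark))
    print(select_new_profiles(sample, watermark))
```

One reason this is only "approximately" enough: with a strict greater-than, a profile whose updated_at equals the watermark but was not transferred in the previous run would be skipped, while switching to greater-or-equal would re-copy the rows at the watermark.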