oss-know / airflow-jobs

Apache License 2.0

Unify the format of git origins #165

Closed crystaldust closed 1 year ago

crystaldust commented 1 year ago

Both https://github.com/OWNER/REPO.git and https://github.com/OWNER/REPO are valid git URLs. If the origin given in the 'includes' variable for the daily sync differs from the origin stored in OpenSearch (for example, the OpenSearch doc's origin has the '.git' suffix while the one in 'includes' doesn't), there will be 2 different (owner, repo, origin) tuples for the same repository.

If we then run a full daily sync of the repo, the 2 tuples will be treated as 2 separate code bases and synced separately, introducing redundant data into OpenSearch and, from there, into ClickHouse.

The solution is to strip the '.git' suffix from the origin before initializing or syncing data.
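
A minimal sketch of that normalization, not the project's actual code: `normalize_origin` is a hypothetical helper name, and the two sample tuples only illustrate the mismatch described above.

```python
def normalize_origin(origin: str) -> str:
    """Return the origin URL without a trailing '.git' suffix.

    Hypothetical helper for illustration; the real fix may live elsewhere.
    """
    suffix = ".git"
    if origin.endswith(suffix):
        return origin[: -len(suffix)]
    return origin


# One tuple as it might be stored in OpenSearch, one as given in 'includes'.
from_opensearch = ("OWNER", "REPO", "https://github.com/OWNER/REPO.git")
from_includes = ("OWNER", "REPO", "https://github.com/OWNER/REPO")

# Without normalization the same repo appears as two distinct tuples.
raw = {from_opensearch, from_includes}

# After normalizing the origin, both collapse into a single tuple.
unified = {
    (owner, repo, normalize_origin(origin))
    for owner, repo, origin in raw
}
```

Applying the helper at every point where an origin enters the pipeline (reading 'includes', init, sync) would keep the (owner, repo, origin) key consistent.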