I created datasets for each of our GH repos that track who gave them a "star".
I then combined them all into one "community" derivative dataset.
Because some of our repos are mostly internal, they may not have any stars.
This exposed a problem: currently, if the source of a root dataset does not produce any data, the SetDataSchema event is never written, and datasets without a schema block derivative transformations from executing.
In this PR:
I changed push/pull ingest to define the schema as early as possible, using the read stage schema even when the input data is empty
I changed the DF engine to produce Parquet files even when the output DataFrame is empty, so the output schema can still be propagated to kamu
kamu now handles engines returning empty Parquet files and uses them to define the derivative dataset schema, even if the dataset stays empty
I eliminated the hack where one of the old Parquet files was passed to the engine as a "schema carrier" - instead, kamu now always writes an empty Parquet file using the schema from SetDataSchema
Checklist before requesting a review
[x] Unit and integration tests added
[x] Compatibility:
[x] Network APIs: ✅
[x] Workspace layout and metadata: ✅
[x] Configuration: ✅
[x] Container images: ❌
Requires a new DF engine image. I will consider updating the Flink, Spark, and RW engines too.