linkedin / openhouse

Open Control Plane for Tables in Data Lakehouse
https://www.openhousedb.org/
BSD 2-Clause "Simplified" License

Data layout optimization (strategy generation). Part 3: compaction strategy generation with cost/gain scores #116

Open teamurko opened 1 month ago

teamurko commented 1 month ago

Summary

This is part 3 of a new feature, the data layout optimization (DLO) library, and covers strategy generation. This PR is co-authored with @anjagruenheid.

Adds compaction strategy generation. Rewrite cost is modeled as the serial rewrite time, and rewrite gain as the time saved by reducing the number of files. This PR builds on top of https://github.com/linkedin/openhouse/pull/109
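
To make the cost/gain idea concrete, here is a minimal sketch and not the code in this PR. It assumes that cost is the total bytes to rewrite divided by a single-threaded rewrite throughput, and that gain is the number of files removed multiplied by a fixed per-file overhead. The names `score_compaction`, `rewrite_bytes_per_second`, and `per_file_overhead_seconds` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompactionScore:
    cost_seconds: float   # estimated serial rewrite time
    gain_seconds: float   # estimated time saved from having fewer files
    files_reduced: int    # file count before minus file count after


def score_compaction(
    file_sizes_bytes: List[int],
    target_file_size_bytes: int,
    rewrite_bytes_per_second: float,   # assumed serial rewrite throughput
    per_file_overhead_seconds: float,  # assumed fixed cost of touching one file
) -> CompactionScore:
    """Score rewriting all given files into files of roughly the target size."""
    if not file_sizes_bytes:
        return CompactionScore(0.0, 0.0, 0)

    total_bytes = sum(file_sizes_bytes)

    # Cost: every byte is rewritten once, one after another (serial).
    cost_seconds = total_bytes / rewrite_bytes_per_second

    # Integer ceiling division gives the file count after compaction.
    files_after = max(1, -(-total_bytes // target_file_size_bytes))
    files_reduced = max(0, len(file_sizes_bytes) - files_after)

    # Gain: each file removed saves one per-file overhead.
    gain_seconds = files_reduced * per_file_overhead_seconds

    return CompactionScore(cost_seconds, gain_seconds, files_reduced)
```

Under these assumptions, a table with many small files and a low serial rewrite time scores a high gain relative to cost, which is the kind of candidate a compaction strategy would favor.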

The following 3 components will be added eventually:
1) DLO library that has primitives for generating data layout optimization strategies
2) App that generates strategies for all tables
3) Scheduling of the app

sumedhsakdeo commented 2 weeks ago

Review posted on https://github.com/teamurko/openhouse/pull/2