qubole / rubix

Cache File System optimized for columnar formats and object stores
Apache License 2.0

Macro based whitelisting of locations allowed to cache #29

Open shubhamtagra opened 7 years ago

shubhamtagra commented 7 years ago

This is particularly useful for partitioned tables. Today, the whitelist is a regex. E.g. if a user wants to whitelist two tables stored at locations reviews and bookings under the same S3 prefix, such as s3://mybuckets/tables, they could add this to the config:

hadoop.cache.data.location.whitelist=.*mybuckets/tables/(reviews|bookings).*
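
To make the current behaviour concrete, here is a minimal Python sketch of how such a whitelist regex would accept or reject paths. The file names and the `is_cacheable` helper are hypothetical, and whether Rubix matches the full path or only searches within it is an assumption here, not something stated in this issue.

```python
import re

# Pattern copied from the config value above. Full-path matching is an
# assumption; Rubix's actual matching semantics may differ.
WHITELIST = re.compile(r".*mybuckets/tables/(reviews|bookings).*")


def is_cacheable(path: str) -> bool:
    """Return True when the whole path matches the whitelist regex."""
    return WHITELIST.fullmatch(path) is not None


# Hypothetical object paths, for illustration only.
paths = [
    "s3://mybuckets/tables/reviews/year=2016/part-0000.orc",
    "s3://mybuckets/tables/bookings/month=October/part-0000.orc",
    "s3://mybuckets/tables/users/part-0000.orc",
]

for p in paths:
    print(p, is_cacheable(p))
```

Every partition of reviews and bookings matches, which is exactly why a static regex cannot express "only the last two months".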

The problem is that if, say, bookings is partitioned by month and holds data for many months while the user only wants to cache the last two months, the user has to update this config every time the month changes. To solve that, we should support macros in this config. E.g. if reviews is partitioned yearly and bookings monthly, and the user wants to cache only the last 5 years of reviews and the last 2 months of bookings, this should be possible:

hadoop.cache.data.location.whitelist=.*mybuckets/tables/(reviews/year=$lastFiveYears$|bookings/month=$lastTwoMonthsNames$).*

Rubix should evaluate the macros $lastFiveYears$ and $lastTwoMonthsNames$ at runtime and come up with the whitelisting config as:

hadoop.cache.data.location.whitelist=.*mybuckets/tables/(reviews/year=(2016|2015|2014|2013|2012)|bookings/month=(October|September)).*
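
A rough sketch of the expansion step, written in Python for brevity (Rubix itself is Java). The macro names come from the example above; the `expand_macros` function, the `BUILTIN_MACROS` table, and the choice to leave unknown macros untouched are all assumptions made for illustration, not an existing Rubix API.

```python
import re
from datetime import date

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def last_n_years(today: date, n: int) -> list:
    """Current year first, then the n - 1 years before it."""
    return [str(today.year - i) for i in range(n)]


def last_n_month_names(today: date, n: int) -> list:
    """Current month first; the modulo wraps across a year boundary."""
    return [MONTH_NAMES[(today.month - 1 - i) % 12] for i in range(n)]


# Hypothetical built-in macro table keyed by the names used in the issue.
BUILTIN_MACROS = {
    "lastFiveYears": lambda today: last_n_years(today, 5),
    "lastTwoMonthsNames": lambda today: last_n_month_names(today, 2),
}

# A macro is an identifier (dots allowed) wrapped in dollar signs.
MACRO_PATTERN = re.compile(r"\$([A-Za-z_][\w.]*)\$")


def expand_macros(template: str, today: date) -> str:
    """Replace each known $macro$ with a regex alternation group."""
    def replace(match):
        name = match.group(1)
        if name not in BUILTIN_MACROS:
            return match.group(0)  # leave unknown macros untouched
        values = BUILTIN_MACROS[name](today)
        return "(" + "|".join(re.escape(v) for v in values) + ")"

    return MACRO_PATTERN.sub(replace, template)


template = (".*mybuckets/tables/(reviews/year=$lastFiveYears$"
            "|bookings/month=$lastTwoMonthsNames$).*")
print(expand_macros(template, date(2016, 10, 15)))
```

Passing the evaluation date in explicitly keeps the expansion deterministic and testable. A real implementation would also need to decide how often to re-evaluate the macros, for example on a timer or on each config reload.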

Rubix should provide some common functions out of the box, and the system should be extensible with user-defined macros. E.g. if a user has data partitioned by store location as s3://mybuckets/tables/stores/location=xyz and wants to cache data only for stores in Bangalore and Pune, they should be able to write a custom function to do it, add that jar, and use it in the whitelist as:

hadoop.cache.data.location.whitelist=.*mybuckets/tables/stores/location=$com.myCompany.rubix.myCustomStoreSelector$.*
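
One possible shape for the extension point, again as a Python sketch rather than actual Rubix code. In Java the macro name would presumably be resolved as a class on the classpath; here a plain registry dictionary stands in for that lookup, and `register_macro`, `store_selector`, and `expand` are hypothetical names.

```python
import re
from typing import Callable, Dict, List

# A user-defined macro is modelled as a zero-argument function that
# returns the literal values to allow.
MacroFn = Callable[[], List[str]]

# Stand-in for class loading: maps a fully qualified name to a function.
_REGISTRY: Dict[str, MacroFn] = {}


def register_macro(name: str, fn: MacroFn) -> None:
    _REGISTRY[name] = fn


def store_selector() -> List[str]:
    """Hypothetical user code: the store locations worth caching."""
    return ["Bangalore", "Pune"]


register_macro("com.myCompany.rubix.myCustomStoreSelector", store_selector)

MACRO_PATTERN = re.compile(r"\$([A-Za-z_][\w.]*)\$")


def expand(template: str) -> str:
    """Replace each registered $macro$ with a regex alternation group."""
    def replace(match):
        fn = _REGISTRY.get(match.group(1))
        if fn is None:
            return match.group(0)  # unknown macro: leave as written
        return "(" + "|".join(re.escape(v) for v in fn()) + ")"

    return MACRO_PATTERN.sub(replace, template)


expanded = expand(".*mybuckets/tables/stores/location="
                  "$com.myCompany.rubix.myCustomStoreSelector$.*")
print(expanded)
```

Escaping each returned value before joining means a location containing regex metacharacters cannot change the meaning of the surrounding whitelist pattern.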

damianmontenegro-upwork commented 3 years ago

Are there any updates on this feature?

goelrajat commented 2 years ago

Is there any plan to implement this feature in Rubix?