Open hackeryang opened 2 years ago
@hackeryang Thanks for your contribution, Feel free to close this issue as your PR is merged.
CC: @nmahadevuni
Thank you @rohanpednekar. The ORC metadata cache still doesn't support timely invalidation; only the Parquet metadata cache supports that right now. I will try to make a PR for ORC later.
Closing this issue is fine; I can reopen it later when I contribute the ORC metadata cache work.
First, thanks to our community and all contributors. We are using the file/stripe footer cache for ORC and Parquet files mentioned in this issue: https://github.com/prestodb/presto/issues/13205
However, I realized that the cached values may become dirty and incorrect under some conditions. For example, suppose we enable the metadata cache parameters in the
hive.properties
of the worker nodes. Then I submit the same query multiple times, so the ORC or Parquet metadata cache naturally fills with entries for the relevant Hive files.
Unluckily, 10 minutes later a daily scheduled ETL job (such as a Hive or Spark insert job) rewrites some partitions of the relevant Hive tables. The cached values in our PrestoDB workers are now stale or even incorrect, but the existing mechanism does not detect the change and invalidate the cache promptly.
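For reference, worker-side metadata cache settings in hive.properties look roughly like the sketch below. The property names and values here are assumptions based on the footer cache feature discussed in issue #13205; check the Hive connector documentation for your Presto version before copying them:

```properties
# Hypothetical example only -- verify property names against your Presto version.
# Cache ORC file tails (footers) on the workers.
hive.orc.file-tail-cache-enabled=true
hive.orc.file-tail-cache-size=100MB
hive.orc.file-tail-cache-ttl-since-last-access=10m

# Cache Parquet file metadata on the workers.
hive.parquet.metadata-cache-enabled=true
hive.parquet.metadata-cache-size=100MB
hive.parquet.metadata-cache-ttl-since-last-access=10m
```

With TTL-based expiry alone, an entry written just before an ETL job rewrites the partition can stay stale for up to the full TTL, which is exactly the window described above.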
We are glad to discuss this with anyone interested, and if necessary, we can try to contribute our code to improve this.
One possible solution is for the worker nodes to also cache the
file modification time
carried in the Hive splits they receive from the coordinator the first time. On later reads, compare the cached modification time with the modification time sent by the coordinator again; if they are not equal, the cached entry is dirty and should be invalidated. Thank you all again.
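The validation step described above can be sketched as a small cache keyed by file path, where each entry remembers the modification time observed when it was inserted. This is an illustrative sketch only; the class and field names (`FooterCache`, `CachedFooter`, `fileModificationTime`) are hypothetical and are not Presto's real internal classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of modification-time-validated footer caching.
class FooterCache {
    static final class CachedFooter {
        final long fileModificationTime; // mtime from the HiveSplit at insert time
        final byte[] footer;             // cached ORC/Parquet footer bytes
        CachedFooter(long mtime, byte[] footer) {
            this.fileModificationTime = mtime;
            this.footer = footer;
        }
    }

    private final Map<String, CachedFooter> cache = new ConcurrentHashMap<>();

    // Return the cached footer only if the modification time carried by the
    // incoming split matches the one recorded at insertion. A mismatch means
    // the file was rewritten (e.g. by an ETL job), so the entry is dropped.
    byte[] get(String path, long splitModificationTime) {
        CachedFooter entry = cache.get(path);
        if (entry == null) {
            return null;
        }
        if (entry.fileModificationTime != splitModificationTime) {
            cache.remove(path); // dirty entry: invalidate immediately
            return null;
        }
        return entry.footer;
    }

    void put(String path, long modificationTime, byte[] footer) {
        cache.put(path, new CachedFooter(modificationTime, footer));
    }

    public static void main(String[] args) {
        FooterCache cache = new FooterCache();
        cache.put("/warehouse/t/part=1/f0.orc", 1000L, new byte[]{1});
        // Same mtime as when cached: hit.
        System.out.println(cache.get("/warehouse/t/part=1/f0.orc", 1000L) != null); // prints: true
        // ETL rewrote the file (new mtime): entry is treated as dirty, miss.
        System.out.println(cache.get("/warehouse/t/part=1/f0.orc", 2000L) == null); // prints: true
    }
}
```

Because the coordinator already resolves file status when listing splits, shipping the modification time inside each split adds no extra filesystem calls on the workers; the comparison is a cheap equality check on the cache-read path.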