Open selitvin opened 5 years ago
To follow up with the first approach (Caching entire row group before predicate evaluation)
Could the following be an option for PyDictReaderWorker
?
def _local_cache(func):
def wrapper(*args, **kwargs):
(reader_worker, piece, pq_file, column_names, shuffle_row_drop_partition) = args
cache = reader_worker._local_cache
dataset_path = reader_worker._dataset_path
if not isinstance(cache, NullCache):
dataset_path_hash = hashlib.md5(dataset_path.encode('utf-8')).hexdigest()
column_names_hash = hashlib.md5(("_".join(column_names)).encode('utf-8')).hexdigest()
cache_key = '{}:{}:{}'.format(dataset_path_hash, piece.path, column_names_hash)
records = cache.get(cache_key, lambda: func(*args, **kwargs))
return records
else:
func(*args, **kwargs)
return wrapper
And then, within PyDictReaderWorker
:
@_local_cache
def _read_with_shuffle_row_drop(self, piece, pq_file, column_names, shuffle_row_drop_partition):
...
This way the results are cached before predicate is applied and the data is cached in encoded form which takes less space.
Something along these lines would be good. I would advise against making it a decorator. It is hardly reusable in other places as it assumes a lot about the arguments and their order. If one would add/reorder argument of this function, it would be hard to trace the code down to the wrapper
in order to make an adjustment. I think just a plain Python, without use of decorator implementation is preferable. Otherwise, I think the logic is good.
Are you planning to submit a PR?
Follow up on #306
Currently, we don't support using predicates with cache enabled.
Two possible alternatives for implementation:
I tend to think the first alternative is preferable since it leads to a more predictable behavior (no need for a predicate function to be pure) and seems to serve the scenario of reducing network traffic well. The cost of the local disk storage (especially if sharding and rowgroup selectors are used to reduce the amount of rowgroups scheduled for loading) seems less important.