uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Support local cache when predicates are used #309

Open selitvin opened 5 years ago

selitvin commented 5 years ago

Follow up on #306

Currently, we don't support using predicates with cache enabled.

Two possible alternatives for implementation:

1. Cache the entire rowgroup before the predicate is evaluated.
2. Cache only the rows that pass the predicate (this requires the predicate function to be pure).

I tend to think the first alternative is preferable since it leads to a more predictable behavior (no need for a predicate function to be pure) and seems to serve the scenario of reducing network traffic well. The cost of the local disk storage (especially if sharding and rowgroup selectors are used to reduce the amount of rowgroups scheduled for loading) seems less important.

GregAru commented 5 years ago

To follow up on the first approach (caching the entire rowgroup before predicate evaluation): could the following be an option for PyDictReaderWorker?

import hashlib

from petastorm.cache import NullCache


def _local_cache(func):
    def wrapper(*args, **kwargs):
        (reader_worker, piece, pq_file, column_names, shuffle_row_drop_partition) = args
        cache = reader_worker._local_cache
        dataset_path = reader_worker._dataset_path
        if not isinstance(cache, NullCache):
            # Key on the dataset path, the rowgroup (piece) path and the
            # requested columns, so different column selections don't collide.
            dataset_path_hash = hashlib.md5(dataset_path.encode('utf-8')).hexdigest()
            column_names_hash = hashlib.md5(("_".join(column_names)).encode('utf-8')).hexdigest()
            cache_key = '{}:{}:{}'.format(dataset_path_hash, piece.path, column_names_hash)
            records = cache.get(cache_key, lambda: func(*args, **kwargs))
            return records
        else:
            return func(*args, **kwargs)
    return wrapper

And then, within PyDictReaderWorker:

@_local_cache
def _read_with_shuffle_row_drop(self, piece, pq_file, column_names, shuffle_row_drop_partition):
    ...

This way the results are cached before the predicate is applied, and the data is cached in encoded form, which takes less space.

selitvin commented 5 years ago

Something along these lines would be good. I would advise against making it a decorator, though: it is hardly reusable elsewhere, since it assumes a lot about the arguments and their order. If someone added or reordered the arguments of the wrapped function, it would be hard to trace the code down to the wrapper to make the matching adjustment. I think plain Python, without a decorator, is preferable. Otherwise, the logic looks good.

Are you planning to submit a PR?