huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.26k stars 2.7k forks source link

.map() cache is wrongfully reused - only happens when the mapping function is imported #3297

Open eladsegal opened 2 years ago

eladsegal commented 2 years ago

Describe the bug

When .map is used with a mapping function that is imported, the cache is reused even if the mapping function has been modified. The reason for this is that dill that is used for creating the fingerprint pickles imported functions by reference.

I guess it is not a widespread case, but it can still lead to unwanted results unnoticeably.

Steps to reproduce the bug

Create files a.py and b.py:

# a.py
from datasets import load_dataset

def main():
    squad = load_dataset("squad")
    squad.map(mapping_func, batched=True)

def mapping_func(examples):
    ID_LENGTH = 4
    examples["id"] = [id_[:ID_LENGTH] for id_ in examples["id"]]
    return examples

if __name__ == "__main__":
    main()
# b.py
from datasets import load_dataset
from a import mapping_func

def main():
    squad = load_dataset("squad")
    squad.map(mapping_func, batched=True)

if __name__ == "__main__":
    main()

Run python b.py twice: In the first run you will see tqdm bars showing that the data is processed, and in the second run you will see "Loading cached processed dataset at...". Now change ID_LENGTH to another number in order to change the mapping function, and run python b.py again. You'll see that .map loads from the cache the result of the previous mapping function.

Expected results

Run python a.py twice: In the first run you will see tqdm bars showing that the data is processed, and in the second run you will see "Loading cached processed dataset at...". Now change ID_LENGTH to another number in order to change the mapping function, and run python a.py again. You'll see that the dataset is being processed and that there's no reuse of the previous mapping function result.

Workaround

Put the mapping function inside a dummy class as a static method:

# a.py
class MappingFuncClass:
    @staticmethod
    def mapping_func(examples):
        ID_LENGTH = 4
        examples["id"] = [id_[:ID_LENGTH] for id_ in examples["id"]]
        return examples
# b.py
from datasets import load_dataset
from a import MappingFuncClass

def main():
    squad = load_dataset("squad")
    squad.map(MappingFuncClass.mapping_func, batched=True)

if __name__ == "__main__":
    main()

Environment info

lhoestq commented 2 years ago

Hi ! Thanks for reporting. Indeed this is a current limitation of the usage we have of dill in datasets. I'd suggest you use your workaround for now until we find a way to fix this. Maybe functions that are not coming from a module not installed with pip should be dumped completely, rather than only taking their locations into account

eladsegal commented 2 years ago

I agree. Sounds like a solution for it would be pretty dirty, even cloudpickle doesn't help in this case. In the meanwhile I think that adding a warning and the workaround somewhere in the documentation can be helpful.

eladsegal commented 1 year ago

For anyone interested, I see that with dill==0.3.6 the workaround I suggested doesn't work anymore. I opened an issue about it: https://github.com/uqfoundation/dill/issues/572.