.map() cache is wrongfully reused - only happens when the mapping function is imported

eladsegal commented 2 years ago

Describe the bug

When .map is used with a mapping function that is imported, the cache is reused even if the mapping function has been modified. The reason for this is that dill that is used for creating the fingerprint pickles imported functions by reference.

I guess it is not a widespread case, but it can still lead to unwanted results unnoticeably.

Steps to reproduce the bug

Create files a.py and b.py:

# a.py
from datasets import load_dataset

def main():
    squad = load_dataset("squad")
    squad.map(mapping_func, batched=True)

def mapping_func(examples):
    ID_LENGTH = 4
    examples["id"] = [id_[:ID_LENGTH] for id_ in examples["id"]]
    return examples

if __name__ == "__main__":
    main()

# b.py
from datasets import load_dataset
from a import mapping_func

def main():
    squad = load_dataset("squad")
    squad.map(mapping_func, batched=True)

if __name__ == "__main__":
    main()

Run python b.py twice: In the first run you will see tqdm bars showing that the data is processed, and in the second run you will see "Loading cached processed dataset at...". Now change ID_LENGTH to another number in order to change the mapping function, and run python b.py again. You'll see that .map loads from the cache the result of the previous mapping function.

Expected results

Run python a.py twice: In the first run you will see tqdm bars showing that the data is processed, and in the second run you will see "Loading cached processed dataset at...". Now change ID_LENGTH to another number in order to change the mapping function, and run python a.py again. You'll see that the dataset is being processed and that there's no reuse of the previous mapping function result.

Workaround

Put the mapping function inside a dummy class as a static method:

# a.py
class MappingFuncClass:
    @staticmethod
    def mapping_func(examples):
        ID_LENGTH = 4
        examples["id"] = [id_[:ID_LENGTH] for id_ in examples["id"]]
        return examples

# b.py
from datasets import load_dataset
from a import MappingFuncClass

def main():
    squad = load_dataset("squad")
    squad.map(MappingFuncClass.mapping_func, batched=True)

if __name__ == "__main__":
    main()

Environment info

datasets version: 1.15.1
Platform: Linux-4.4.0-19041-Microsoft-x86_64-with-glibc2.17
Python version: 3.8.10
PyArrow version: 4.0.1

lhoestq commented 2 years ago

Hi ! Thanks for reporting. Indeed this is a current limitation of the usage we have of dill in datasets. I'd suggest you use your workaround for now until we find a way to fix this. Maybe functions that are not coming from a module not installed with pip should be dumped completely, rather than only taking their locations into account

eladsegal commented 2 years ago

I agree. Sounds like a solution for it would be pretty dirty, even cloudpickle doesn't help in this case. In the meanwhile I think that adding a warning and the workaround somewhere in the documentation can be helpful.

eladsegal commented 1 year ago

For anyone interested, I see that with dill==0.3.6 the workaround I suggested doesn't work anymore. I opened an issue about it: https://github.com/uqfoundation/dill/issues/572.

huggingface / datasets