huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Pickled custom model with '.' and trust_remote_code=True and set_start_method("spawn") raises ModuleNotFoundError #34303

Open rsamf opened 3 weeks ago

rsamf commented 3 weeks ago

System Info

Who can help?

@Rocketknight1 @not-lain

Reproduction

I am trying to use a model with a custom architecture that has a '.' in its name: "briaai/RMBG-1.4". Historically, there were issues with this, and they were resolved (see #29251). Now I'm doing something more niche: I need to pickle the model and send it to another process that is started with the "spawn" method via set_start_method("spawn").

See the following minimal reproducible snippet:


from transformers import pipeline
from datasets import load_dataset
import torch.multiprocessing as mp

def m(arg):
    pass

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)

    pipe = pipeline(model="briaai/RMBG-1.4", device="cpu", trust_remote_code=True)
    dataset = load_dataset("microsoft/cats_vs_dogs", split="train")
    with mp.Pool(1) as p:
        arg = (pipe, dataset)
        p.map(m, (arg,))

The same error from #29251 shows up:

ModuleNotFoundError: No module named 'transformers_modules.briaai.RMBG-1'
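
The truncated name in this error is consistent with how Python's import system reads dots: every dot in a module name is treated as a package separator, and the parent package is imported first. The sketch below only illustrates that splitting with plain string operations. `parent_package` is a hypothetical helper, not a transformers or importlib function.

```python
def parent_package(module_name: str) -> str:
    """Return the dotted name Python would import before `module_name`."""
    # The import system splits on the last dot to find the parent package.
    return module_name.rpartition(".")[0]


if __name__ == "__main__":
    name = "transformers_modules.briaai.RMBG-1.4"
    # The trailing "4" is read as a submodule and "RMBG-1" as its parent.
    print(parent_package(name))  # prints transformers_modules.briaai.RMBG-1
```

So a lookup of a module literally named with ".4" at the end would first try to import a package ending in "RMBG-1", which matches the name in the traceback.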

However, when the start method is "fork", it works fine.

It appears that processes started with the "spawn" method still can't find custom Hugging Face modules that have a '.' in their name, while processes started with "fork" work as expected.
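
A possible explanation, sketched with the standard library only: pickle stores a class as a module name plus a class name, so the receiving process must be able to import that module. A forked child inherits the parent's already-imported modules, while a spawned child starts from a fresh interpreter. The example simulates the fresh interpreter by removing a dynamically created module from sys.modules before unpickling. `demo_dynamic_module` and `CustomModel` are made-up names, and this is not a reproduction of the transformers code path.

```python
import pickle
import sys
import types

# Hypothetical name standing in for a module that is generated at runtime.
MODULE_NAME = "demo_dynamic_module"


def make_dynamic_class():
    """Create a class that lives in a module existing only in sys.modules."""
    module = types.ModuleType(MODULE_NAME)
    sys.modules[MODULE_NAME] = module
    cls = type("CustomModel", (), {})
    cls.__module__ = MODULE_NAME  # pickle records this name, not the code
    module.CustomModel = cls
    return cls


def roundtrip_without_module():
    """Pickle an instance, forget the module, then try to unpickle it."""
    cls = make_dynamic_class()
    payload = pickle.dumps(cls())  # works: the module is registered here
    sys.modules.pop(MODULE_NAME, None)  # mimic a freshly spawned interpreter
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(roundtrip_without_module())
```

Unpickling fails with a ModuleNotFoundError because nothing on the import path can recreate the module by name, which is the same class of failure as the one reported above.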

I understand that I could avoid pickling the model and sending it to other processes by keeping it at the top level, but it would be convenient for my project if I could pickle it.

Expected behavior

For the module transformers_modules.briaai.RMBG-1.4 to be found.

Rocketknight1 commented 3 weeks ago

Just making sure, have you tried this code with a repo that doesn't have a . in its name?

not-lain commented 3 weeks ago

Hi @rsamf, I couldn't reproduce the error on Colab using your reproducer code. Can you confirm that your transformers version is >=4.39.1? If yes, try running the command transformers-cli env and paste your environment details here.

rsamf commented 3 weeks ago

Just making sure, have you tried this code with a repo that doesn't have a . in its name?

Yes I have. Sorry for not mentioning that.

rsamf commented 3 weeks ago

Hi @rsamf

I couldn't reproduce the error on colab using your reproducer code.

can you confirm that your transformers version is >=4.39.1 ?

if yes try running the command transformers-cli env and paste your environment version here.

transformers version: 4.44.2

You should be able to see it in my post under System Info

Rocketknight1 commented 3 weeks ago

I can reproduce the error when I run that script! However, I'm not sure if our models are intended to be pickle-safe - diagnosing issues that involve both the import machinery and multiprocessing will likely be very annoying, so we probably can't prioritize this one! I'd accept a PR if anyone can figure it out, though.

rsamf commented 3 weeks ago

Thanks @Rocketknight1 and @not-lain for the quick responses.

Just to add a bit more context: I am trying to maximize the utilization of my GPU by parallelizing only the preprocess step in pipelines. Some pipelines, such as RMBG-1.4, send their resulting tensors to CUDA, which requires me to use the "spawn" method. Note that my snippet doesn't use CUDA because that part is not the issue. In my experience, however, most pipelines don't use CUDA in the preprocess step, which allows me to use "fork". That is less error prone and doesn't cause the ModuleNotFoundError. Also, with "spawn", I have to delete the CUDA tensors in the producer process and return cloned CPU ones, which is annoying.

I understand that my problem is likely rare within the community, and I am even considering dropping my investigation because of other unforeseen issues with the "spawn" start method. However, I will leave a more relatable snippet below, just in case:

Does not work:

from transformers import pipeline
from datasets import load_dataset
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, Dataset

class Preprocess(Dataset):
    def __init__(self, dataset, preprocess_fn):
        self.dataset = dataset
        self.preprocess_fn = preprocess_fn

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        print("Script should fail before reaching this point")
        return self.preprocess_fn(self.dataset[idx]["image"])

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)

    pipe = pipeline(model="briaai/RMBG-1.4", device="cpu", trust_remote_code=True)
    dataset = load_dataset("microsoft/cats_vs_dogs", split="train")
    dataset = Preprocess(dataset, pipe.preprocess)
    dataloader = DataLoader(dataset, 4, num_workers=2)

    for batch in dataloader:
        break

This one works:

from transformers import pipeline
from datasets import load_dataset
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, Dataset

class Preprocess(Dataset):
    def __init__(self, dataset, preprocess_fn):
        self.dataset = dataset
        self.preprocess_fn = preprocess_fn

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        print("Works!")
        return self.preprocess_fn(self.dataset[idx]["image"])

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)

    pipe = pipeline(model="microsoft/resnet-50", device="cpu") # changed to an officially supported architecture
    dataset = load_dataset("microsoft/cats_vs_dogs", split="train")
    dataset = Preprocess(dataset, pipe.preprocess)
    dataloader = DataLoader(dataset, 4, num_workers=2)

    for batch in dataloader:
        break
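
One possible workaround for the failing variant, offered as a sketch only: pass the dataset a top-level factory function instead of pipe.preprocess, and build the pipeline lazily inside each worker so that no object from the custom module has to be pickled. `LazyPreprocess` and `build_preprocess` are hypothetical names. The factory below is a trivial stand-in for a call such as pipeline(model="briaai/RMBG-1.4", trust_remote_code=True).preprocess, and whether this avoids the ModuleNotFoundError in practice has not been verified here.

```python
class LazyPreprocess:
    """Dataset-style wrapper that builds its preprocess function on first use."""

    def __init__(self, dataset, build_fn):
        self.dataset = dataset
        self.build_fn = build_fn  # should be a picklable, top-level callable
        self._preprocess_fn = None  # created lazily, once per process

    def __getstate__(self):
        # Never ship an already-built function to another process.
        state = self.__dict__.copy()
        state["_preprocess_fn"] = None
        return state

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if self._preprocess_fn is None:
            self._preprocess_fn = self.build_fn()
        return self._preprocess_fn(self.dataset[idx])


def build_preprocess():
    # Stand-in factory (assumption): a real one would construct the pipeline
    # here and return its preprocess method.
    return str.upper


if __name__ == "__main__":
    data = LazyPreprocess(["cat", "dog"], build_preprocess)
    print([data[i] for i in range(len(data))])  # prints ['CAT', 'DOG']
```

With this shape, each worker would load its own copy of the model, so memory use grows with num_workers. The snippets above index self.dataset[idx]["image"]; the stand-in data here is a plain list of strings to keep the example self-contained.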