Pickled custom model with '.' and trust_remote_code=True and set_start_method("spawn") raises ModuleNotFoundError

rsamf commented 3 weeks ago

System Info

transformers version: 4.44.2
Platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.24.6
Safetensors version: 0.4.4
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): 2.3.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: yes
Using GPU in script?: no
GPU type: NVIDIA GeForce RTX 3070

Who can help?

@Rocketknight1 @not-lain

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

I am trying to use a model with a custom architecture that has a '.' in its name: "briaai/RMBG-1.4". Historically, there were issues with this, and they were resolved (see #29251). Now I'm doing something more niche: I need to pickle the model and send it to another process that is started with the "spawn" method via set_start_method("spawn").

See the following minimal reproducible snippet:


from transformers import pipeline
from datasets import load_dataset
import torch.multiprocessing as mp

def m(arg):
    # No-op worker: the error is raised while the child process unpickles `arg`.
    pass

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)

    pipe = pipeline(model="briaai/RMBG-1.4", device="cpu", trust_remote_code=True)
    dataset = load_dataset("microsoft/cats_vs_dogs", split="train")
    with mp.Pool(1) as p:
        arg = (pipe, dataset)
        p.map(m, (arg,))

The same error from #29251 shows up:

ModuleNotFoundError: No module named 'transformers_modules.briaai.RMBG-1'

However, when the start method is "fork", it works fine.

It appears that processes started with the "spawn" method still can't find custom Hugging Face modules with a '.' in their name, while processes started with "fork" can.
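
One plausible reading of the spawn/fork difference can be sketched with the standard library alone. pickle stores a class by reference (module name plus class name), and a spawned child is a fresh interpreter, so it must re-import that module by name, whereas a forked child inherits the parent's sys.modules. The sketch below assumes the custom code lives in a module that was only registered in sys.modules at runtime; the names fake_modules and Net are hypothetical, and this is not a confirmed diagnosis of what transformers does internally.

```python
import pickle
import sys
import types

# Hypothetical names: a top-level package and a module whose name contains a '.'.
PACKAGE = "fake_modules"
MODULE = "fake_modules.org.model-1.4"


def make_dynamic_instance():
    """Return an instance of a class whose module exists only in sys.modules."""
    sys.modules[PACKAGE] = types.ModuleType(PACKAGE)
    module = types.ModuleType(MODULE)
    sys.modules[MODULE] = module
    cls = type("Net", (), {"__module__": MODULE})
    module.Net = cls
    return cls()


def unpickle_without_module():
    """Pickle in the 'parent', then unpickle after forgetting the modules."""
    payload = pickle.dumps(make_dynamic_instance())
    # The payload only references the class by module name; no code is embedded.
    assert b"model-1.4" in payload
    # Simulate a spawned child: a fresh interpreter has no such sys.modules entries.
    for name in (PACKAGE, MODULE):
        del sys.modules[name]
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return type(exc).__name__
    return "loaded"
```

Calling unpickle_without_module() returns the name of the exception raised by pickle.loads. Skipping the cleanup loop, which roughly corresponds to what a forked child sees, lets the same payload load.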

I understand that I could avoid pickling the model and sending it to other processes by keeping it at module top level, but it would be convenient for my project if I could pickle it.
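
For reference, the non-pickling alternative mentioned above could look roughly like the following: each worker builds its own pipeline in a Pool initializer, so only the work items cross the process boundary. This is a minimal sketch; build_pipeline is a hypothetical stand-in for the real pipeline(...) call, and str.upper merely plays the role of the model so the example stays self-contained.

```python
import multiprocessing as mp

_PIPE = None  # per-process global, filled in by init_worker


def build_pipeline():
    """Hypothetical stand-in for pipeline(model=..., trust_remote_code=True)."""
    return str.upper


def init_worker():
    # Runs once inside each worker process, so the model itself is never pickled.
    global _PIPE
    _PIPE = build_pipeline()


def work(item):
    # Only `item` and the return value are sent between processes.
    return _PIPE(item)


def run(items, processes=1):
    # Use an explicit "spawn" context instead of changing the global start method.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes, initializer=init_worker) as pool:
        return pool.map(work, items)
```

With "spawn", run(...) has to be called from under an if __name__ == "__main__": guard in a script, as in the reproducer above. The trade-off is that every worker pays the model loading cost once.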

Expected behavior

For the module transformers_modules.briaai.RMBG-1.4 to be found.

Rocketknight1 commented 3 weeks ago

Just making sure, have you tried this code with a repo that doesn't have a . in its name?

not-lain commented 3 weeks ago

Hi @rsamf, I couldn't reproduce the error on Colab using your reproducer code. Can you confirm that your transformers version is >=4.39.1? If yes, try running the command transformers-cli env and paste your environment version here.

rsamf commented 3 weeks ago

> Just making sure, have you tried this code with a repo that doesn't have a . in its name?

Yes I have. Sorry for not mentioning that.

rsamf commented 3 weeks ago

> Hi @rsamf
>
> I couldn't reproduce the error on colab using your reproducer code.
>
> can you confirm that your transformers version is >=4.39.1 ?
>
> if yes try running the command transformers-cli env and paste your environment version here.

transformers version: 4.44.2

You should be able to see it in my post under System Info

Rocketknight1 commented 3 weeks ago

I can reproduce the error when I run that script! However, I'm not sure if our models are intended to be pickle-safe - diagnosing issues that involve both the import machinery and multiprocessing will likely be very annoying, so we probably can't prioritize this one! I'd accept a PR if anyone can figure it out, though.

rsamf commented 3 weeks ago

Thanks @Rocketknight1 and @not-lain for the quick responses.

To add a bit more context: I am trying to maximize GPU utilization by parallelizing just the preprocess step in pipelines. Some pipelines, such as RMBG-1.4, send their resulting tensors to CUDA, which requires me to use the "spawn" start method. Note that my snippet doesn't use CUDA because that part is not the issue. In my experience, however, most pipelines don't use CUDA in the preprocess step, which lets me use "fork"; that is less error-prone and doesn't cause the ModuleNotFoundError. Also, with "spawn", I have to delete the CUDA tensors in the producer process and return cloned CPU ones, which is annoying.

I understand that my problem is likely rare in the community, and I am starting to consider dropping my investigation because of other unforeseen issues with the "spawn" start method. However, I will leave a more relatable snippet below, just in case:

Does not work:

from transformers import pipeline
from datasets import load_dataset
import torch.multiprocessing as mp
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import Dataset

class Preprocess(Dataset):
    def __init__(self, dataset, preprocess_fn):
        self.dataset = dataset
        self.preprocess_fn = preprocess_fn

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        print("Script should fail before reaching this point")
        return self.preprocess_fn(self.dataset[idx]["image"])

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)

    pipe = pipeline(model="briaai/RMBG-1.4", device="cpu", trust_remote_code=True)
    dataset = load_dataset("microsoft/cats_vs_dogs", split="train")
    dataset = Preprocess(dataset, pipe.preprocess)
    dataloader = DataLoader(dataset, 4, num_workers=2)

    for batch in dataloader:
        break

This one works:

from transformers import pipeline
from datasets import load_dataset
import torch.multiprocessing as mp
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import Dataset

class Preprocess(Dataset):
    def __init__(self, dataset, preprocess_fn):
        self.dataset = dataset
        self.preprocess_fn = preprocess_fn

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        print("Works!")
        return self.preprocess_fn(self.dataset[idx]["image"])

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)

    pipe = pipeline(model="microsoft/resnet-50", device="cpu") # changed to an officially supported architecture
    dataset = load_dataset("microsoft/cats_vs_dogs", split="train")
    dataset = Preprocess(dataset, pipe.preprocess)
    dataloader = DataLoader(dataset, 4, num_workers=2)

    for batch in dataloader:
        break
