huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

WebLoader Support #2083

Open karamavusibrahim opened 1 year ago

karamavusibrahim commented 1 year ago

System Info

CUDA 11.8
My environment is:
      - accelerate==0.24.0
      - braceexpand==0.1.7
      - certifi==2022.12.7
      - charset-normalizer==2.1.1
      - diffusers==0.21.4
      - filelock==3.9.0
      - fsspec==2023.4.0
      - huggingface-hub==0.17.3
      - idna==3.4
      - imageio==2.31.6
      - importlib-metadata==6.8.0
      - jinja2==3.1.2
      - lazy-loader==0.3
      - markupsafe==2.1.2
      - mpmath==1.3.0
      - natsort==8.4.0
      - networkx==3.0
      - numpy==1.24.1
      - opencv-python==4.8.1.78
      - packaging==23.2
      - pandas==2.1.1
      - pillow==9.3.0
      - psutil==5.9.6
      - python-dateutil==2.8.2
      - pytz==2023.3.post1
      - pyyaml==6.0.1
      - regex==2023.10.3
      - requests==2.28.1
      - safetensors==0.4.0
      - scikit-image==0.22.0
      - scipy==1.11.3
      - six==1.16.0
      - sympy==1.12
      - tifffile==2023.9.26
      - tokenizers==0.14.1
      - torch==2.1.0+cu118
      - torchaudio==2.1.0+cu118
      - torchvision==0.16.0+cu118
      - tqdm==4.66.1
      - transformers==4.34.1
      - triton==2.1.0
      - typing-extensions==4.4.0
      - tzdata==2023.3
      - urllib3==1.26.13
      - webdataset==0.2.62
      - zipp==3.17.0

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [x] My own task or dataset (give details below)

Reproduction

I am trying to use webdataset and WebLoader in my training code. My code runs without any error if I use webdataset with torch's DataLoader; however, if I use webdataset with WebLoader, I get this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument weight in method wrapper_CUDA___slow_conv2d_forward)

Expected behavior

It should work without any error, just as the webdataset + DataLoader configuration does.

BenjaminBossan commented 1 year ago

This is going to be hard to debug without the code that produces the error. Would it be possible for you to share it, at least a minimal version to reproduce the error?

karamavusibrahim commented 1 year ago
import torch
import webdataset as wds

url = ...             # my tar file paths
preprocess_wds = ...  # transform functions

train_dataset = wds.WebDataset(url, resampled=True).shuffle(1000)
train_dataset = train_dataset.decode("pil", handler=wds.warn_and_continue).to_tuple("jpg;png")
train_dataset = train_dataset.map(preprocess_wds)
train_dataset = train_dataset.with_epoch(10000)

# This works:
train_dataloader = torch.utils.data.DataLoader(train_dataset, num_workers=12, batch_size=args.train_batch_size, shuffle=False, persistent_workers=True, collate_fn=collate_fn)

# If I use this instead of the DataLoader above, I get the error:
train_dataloader = wds.WebLoader(train_dataset, batch_size=args.train_batch_size, collate_fn=collate_fn, num_workers=12, shuffle=False, persistent_workers=True)
BenjaminBossan commented 1 year ago

So, can you use PyTorch's DataLoader or do you need to use WebLoader? I checked quickly, and WebLoader doesn't seem to be an instance of DataLoader, which could be problematic.
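
If you want to double-check on your side, something like this should show it (a minimal sketch using the names from your snippet):

import torch
import webdataset as wds

loader = wds.WebLoader(train_dataset, batch_size=args.train_batch_size, collate_fn=collate_fn)
# If this prints False, accelerator.prepare() most likely passes the loader
# through unchanged, so batches are never moved to the GPU, which would
# explain the cpu vs. cuda:1 mismatch in the error above.
print(isinstance(loader, torch.utils.data.DataLoader))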

karamavusibrahim commented 1 year ago

According to my tests, using webdataset + WebLoader is faster than using webdataset + DataLoader, so I am trying to use the webdataset + WebLoader configuration.

BenjaminBossan commented 1 year ago

If possible, could you please test something? Depending on the result, it might give us an idea of how to fix the issue for good. The test would be to define a custom data loader and use it instead:

class MyLoader(wds.WebLoader, torch.utils.data.DataLoader):
    pass

train_dataloader = MyLoader(train_dataset, ...)

If you could report back whether this works and whether it attains the same speed, that would be awesome.
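
The idea behind the test (a sketch, not verified) is that prepare() dispatches on the DataLoader type, so a class that also inherits from torch.utils.data.DataLoader might get wrapped like a regular loader:

from accelerate import Accelerator

accelerator = Accelerator()

# MyLoader as defined above; model and optimizer come from your training script.
train_dataloader = MyLoader(train_dataset, batch_size=args.train_batch_size,
                            collate_fn=collate_fn, num_workers=12)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)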

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

boqian-li commented 9 months ago

Hi, I want to know how to use webdataset with torch's DataLoader, and also how to use Accelerate with them. Thanks.

edgarriba commented 1 month ago

@karamavusibrahim any progress here? We are facing similar issues when combining WebLoader with Accelerate, and ended up using WebDataset + the torch DataLoader. @muellerzr any more insights here?

Monohydroxides commented 5 days ago

@karamavusibrahim any progress here? We are facing similar issues when combining WebLoader with Accelerate, and ended up using WebDataset + the torch DataLoader. @muellerzr any more insights here?

Could you please share the detailed code for how you use WebDataset with the torch DataLoader?
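
For reference, a minimal sketch of the WebDataset + torch DataLoader + Accelerate combination discussed in this thread (url, preprocess_wds, collate_fn, args, model, and optimizer are placeholders from the earlier snippets, not a verified recipe):

import torch
import webdataset as wds
from accelerate import Accelerator

accelerator = Accelerator()

# Build the WebDataset pipeline as in the original report.
train_dataset = wds.WebDataset(url, resampled=True).shuffle(1000)
train_dataset = train_dataset.decode("pil", handler=wds.warn_and_continue).to_tuple("jpg;png")
train_dataset = train_dataset.map(preprocess_wds).with_epoch(10000)

# Wrap it in the regular torch DataLoader, which prepare() knows how to handle.
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.train_batch_size, num_workers=12,
    persistent_workers=True, collate_fn=collate_fn,
)

model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for batch in train_dataloader:
    # After prepare(), batches should already be on accelerator.device;
    # the exact batch structure depends on collate_fn.
    loss = model(*batch).mean()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()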