sail-sg / MDT

Masked Diffusion Transformer is the SOTA for image synthesis. (ICCV 2023)
Apache License 2.0

CUDA Assert Error #2

Status: Closed (closed by cyrilzakka 6 months ago)

cyrilzakka commented 1 year ago

Hello,

Great paper! I'm trying to train the same model on a custom dataset, but I'm running into the following error:

Traceback (most recent call last):
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 100, in <module>
    main()
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 46, in main
    TrainLoop(
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 178, in run_loop
    self.run_step(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 192, in run_step
    self.forward_backward(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 231, in forward_backward
    losses = compute_losses()
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 128, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 444, in forward
    y = self.y_embedder(y, self.training)    # (N, D)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 191, in forward
    embeddings = self.embedding_table(labels)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

It looks like an out-of-range index in the embedding lookup table. Any advice would be greatly appreciated!

gasvn commented 1 year ago

It seems the label format in your dataset differs from that of the default dataset, or the number of classes is not set correctly in the model. If the problem persists, please share your code so I can help debug it~
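A device-side assert raised inside the embedding lookup usually means that some label index falls outside the table, so one quick way to test both guesses above is to scan the labels before they reach the model. The following is a minimal sketch in plain Python, not code from the MDT repo; `find_bad_labels` and the sample batch are hypothetical names and values chosen for illustration.

```python
from collections import Counter


def find_bad_labels(labels, num_classes):
    """Count label values that fall outside the valid range [0, num_classes)."""
    return Counter(y for y in labels if not 0 <= y < num_classes)


# Hypothetical labels, as a data loader might yield them for one batch.
batch_labels = [0, 3, 18, 19, 412, 412]

# With 19 classes, valid labels are 0..18; anything else is reported.
bad = find_bad_labels(batch_labels, num_classes=19)
if bad:
    print("out-of-range labels (value: count):", dict(bad))
```

If a check like this reports any values, the mismatch is in how labels are produced rather than in the model itself.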

cyrilzakka commented 1 year ago

@gasvn My apologies - how should the dataset be structured? I simply have a parent folder with 19 subfolders, each containing images, and I've set NUM_CLASSES=19 at the following lines: https://github.com/sail-sg/MDT/blob/bf90054be778fb5a2130baa0f8fb3058672a288a/masked_diffusion/models.py#L270 and https://github.com/sail-sg/MDT/blob/bf90054be778fb5a2130baa0f8fb3058672a288a/masked_diffusion/script_util.py#L7

gasvn commented 1 year ago

The dataloader in this project is borrowed from the ADM repo. It derives each image's class from the file name rather than from the folder name. For the standard ImageNet dataset this makes no difference, because ImageNet file names are composed of classname+imageid. So you need to rename your images to classname+imageid.jpg, as described in this issue: https://github.com/openai/guided-diffusion/issues/95
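For a folder-per-class layout like the one described earlier in the thread, the renaming can be scripted. The sketch below rests on one assumption that the thread does not state: that the loader takes the class to be the part of the file name before the first underscore, so files are written as `classname_index.ext`. The helpers `flatten_dataset` and `labels_from_names` are hypothetical and are not part of MDT or guided-diffusion; the second one only mimics the assumed naming rule so the result can be checked before training.

```python
import shutil
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def flatten_dataset(src_root, dst_root):
    """Copy src_root/<class>/<image> to dst_root/<class>_<index><ext>."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)
    copied = 0
    for class_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        # Under the assumed rule the class name itself must not contain "_".
        # Note: two folders that differ only by "_" vs "-" would collide here.
        class_name = class_dir.name.replace("_", "-")
        images = sorted(
            p for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS
        )
        for index, image in enumerate(images):
            target = dst_root / f"{class_name}_{index:06d}{image.suffix.lower()}"
            shutil.copy2(image, target)  # copy, so the original data is untouched
            copied += 1
    return copied


def labels_from_names(data_dir):
    """Assign integer labels from file-name prefixes (assumed loader behaviour)."""
    names = sorted(p.name for p in Path(data_dir).iterdir() if p.is_file())
    classes = sorted({name.split("_")[0] for name in names})
    class_to_index = {c: i for i, c in enumerate(classes)}
    return {name: class_to_index[name.split("_")[0]] for name in names}
```

After running `flatten_dataset` on the 19-subfolder dataset, `len(set(labels_from_names(dst_root).values()))` should come out as 19. If the original file names carried no class prefix, a name-based rule like this could yield far more distinct classes than NUM_CLASSES, which would be consistent with the out-of-range embedding index in the traceback.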