[BUG] TypeError: unsupported operand type(s) for /=: 'NoneType' and 'int'

Thomas2419 commented 1 year ago

🐛 Describe the bug

This bug occurs when trying to train using train_colossalai.yaml:

Traceback (most recent call last):

/media/thomas/108E73348E731208/Users/Thoma/Desktop/dndiffusion/ColossalAI/examples/images/diffusion/main.py:804

   801         # run
   802         if opt.train:
   803             try:
❱  804                 trainer.fit(model, data)
   805             except Exception:
   806                 melk()
   807                 raise

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:599 in fit

   596         if not isinstance(model, pl.LightningModule):
   597             raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.
   598         self.strategy._lightning_module = model
❱  599         call._call_and_handle_interrupt(
   600             self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,
   601         )
   602

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:36 in _call_and_handle_interrupt

    33     """
    34     try:
    35         if trainer.strategy.launcher is not None:
❱   36             return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,
    37         else:
    38             return trainer_fn(*args, **kwargs)
    39

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py:88 in launch

    85         """
    86         if not self.cluster_environment.creates_processes_externally:
    87             self._call_children_scripts()
❱   88         return function(*args, **kwargs)
    89
    90     def _call_children_scripts(self) -> None:
    91         # bookkeeping of spawned processes

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:641 in _fit_impl

   638             model_provided=True,
   639             model_connected=self.lightning_module is not None,
   640         )
❱  641         self._run(model, ckpt_path=self.ckpt_path)
   642
   643         assert self.state.stopped
   644         self.training = False

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1075 in _run

  1072         self._logger_connector.reset_metrics()
  1073
  1074         # strategy will configure model and move it to the device
❱ 1075         self.strategy.setup(self)
  1076
  1077         # hook
  1078         if self.state.fn == TrainerFn.FITTING:

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py:367 in setup

   364         self.lightning_module._device = self.root_device
   365         self.ignore_no_grad_parameters(self.root_device)
   366         self.setup_optimizers(trainer)
❱  367         self.setup_precision_plugin()
   368         self.model_to_device()
   369
   370     def ignore_no_grad_parameters(self, running_device: torch.device) -> None:

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py:303 in setup_precision_plugin

   300                 min_chunk_size_mb: float = self.chunk_size_search_kwargs.get(
   301                     "min_chunk_size", 32 * 1048576
   302                 )  # type: ignore[assignment]
❱  303                 min_chunk_size_mb /= 1048576
   304
   305                 model = _LightningModuleWrapperBase(self.model)
   306                 self.model = GeminiDDP(

TypeError: unsupported operand type(s) for /=: 'NoneType' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

/media/thomas/108E73348E731208/Users/Thoma/Desktop/dndiffusion/ColossalAI/examples/images/diffusion/main.py:806

   803             try:
   804                 trainer.fit(model, data)
   805             except Exception:
❱  806                 melk()
   807                 raise
   808         # if not opt.no_test and not trainer.interrupted:
   809         # trainer.test(model, data)

/media/thomas/108E73348E731208/Users/Thoma/Desktop/dndiffusion/ColossalAI/examples/images/diffusion/main.py:789 in melk

   786             if trainer.global_rank == 0:
   787                 print("Summoning checkpoint.")
   788                 ckpt_path = os.path.join(ckptdir, "last.ckpt")
❱  789                 trainer.save_checkpoint(ckpt_path)
   790
   791         def divein(*args, **kwargs):
   792             if trainer.global_rank == 0:

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1937 in save_checkpoint

  1934                 "Saving a checkpoint is only possible if a model is attached to the Trai
  1935                 " Trainer.save_checkpoint() before calling Trainer.{fit,validate,test
  1936             )
❱ 1937         self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only,
  1938
  1939     """
  1940     Parsing properties

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:512 in save_checkpoint

   509             weights_only: saving model weights only
   510             storage_options: parameter for how to save to storage, passed to ``Checkpoin
   511         """
❱  512         _checkpoint = self.dump_checkpoint(weights_only)
   513         self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=sto
   514
   515     def _get_lightning_module_state_dict(self) -> Dict[str, Tensor]:

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:444 in dump_checkpoint

   441             "epoch": self.trainer.current_epoch,
   442             "global_step": self.trainer.global_step,
   443             "pytorch-lightning_version": pl.__version__,
❱  444             "state_dict": self._get_lightning_module_state_dict(),
   445             "loops": self._get_loops_state_dict(),
   446         }
   447

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:526 in _get_lightning_module_state_dict

   523             metric.persistent(True)
   524             metric.sync()
   525
❱  526         state_dict = self.trainer.strategy.lightning_module_state_dict()
   527
   528         for metric in metrics:
   529             # sync can be a no-op (e.g. on cpu) so `unsync` would raise a user error exc

/home/thomas/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py:426 in lightning_module_state_dict

   423         with _patch_cuda_is_available():
   424             from colossalai.nn.parallel import ZeroDDP
   425
❱  426         assert isinstance(self.model, ZeroDDP)
   427         org_dict = self.model.state_dict(only_rank_0=rank_zero_only)
   428
   429         children = list(self.model.named_children())

AssertionError
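
The first traceback fails at `min_chunk_size_mb /= 1048576`, so the preceding `.get("min_chunk_size", ...)` call must have returned `None`. One way that can happen is that the key is present with an explicit `None` value, in which case `dict.get` does not fall back to its default. That is an assumption: the thread does not show what the dictionary held. The sketch below reproduces the error message with a plain dict and shows one possible guard. `resolve_min_chunk_size_mb` is a hypothetical helper for illustration, not the change made in the Lightning hotfix.

```python
BYTES_PER_MB = 1048576
DEFAULT_MIN_CHUNK_SIZE = 32 * BYTES_PER_MB  # default seen in the traceback, in bytes

# Assumed input: the key exists but holds None (not confirmed by the thread).
search_kwargs = {"min_chunk_size": None}

# dict.get only uses the default when the key is missing,
# so the stored None comes back unchanged.
value = search_kwargs.get("min_chunk_size", DEFAULT_MIN_CHUNK_SIZE)

error_message = ""
try:
    value /= BYTES_PER_MB
except TypeError as exc:
    error_message = str(exc)  # same message as in the report


def resolve_min_chunk_size_mb(kwargs: dict) -> float:
    """Hypothetical guard: treat a missing or None entry as the default."""
    size = kwargs.get("min_chunk_size")
    if size is None:
        size = DEFAULT_MIN_CHUNK_SIZE
    return size / BYTES_PER_MB
```

With this guard, both an absent key and an explicit `None` resolve to 32 MB, while an explicit size in bytes is converted as before.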

Environment

Using the Conda environment given in the repository. CUDA is supported up to 11.8, on Ubuntu 20.04, with the proprietary NVIDIA driver 525.

PIP freeze:

absl-py==1.3.0
accelerate==0.15.0
aiohttp==3.8.3
aiosignal==1.3.1
albumentations==1.3.0
altair==4.2.0
antlr4-python3-runtime==4.8
async-timeout==4.0.2
attrs==22.2.0
bcrypt==4.0.1
blinker==1.5
braceexpand==0.1.7
brotlipy==0.7.0
cachetools==5.2.0
certifi @ file:///croot/certifi_1671487769961/work/certifi
cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work
cfgv==3.3.1
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.1.3
coloredlogs==15.0.1
colossalai==0.1.12+torch1.12cu11.3
commonmark==0.9.1
contexttimer==0.3.3
cryptography @ file:///croot/cryptography_1665612644927/work
datasets==2.8.0
decorator==5.1.1
diffusers==0.11.1
dill==0.3.6
distlib==0.3.6
einops==0.3.0
entrypoints==0.4
fabric==2.7.1
filelock==3.8.2
flatbuffers==22.12.6
flit-core @ file:///opt/conda/conda-bld/flit-core_1644941570762/work/source/flit_core
frozenlist==1.3.3
fsspec==2022.11.0
ftfy==6.1.1
future==0.18.2
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.15.0
google-auth-oauthlib==0.4.6
grpcio==1.51.1
huggingface-hub==0.11.1
humanfriendly==10.0
identify==2.5.11
idna @ file:///croot/idna_1666125576474/work
imageio==2.9.0
imageio-ffmpeg==0.4.2
importlib-metadata==5.2.0
invisible-watermark==0.1.5
invoke==1.7.3
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.17.3
kornia==0.6.0
latent-diffusion @ file:///media/thomas/108E73348E731208/Users/Thoma/Desktop/dndiffusion/ColossalAI/examples/images/diffusion
lightning-utilities==0.5.0
Markdown==3.4.1
MarkupSafe==2.1.1
mkl-fft==1.3.1
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186066731/work
mkl-service==2.4.0
modelcards==0.1.6
mpmath==1.2.1
multidict==6.0.4
multiprocess==0.70.14
networkx==2.8.8
nodeenv==1.7.0
numpy @ file:///tmp/abs_653_j00fmm/croots/recipe/numpy_and_numpy_base_1659432701727/work
oauthlib==3.2.2
omegaconf==2.1.1
onnx==1.13.0
onnxruntime==1.13.1
open-clip-torch==2.0.2
opencv-python==4.6.0.66
opencv-python-headless==4.6.0.66
packaging==22.0
pandas==1.5.2
paramiko==2.12.0
pathlib2==2.3.7.post1
Pillow==9.3.0
platformdirs==2.6.0
pre-commit==2.21.0
prefetch-generator==1.0.3
protobuf==3.20.1
psutil==5.9.4
pyarrow==10.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydeck==0.8.0
pyDeprecate==0.3.2
Pygments==2.13.0
Pympler==1.0.1
PyNaCl==1.5.0
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
pyrsistent==0.19.2
PySocks @ file:///tmp/build/80754af9/pysocks_1605305812635/work
python-dateutil==2.8.2
pytorch-lightning @ file:///media/thomas/108E73348E731208/Users/Thoma/Desktop/dndiffusion/ColossalAI/examples/images/diffusion/lightning
pytz==2022.7
pytz-deprecation-shim==0.1.0.post0
PyWavelets==1.4.1
PyYAML==6.0
qudida==0.0.4
regex==2022.10.31
requests @ file:///opt/conda/conda-bld/requests_1657734628632/work
requests-oauthlib==1.3.1
responses==0.18.0
rich==12.6.0
rsa==4.9
scikit-image==0.19.3
scikit-learn==1.2.0
scipy==1.9.3
semver==2.13.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.0
streamlit==1.12.1
streamlit-drawable-canvas==0.8.0
sympy==1.11.1
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.5.1
test-tube==0.7.5
threadpoolctl==3.1.0
tifffile==2022.10.10
tokenizers==0.12.1
toml==0.10.2
toolz==0.12.0
torch==1.12.1
torchmetrics==0.7.0
torchvision==0.13.1
tornado==6.2
tqdm==4.64.1
transformers==4.25.1
triton==1.1.1
typing-extensions @ file:///croot/typing_extensions_1669924550328/work
tzdata==2022.7
tzlocal==4.2
urllib3 @ file:///croot/urllib3_1670526988650/work
validators==0.20.0
virtualenv==20.17.1
watchdog==2.2.0
wcwidth==0.2.5
webdataset==0.2.5
Werkzeug==2.2.2
xformers==0.0.15.dev395+git.7e05e2c
xxhash==3.1.0
yarl==1.8.2
zipp==3.11.0

P.S. While unrelated to the bug, I wanted to thank you for doing the work that you do. You're all so very amazing.

haoli-zbdbc commented 1 year ago

I have the same problem. How can I solve it, please?

1SAA commented 1 year ago

Hi @LhaoH @Thomas2419

I've updated the strategy/colossalai branch in my lightning repo and fixed this bug. You guys can try it out. Thank you for reporting bugs and feel free to discuss any problem here. I will merge this newest hotfix to the master branch of Lightning as soon as possible.

Thomas2419 commented 1 year ago

Hello, thank you for the reply. I applied the fix manually and it worked, thank you.

haoli-zbdbc commented 1 year ago

@1SAA
Hello, thank you very much for fixing this bug, my problem has been solved. But after training, I couldn't generate a normal image from a prompt.

This is my inference command:

python scripts/txt2img.py --prompt "photo of a man wearing a pure white shir and a long pants" --plms \
    --outdir ./output \
    --config /tmp/2022-12-28T09-59-07_train_colossalai_teyvat/configs/2022-12-28T09-59-07-project.yaml \
    --ckpt /tmp/2022-12-28T09-59-07_train_colossalai_teyvat/checkpoints/last.ckpt \
    --n_samples 4

This is my result:

Is it because only two epochs are set, or is the batch size too small? My dataset has more than 4000 images, and my config is as follows:

model:
  base_learning_rate: 1.0e-4
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    parameterization: "v"
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: txt
    image_size: 64
    channels: 4
    cond_stage_trainable: false
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False # we set this to false because this is an inference only config


    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1.e-4 ]
        f_min: [ 1.e-10 ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        use_checkpoint: True
        use_fp16: True
        image_size: 32 # unused
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_head_channels: 64 # need to fix for flash-attn
        use_spatial_transformer: True
        use_linear_in_transformer: True
        transformer_depth: 1
        context_dim: 1024
        legacy: False

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          #attn_type: "vanilla-xformers"
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenOpenCLIPEmbedder
      params:
        freeze: True
        layer: "penultimate"

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 16
    num_workers: 4
    train:
      target: ldm.data.teyvat.hf_dataset
      params:
        path: zbdbc/fashion
        image_transforms:
        - target: torchvision.transforms.Resize
          params:
            size: 512
        - target: torchvision.transforms.RandomCrop
          params:
            size: 512
        - target: torchvision.transforms.RandomHorizontalFlip

lightning:
  trainer:
    accelerator: 'gpu'
    devices: 4
    log_gpu_memory: all
    max_epochs: 2
    precision: 16
    auto_select_gpus: False
    strategy:
      target: strategies.ColossalAIStrategy
      params:
        use_chunk: True
        enable_distributed_storage: True
        placement_policy: auto
        force_outputs_fp32: true

    log_every_n_steps: 2
    logger: True
    default_root_dir: "/tmp/diff_log/"
    # profiler: pytorch

  logger_config:
    wandb:
      target: loggers.WandbLogger
      params:
          name: nowname
          save_dir: "/tmp/diff_log/"
          offline: opt.debug
          id: nowname
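
For a sense of scale, the settings above can be turned into a rough count of optimizer steps. This is a back-of-the-envelope sketch, assuming `batch_size` is per device, each of the 4 devices sees a disjoint shard of the data, and there is no gradient accumulation (none is configured above). `optimizer_steps` is a hypothetical helper, not part of the training script.

```python
import math


def optimizer_steps(num_images: int, batch_size_per_device: int,
                    devices: int, epochs: int) -> int:
    """Rough count of optimizer steps, assuming each device sees a disjoint shard."""
    images_per_step = batch_size_per_device * devices
    steps_per_epoch = math.ceil(num_images / images_per_step)
    return steps_per_epoch * epochs


# Values from the config above; 4000 is the lower bound given for the dataset size.
total_steps = optimizer_steps(num_images=4000, batch_size_per_device=16,
                              devices=4, epochs=2)
print(total_steps)  # 126
```

Under these assumptions the run performs on the order of a hundred updates. Whether that alone explains the output is not something this estimate can settle.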

Thomas2419 commented 1 year ago

Hmm, it turns out I am actually having the same issue. I've run multiple training tests so far, and my loss does go down.

haoli-zbdbc commented 1 year ago

In fact, the bug seems to have been introduced two weeks ago. I tried the Teyvat dataset provided in the example and got the same problem. See: [BUG]: finetone on Teyvat Datasets, but got unexpected results #2140

Thomas2419 commented 1 year ago

@LhaoH Have you tried using the docker image to make an environment and see if the bug still persists there? I’m wondering if it’s a highly specific problem with package versions that doesn’t throw an error.
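
One way to compare two environments for this kind of silent version mismatch is to print the installed versions of the relevant packages in each and diff the output. A minimal sketch using the standard library's `importlib.metadata`; the package list is taken from names mentioned in this thread and is only a starting point, and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from this thread; adjust the list as needed.
    packages = ["colossalai", "pytorch-lightning", "torch", "xformers", "triton", "apex"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

Running the same script inside the Docker image and in the Conda environment would show which packages differ or are missing, although it cannot tell whether a difference is the cause.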

haoli-zbdbc commented 1 year ago

@Thomas2419 I tried the Docker image at first, but it failed while installing apex (the third RUN command), so I abandoned the Docker image. I don't think it has anything to do with the environment. After all, I can train on the dataset normally and generate a checkpoint file.

Thomas2419 commented 1 year ago

I trained a model before the repository updated to Stable Diffusion 2.0, and my checkpoints were always 11.1 GB; the new checkpoints I'm getting are 10.4 GB. I know things were made more efficient, but I've also had problems getting things like apex to work, so I'm wondering whether, without some of these packages, training is not working properly even though no errors are thrown.
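
To see where a size difference like this comes from, the weights in each checkpoint can be summed per top-level module and the two summaries compared. The sketch below is a hypothetical helper that works on any mapping whose values expose `numel()` and `element_size()`, as `torch.Tensor` does. It only looks at model weights, so anything else stored in the checkpoint file is not counted.

```python
from collections import defaultdict
from typing import Any, Dict, Mapping


def bytes_by_prefix(state_dict: Mapping[str, Any]) -> Dict[str, int]:
    """Sum tensor storage per top-level key prefix (the part before the first dot)."""
    totals: Dict[str, int] = defaultdict(int)
    for name, tensor in state_dict.items():
        if not hasattr(tensor, "numel"):
            continue  # skip non-tensor entries
        prefix = name.split(".", 1)[0]
        # numel() and element_size() are standard torch.Tensor methods.
        totals[prefix] += tensor.numel() * tensor.element_size()
    return dict(totals)
```

With PyTorch installed, it could be called as `bytes_by_prefix(torch.load(path, map_location="cpu")["state_dict"])` on the old and new checkpoints; the `"state_dict"` key is the one shown in the `dump_checkpoint` frame of the traceback above.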

hpcaitech / ColossalAI

[BUG] TypeError: unsupported operand type(s) for /=: 'NoneType' and 'int' #2204
