gnobitab / RectifiedFlow

Official Implementation of Rectified Flow (ICLR2023 Spotlight)
663 stars 40 forks source link

out of memory #5

Closed pengzhangzhi closed 1 year ago

pengzhangzhi commented 1 year ago

Hi, I run the inference example python ./main.py --config ./configs/rectified_flow/cifar10_rf_gaussian_ddpmpp.py --eval_folder eval --mode eval --workdir ./logs/1_rectified_flow --config.eval.enable_sampling --config.eval.batch_size 1024 --config.eval.num_samples 50000 --config.eval.begin_ckpt 8 and got an OOM error. I set the batch size = 1, num_samples =1. My GPU has 24576MiB.

Is there any way to bypass the OOM?

I0220 07:32:34.000750 139647609180864 resolver.py:106] Using /tmp/tfhub_modules to cache modules.
I0220 07:32:36.494689 139647609180864 run_lib.py:273] begin checkpoint: 8
Traceback (most recent call last):
  File "main.py", line 74, in <module>
    app.run(main)
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "main.py", line 66, in main
    run_lib.evaluate(FLAGS.config, FLAGS.workdir, FLAGS.eval_folder)
  File "/root/RectifiedFlow/ImageGeneration/run_lib.py", line 286, in evaluate
    state = restore_checkpoint(ckpt_path, state, device=config.device)
  File "/root/RectifiedFlow/ImageGeneration/utils.py", line 14, in restore_checkpoint
    loaded_state = torch.load(ckpt_dir, map_location=device)
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/torch/serialization.py", line 1019, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/torch/serialization.py", line 1001, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/torch/serialization.py", line 973, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File "/opt/anaconda3/envs/sde/lib/python3.7/site-packages/torch/_utils.py", line 78, in _cuda
    return torch._UntypedStorage(self.size(), device=torch.device('cuda')).copy_(self, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 918.46 MiB already allocated; 13.56 MiB free; 968.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
gnobitab commented 1 year ago

It looks like something wrong when loading the checkpoint...maybe lower pytorch version to 1.11.0 following https://pytorch.org/get-started/previous-versions/ ?