vsitzmann / siren

Official implementation of "Implicit Neural Representations with Periodic Activation Functions"
MIT License

fwi throws out-of-memory error on a 32GB GPU; paper mentions a 24GB GPU. #29

Open kkothari93 opened 3 years ago

kkothari93 commented 3 years ago

From the experiment_scripts/ folder, I try to run the train_inverse_helmholtz.py experiment as follows:

python3 train_inverse_helmholtz.py --experiment_name fwi --batch_size 1

Section 5.3 of the paper's supplementary material states that a single 24GB GPU was used to run this experiment, whereas I am using a 32GB V100, which should be sufficient. However, even with a batch size of 1, I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 31.75 GiB total capacity; 28.32 GiB already allocated; 11.75 MiB free; 30.49 GiB reserved in total by PyTorch)

Here is the full trace:

Traceback (most recent call last):
  File "train_inverse_helmholtz.py", line 78, in <module>
    training.train(model=model, train_dataloader=dataloader, epochs=opt.num_epochs, lr=opt.lr,
  File "../siren/training.py", line 73, in train
    losses = loss_fn(model_output, gt)
  File "../siren/loss_functions.py", line 188, in helmholtz_pml
    b, _ = diff_operators.jacobian(modules.compl_mul(B, dudx2), x)
  File "../siren/diff_operators.py", line 53, in jacobian
    jac[:, :, i, :] = grad(y_flat, x, torch.ones_like(y_flat), create_graph=True)[0]
  File ".../anaconda3/envs/tf-gpu2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 202, in grad
    return Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 31.75 GiB total capacity; 28.32 GiB already allocated; 11.75 MiB free; 30.49 GiB reserved in total by PyTorch)

Can you please help?
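For reference, here is a minimal standalone sketch (not code from this repo; the plain MLP and the coordinate counts are arbitrary stand-ins for SIREN) suggesting that peak memory for this gradient pattern grows with the number of input coordinates rather than with the DataLoader batch size:

```python
# Standalone sketch (not from the siren repo): peak memory for gradients taken
# with create_graph=True grows with the number of input coordinates, not with
# the DataLoader batch size. The MLP below is an arbitrary stand-in for SIREN.
import torch
import torch.nn as nn

device = torch.device('cuda')
model = nn.Sequential(
    nn.Linear(2, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, 2),
).to(device)

for n_coords in (10_000, 100_000, 500_000):
    torch.cuda.reset_peak_memory_stats()
    coords = torch.rand(1, n_coords, 2, device=device, requires_grad=True)
    out = model(coords)
    # Same pattern as diff_operators.jacobian: grad with create_graph=True
    # keeps all intermediate activations alive for the later backward pass.
    g = torch.autograd.grad(out, coords, torch.ones_like(out), create_graph=True)[0]
    (g ** 2).mean().backward()
    print(f"{n_coords} coords -> peak {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```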

gabewb commented 3 years ago

Yeah, I'm likewise not seeing memory requirements scale down with lower batch sizes on some experiments. I run out of memory with batch size 1 on train_img.py (I have a 6GB GPU).

Batch-size memory scaling does work for train_sdf.py (point cloud): I'm able to get it under 6GB with a batch size of 100000.
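One possible workaround, sketched below, is to subsample the coordinate grid inside the dataset so that per-step memory scales with the number of sampled points rather than the full grid, which is the same idea the point-cloud dataset relies on. This is purely my own sketch, not an API from this repo; the class name and dict keys are hypothetical:

```python
# Hypothetical workaround (not from the repo): return a random subset of the
# coordinate grid each step, so memory scales with n_samples instead of the
# full grid that a DataLoader batch size of 1 still loads in one go.
import torch
from torch.utils.data import Dataset

class SubsampledCoords(Dataset):
    def __init__(self, coords, values, n_samples=10_000):
        self.coords = coords        # (N, 2) full coordinate grid
        self.values = values        # (N, C) ground-truth values at those coords
        self.n_samples = n_samples

    def __len__(self):
        return 1  # one random subset per training step

    def __getitem__(self, idx):
        sel = torch.randint(0, self.coords.shape[0], (self.n_samples,))
        return {'coords': self.coords[sel]}, {'img': self.values[sel]}
```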

pielbia commented 2 years ago

Same here. I have also tried reducing the batch size in train_inverse_helmholtz.py, to no avail. I'm likewise running a 32GB GPU and getting a CUDA out-of-memory error.

xefonon commented 2 years ago

Hi, I'm getting exactly the same problem. I tried using the Python garbage collector (i.e. gc.collect()) and torch.cuda.empty_cache(), but it still crashes with OOM. @vsitzmann, any suggestions?
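For what it's worth, I don't think gc.collect() or empty_cache() can help here: the memory is held by tensors that the autograd graph still references (because of the create_graph=True gradients), and empty_cache() only releases cached blocks that no live tensor is using. A small illustration, my own snippet with arbitrary sizes:

```python
# Illustration (my own snippet, arbitrary sizes): empty_cache() cannot free
# memory that is still referenced by the autograd graph, which is what the
# create_graph=True gradients in this experiment keep alive.
import gc
import torch

x = torch.rand(1, 2_000_000, 2, device='cuda', requires_grad=True)
y = (x.sin() * 30.0).sum()
g = torch.autograd.grad(y, x, create_graph=True)[0]   # graph stays referenced by g

gc.collect()
torch.cuda.empty_cache()
print('allocated:', torch.cuda.memory_allocated() // 2**20, 'MiB')  # still holds the graph
print('reserved: ', torch.cuda.memory_reserved() // 2**20, 'MiB')

del y, g                       # only when the graph is unreachable ...
gc.collect()
torch.cuda.empty_cache()
print('allocated:', torch.cuda.memory_allocated() // 2**20, 'MiB')  # ... does usage drop
```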