mmuckley / torchkbnufft

A high-level, easy-to-deploy non-uniform Fast Fourier Transform in PyTorch.
https://torchkbnufft.readthedocs.io/
MIT License

Performance degradation for CPU NUFFT with PyTorch 1.8 #25

Closed · mmuckley closed this issue 3 years ago

mmuckley commented 3 years ago

I am noticing a performance degradation with PyTorch 1.8 on the CPU on my home system (Windows 10, i5 8400, GTX 1660, torchkbnufft version 1.1.0). The GPU looks relatively unaffected. Details are below. I'm not sure why this is happening yet, but I will try to look into it. If anyone has any information, feel free to post on this issue.

PyTorch 1.8:

running profiler...
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: False, toep_mat: False, size_3d: None
forward average time: 2.0657340599999996, backward average time: 3.4234444799999992
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: False, toep_mat: True, size_3d: None
toeplitz forward/backward average time: 0.13343995500000005
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: True, toep_mat: False, size_3d: None
forward average time: 1.0262545000000016, backward average time: 1.0705226799999992
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: False, toep_mat: False, size_3d: None
GPU forward max memory: 0.159003136 GB, forward average time: 0.074529785, GPU adjoint max memory: 0.152530432 GB, backward average time: 0.0685140699999998
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: False, toep_mat: True, size_3d: None
GPU forward max memory: 0.114505216 GB, toeplitz forward/backward average time: 0.006467924000000096
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: True, toep_mat: False, size_3d: None
GPU forward max memory: 0.77268992 GB, forward average time: 0.20692490499999963, GPU adjoint max memory: 1.035167232 GB, backward average time: 0.2132972450000004

PyTorch 1.7.1:

running profiler...
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: False, toep_mat: False, size_3d: None
forward average time: 1.8955573599999997, backward average time: 1.6387825
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: False, toep_mat: True, size_3d: None
toeplitz forward/backward average time: 0.12237997000000007
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cpu, sparse_mats: True, toep_mat: False, size_3d: None
forward average time: 0.8352743000000004, backward average time: 1.01682184
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: False, toep_mat: False, size_3d: None
GPU forward max memory: 0.158736896 GB, forward average time: 0.07951689000000002, GPU adjoint max memory: 0.152530432 GB, backward average time: 0.06854967499999987
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: False, toep_mat: True, size_3d: None
GPU forward max memory: 0.114505216 GB, toeplitz forward/backward average time: 0.006591889999999978
im_size: (256, 256), spokelength: 512, num spokes: 405, ncoil: 15, batch_size: 1, device: cuda, sparse_mats: True, toep_mat: False, size_3d: None
GPU forward max memory: 0.77268992 GB, forward average time: 0.2121914199999999, GPU adjoint max memory: 1.035167232 GB, backward average time: 0.21654677000000006
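The per-call averages above come from the repository's profiling script, which is not shown here. A minimal sketch of how such wall-clock averages can be measured (a placeholder workload stands in for the actual NUFFT forward/adjoint calls, and the function names are hypothetical, not torchkbnufft's API):

```python
import time

def average_time(fn, num_trials=5, warmup=1):
    """Average wall-clock seconds per call to fn, discarding warmup runs."""
    for _ in range(warmup):
        fn()  # warmup runs excluded from timing
    start = time.perf_counter()
    for _ in range(num_trials):
        fn()
    return (time.perf_counter() - start) / num_trials

# Placeholder workload standing in for a NUFFT forward or adjoint call.
avg = average_time(lambda: sum(i * i for i in range(100_000)))
print(f"forward average time: {avg}")
```

Note that for GPU timings a harness like this would also need to synchronize the device before reading the clock, since CUDA kernels launch asynchronously.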
mmuckley commented 3 years ago

I identified the cause of this issue. It was the overhead of repeated calls to torch.set_num_threads, which are apparently more expensive in PyTorch 1.8.

I previously added those calls after observing that torchkbnufft wouldn't respect an OMP_NUM_THREADS environment variable. For example, on a system with 8 threads and OMP_NUM_THREADS set lower, torchkbnufft would still use all 8 threads unless torch.set_num_threads was called during the process forks. After removing the lines, the performance of the adjoint on CPU is much better. I don't see the oversubscription issue for the forward operation, but it remains for the adjoint, so we may need to do further work on adjoint threading.

I think I'm going to release version 1.2.0 of torchkbnufft for now to handle PyTorch 1.8 without regressions, and we can try to do more threading work for the adjoint in the future.