Open x-zho14 opened 3 years ago
Hi, I am experimenting with the following code:
```python
import torch
from pytorch_block_sparse import BlockSparseLinear
import time
import sys

iter = int(sys.argv[1])
dsty = float(sys.argv[2])

fc = BlockSparseLinear(1024, 256, density=dsty)
fc_dense = torch.nn.Linear(1024, 256).cuda()
input = torch.ones(3, 1024).cuda()

i = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
t1 = time.time()
while i < iter:
    output = fc(input)
    i += 1
end.record()
t2 = time.time()
torch.cuda.synchronize()
print("cpu time:", t2 - t1)
print(start.elapsed_time(end))
print(torch.cuda.memory_summary())

i = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
t1 = time.time()
while i < iter:
    output = fc_dense(input)
    i += 1
end.record()
t2 = time.time()
torch.cuda.synchronize()
print("cpu time:", t2 - t1)
print(start.elapsed_time(end))
print(torch.cuda.memory_summary())
```
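For reference, here is a minimal, hedged sketch of how I understand wall-clock timing of a CUDA module should be done: because CUDA kernels launch asynchronously, `time.time()` read before `torch.cuda.synchronize()` mostly measures launch overhead, not kernel time. The `benchmark` helper below is hypothetical (not part of `pytorch_block_sparse`); it falls back to CPU so it also runs without a GPU.

```python
import time

import torch


def benchmark(module, x, iters=1000):
    """Time `iters` forward passes of `module`, synchronizing around the loop.

    Hypothetical helper for illustration; synchronization ensures the wall
    clock covers the kernels themselves, not just their launch calls.
    """
    # Warm up once so one-time initialization cost is not timed.
    module(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        module(x)
    # Wait for all queued kernels to finish before reading the clock.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t1 = time.time()
    return t1 - t0
```

With the modules from the snippet above, usage would be e.g. `benchmark(fc, input, iters=iter)` versus `benchmark(fc_dense, input, iters=iter)`.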
I find that the running time is reduced when the iteration count is small, but the memory consumption is not reduced.

sparse:
```
|===========================================================================|
|                 PyTorch CUDA memory summary, device ID 0                  |
|---------------------------------------------------------------------------|
|  CUDA OOMs: 0             |        cudaMalloc retries: 0                  |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|---------------------------------------------------------------------------|
| Active memory         |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |     800 KB |    2047 KB |    8080 KB |    7280 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |     800 KB |    2047 KB |    8080 KB |    7280 KB |
|---------------------------------------------------------------------------|
| Allocations           |         12 |         15 |       2066 |       2054 |
|       from large pool |          0 |          0 |          0 |          0 |
|       from small pool |         12 |         15 |       2066 |       2054 |
|---------------------------------------------------------------------------|
| Active allocs         |         12 |         15 |       2066 |       2054 |
|       from large pool |          0 |          0 |          0 |          0 |
|       from small pool |         12 |         15 |       2066 |       2054 |
|---------------------------------------------------------------------------|
| GPU reserved segments |          1 |          1 |          1 |          0 |
|       from large pool |          0 |          0 |          0 |          0 |
|       from small pool |          1 |          1 |          1 |          0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs |          5 |          5 |       1033 |       1028 |
|       from large pool |          0 |          0 |          0 |          0 |
|       from small pool |          5 |          5 |       1033 |       1028 |
|===========================================================================|
```
dense:
```
|===========================================================================|
|                 PyTorch CUDA memory summary, device ID 0                  |
|---------------------------------------------------------------------------|
|  CUDA OOMs: 0             |        cudaMalloc retries: 0                  |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|---------------------------------------------------------------------------|
| Active memory         |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |     800 KB |    2047 KB |    5080 KB |    4280 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |     800 KB |    2047 KB |    5080 KB |    4280 KB |
|---------------------------------------------------------------------------|
| Allocations           |         12 |         15 |       1066 |       1054 |
|       from large pool |          0 |          0 |          0 |          0 |
|       from small pool |         12 |         15 |       1066 |       1054 |
|---------------------------------------------------------------------------|
| Active allocs         |         12 |         15 |       1066 |       1054 |
|       from large pool |          0 |          0 |          0 |          0 |
|       from small pool |         12 |         15 |       1066 |       1054 |
|---------------------------------------------------------------------------|
| GPU reserved segments |          1 |          1 |          1 |          0 |
|       from large pool |          0 |          0 |          0 |          0 |
|       from small pool |          1 |          1 |          1 |          0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs |          5 |          5 |        533 |        528 |
|       from large pool |          0 |          0 |          0 |          0 |
|       from small pool |          5 |          5 |        533 |        528 |
|===========================================================================|
```
Could you please help me find the problem? The total allocated memory is actually even higher for the sparse layer. Thanks in advance.