Closed skdbsxir closed 1 year ago
Currently `config.py` has no `num_train_steps`, so it is computed manually → and passed to the LRScheduler.

With `num_epochs=1` this is fine, but a larger value raises a `ValueError` (observed with `num_epochs=10`):

```
ValueError: Tried to step 92 times. The specified number of total steps is 90
```

With `num_epochs=1` the output is as follows:
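That error message matches what a step-limited scheduler such as PyTorch's `OneCycleLR` raises when `scheduler.step()` is called more times than its `total_steps`. A minimal sketch of deriving the total from the loader length instead of hard-coding it (the helper name and the gradient-accumulation argument are assumptions, not this project's actual code):

```python
import math

def compute_num_train_steps(num_batches: int, num_epochs: int,
                            grad_accum_steps: int = 1) -> int:
    """Total optimizer steps over the whole run.

    If scheduler.step() is called once per optimizer step, a
    step-limited scheduler's total_steps must be at least this value,
    otherwise it raises "Tried to step N times. The specified number
    of total steps is M".
    """
    steps_per_epoch = math.ceil(num_batches / grad_accum_steps)
    return steps_per_epoch * num_epochs

# Example: 9 optimizer steps per epoch over 10 epochs -> 90 total steps.
# If scheduler.step() is additionally called elsewhere (e.g. during
# validation, or once extra per epoch), the count overshoots the total
# (92 > 90 here) and the ValueError above fires.
print(compute_num_train_steps(9, 10))  # 90
```

With gradient accumulation the per-epoch step count shrinks accordingly, e.g. `compute_num_train_steps(10, 10, grad_accum_steps=2)` gives 50.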
```
09/15/2023 17:09:02 - INFO - __main__ - ***** Running training *****
09/15/2023 17:09:02 - INFO - __main__ - Total steps = 893
Training (13 / 14 Steps) (loss=98363744.00000): 100%|| 14/14 [00:01<00:00, 13.35it/s]
total training time (s): 1.0485410690307617
total training time (ms): 0
peak memory usage (MB): 480
total memory usage (MB): 26333
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  123231 KB |  491797 KB |   26333 MB |   26213 MB |
|       from large pool |  118461 KB |  468602 KB |   24797 MB |   24681 MB |
|       from small pool |    4770 KB |   28824 KB |    1536 MB |    1531 MB |
|---------------------------------------------------------------------------|
| Active memory         |  123231 KB |  491797 KB |   26333 MB |   26213 MB |
|       from large pool |  118461 KB |  468602 KB |   24797 MB |   24681 MB |
|       from small pool |    4770 KB |   28824 KB |    1536 MB |    1531 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  577536 KB |  577536 KB |  577536 KB |        0 B |
|       from large pool |  546816 KB |  546816 KB |  546816 KB |        0 B |
|       from small pool |   30720 KB |   30720 KB |   30720 KB |        0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory |  126624 KB |  149264 KB |   16820 MB |   16697 MB |
|       from large pool |  112962 KB |  142768 KB |   15204 MB |   15094 MB |
|       from small pool |   13662 KB |   15786 KB |    1615 MB |    1602 MB |
|---------------------------------------------------------------------------|
| Allocations           |        462 |        608 |      13473 |      13011 |
|       from large pool |         11 |         66 |       3828 |       3817 |
|       from small pool |        451 |        542 |       9645 |       9194 |
|---------------------------------------------------------------------------|
| Active allocs         |        462 |        608 |      13473 |      13011 |
|       from large pool |         11 |         66 |       3828 |       3817 |
|       from small pool |        451 |        542 |       9645 |       9194 |
|---------------------------------------------------------------------------|
| GPU reserved segments |         33 |         33 |         33 |          0 |
|       from large pool |         18 |         18 |         18 |          0 |
|       from small pool |         15 |         15 |         15 |          0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs |         18 |         39 |       5265 |       5247 |
|       from large pool |          8 |         19 |       2077 |       2069 |
|       from small pool |         10 |         25 |       3188 |       3178 |
|---------------------------------------------------------------------------|
| Oversize allocations  |          0 |          0 |          0 |          0 |
|---------------------------------------------------------------------------|
| Oversize GPU segments |          0 |          0 |          0 |          0 |
|===========================================================================|
[Evaluation Results] Loss: 97491072.00000 RMSE: 9873.67773 MAE: 9748.99805
total eval time: 49.589630126953125
peak memory usage (MB): 480
all memory usage (MB): 28268
```
Need to check how steps are currently counted inside the train loop.
Also, with the scheduler commented out, increasing `num_epochs` makes the tqdm iterator stop updating. The cause needs to be identified.
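One common cause of a frozen tqdm bar across epochs is wrapping the DataLoader in `tqdm` once outside the epoch loop; the bar finishes after the first epoch and never redraws. A minimal sketch of the loop shape that avoids this (the function name and the counter are hypothetical, not this repo's actual `train()`):

```python
from tqdm import tqdm

def run_epochs(dataloader, num_epochs):
    """Skeleton of a training loop, showing only the tqdm handling;
    the real loop would do model/optimizer work per batch."""
    batches_seen = 0
    for epoch in range(num_epochs):
        # Re-create the tqdm wrapper inside the epoch loop. A progress
        # bar built once outside the loop completes (and stops
        # redrawing) after the first epoch, so later epochs look stuck.
        progress = tqdm(dataloader,
                        desc=f"Epoch {epoch + 1}/{num_epochs}",
                        disable=True)  # disable only so this runs headless
        for batch in progress:
            batches_seen += 1
    return batches_seen

# With a 14-batch loader and 10 epochs, every epoch iterates all 14 batches.
print(run_epochs(list(range(14)), 10))  # 140
```

If the bar is intentionally shared across epochs, it needs `total=steps_per_epoch * num_epochs` and explicit `progress.update(1)` calls instead; either pattern works, but mixing the two is what tends to produce a stalled display.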
230918: resolved.
`valid()`, `eval()`, `train()`, `tqdm`