Closed skdbsxir closed 1 year ago
Currently `config.py` has no `num_train_steps`, so it is computed manually → and passed to the LRScheduler.

With `num_epochs=1` this is fine, but a larger value raises a `ValueError` (observed with `num_epochs=10`):

```
ValueError: Tried to step 92 times. The specified number of total steps is 90
```

With `num_epochs=1` the output is as follows:
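That error message matches what a step-limited scheduler such as PyTorch's `OneCycleLR` raises when `scheduler.step()` is called more times than its `total_steps`. A minimal sketch of deriving the total from the loader length instead of hard-coding it (the helper name and the gradient-accumulation argument are assumptions, not this project's actual code):

```python
import math

def compute_num_train_steps(num_batches: int, num_epochs: int,
                            grad_accum_steps: int = 1) -> int:
    """Total optimizer steps over the whole run.

    If scheduler.step() is called once per optimizer step, a
    step-limited scheduler's total_steps must be at least this value,
    otherwise it raises "Tried to step N times. The specified number
    of total steps is M".
    """
    steps_per_epoch = math.ceil(num_batches / grad_accum_steps)
    return steps_per_epoch * num_epochs

# Example: 9 optimizer steps per epoch over 10 epochs -> 90 total steps.
# If scheduler.step() is additionally called elsewhere (e.g. during
# validation, or once extra per epoch), the count overshoots the total
# (92 > 90 here) and the ValueError above fires.
print(compute_num_train_steps(9, 10))  # 90
```

With gradient accumulation the per-epoch step count shrinks accordingly, e.g. `compute_num_train_steps(10, 10, grad_accum_steps=2)` gives 50.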
```
09/15/2023 17:09:02 - INFO - __main__ - ***** Running training *****
09/15/2023 17:09:02 - INFO - __main__ - Total steps = 893
Training (13 / 14 Steps) (loss=98363744.00000): 100%|| 14/14 [00:01<00:00, 13.35it/s]
total training time (s): 1.0485410690307617
total training time (ms): 0
peak memory usage (MB): 480
total memory usage (MB): 26333
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  123231 KB |  491797 KB |   26333 MB |   26213 MB |
|       from large pool |  118461 KB |  468602 KB |   24797 MB |   24681 MB |
|       from small pool |    4770 KB |   28824 KB |    1536 MB |    1531 MB |
|---------------------------------------------------------------------------|
| Active memory         |  123231 KB |  491797 KB |   26333 MB |   26213 MB |
|       from large pool |  118461 KB |  468602 KB |   24797 MB |   24681 MB |
|       from small pool |    4770 KB |   28824 KB |    1536 MB |    1531 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  577536 KB |  577536 KB |  577536 KB |        0 B |
|       from large pool |  546816 KB |  546816 KB |  546816 KB |        0 B |
|       from small pool |   30720 KB |   30720 KB |   30720 KB |        0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory |  126624 KB |  149264 KB |   16820 MB |   16697 MB |
|       from large pool |  112962 KB |  142768 KB |   15204 MB |   15094 MB |
|       from small pool |   13662 KB |   15786 KB |    1615 MB |    1602 MB |
|---------------------------------------------------------------------------|
| Allocations           |        462 |        608 |      13473 |      13011 |
|       from large pool |         11 |         66 |       3828 |       3817 |
|       from small pool |        451 |        542 |       9645 |       9194 |
|---------------------------------------------------------------------------|
| Active allocs         |        462 |        608 |      13473 |      13011 |
|       from large pool |         11 |         66 |       3828 |       3817 |
|       from small pool |        451 |        542 |       9645 |       9194 |
|---------------------------------------------------------------------------|
| GPU reserved segments |         33 |         33 |         33 |          0 |
|       from large pool |         18 |         18 |         18 |          0 |
|       from small pool |         15 |         15 |         15 |          0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs |         18 |         39 |       5265 |       5247 |
|       from large pool |          8 |         19 |       2077 |       2069 |
|       from small pool |         10 |         25 |       3188 |       3178 |
|---------------------------------------------------------------------------|
| Oversize allocations  |          0 |          0 |          0 |          0 |
|---------------------------------------------------------------------------|
| Oversize GPU segments |          0 |          0 |          0 |          0 |
|===========================================================================|
[Evaluation Results] Loss: 97491072.00000 RMSE: 9873.67773 MAE: 9748.99805
total eval time: 49.589630126953125
peak memory usage (MB): 480
all memory usage (MB): 28268
```
Need to check how steps are currently counted inside the train loop.
Also, with the scheduler commented out, increasing `num_epochs` makes the tqdm iterator stop updating. The cause needs to be identified.
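One common cause of a frozen tqdm bar across epochs is wrapping the DataLoader in `tqdm` once outside the epoch loop; the bar finishes after the first epoch and never redraws. A minimal sketch of the loop shape that avoids this (the function name and the counter are hypothetical, not this repo's actual `train()`):

```python
from tqdm import tqdm

def run_epochs(dataloader, num_epochs):
    """Skeleton of a training loop, showing only the tqdm handling;
    the real loop would do model/optimizer work per batch."""
    batches_seen = 0
    for epoch in range(num_epochs):
        # Re-create the tqdm wrapper inside the epoch loop. A progress
        # bar built once outside the loop completes (and stops
        # redrawing) after the first epoch, so later epochs look stuck.
        progress = tqdm(dataloader,
                        desc=f"Epoch {epoch + 1}/{num_epochs}",
                        disable=True)  # disable only so this runs headless
        for batch in progress:
            batches_seen += 1
    return batches_seen

# With a 14-batch loader and 10 epochs, every epoch iterates all 14 batches.
print(run_epochs(list(range(14)), 10))  # 140
```

If the bar is intentionally shared across epochs, it needs `total=steps_per_epoch * num_epochs` and explicit `progress.update(1)` calls instead; either pattern works, but mixing the two is what tends to produce a stalled display.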
230918: resolved.
`valid()`, `eval()`, `train()`, `tqdm`