Closed by skdbsxir 11 months ago
Masking does not appear to be applied correctly; needs fixing (outputs[0][0]):
tensor([-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -9999., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -9999., -9999., -9999.,
-9999., -9999., -9999., -9999., -9999., -10000., -9999., -9999.,
-9999., -9999., -9999., -9999., -9999., -9999., -9999., -10000.,
-10000., -9999., -9999., -9999., -9999., -9999., -9999., -9999.,
-9999., -9999., -9999., -9999., -9999., -9999., -9999., -9999.,
-9999., -9999., -9999., -9999., -10000., -9999., -9999., -9999.,
-9999., -9999., -9999., -9999., -9999., -9999., -9999., -9999.,
-9999., -9999., -9999., -10000., -9999., -9999., -9999., -10000.,
-9999., -9999., -9999., -9999., -9999., -9999., -9999., -9999.,
-9999., -9999., -9999., -9999., -9999., -9999., -9999., -9999.,
-9999., -9999., -9999., -9999., -9999., -9999., -9999., -9999.,
-9999., -9999., -9999., -9999., -9999., -9999., -9999., -10000.,
-9999., -9999., -9999., -9999., -9999., -9999., -9999., -9999.,
-10000., -9999., -10000., -9999., -9999., -9999., -9999., -9999.,
-9999., -10000., -9999., -10000., -9999., -9999., -9999., -9999.,
-9999., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000., -10000., -10000., -10000., -10000., -10000., -10000.,
-10000., -10000.]
Memory $\rightarrow$ if a mask is created and then passed along, delete it afterwards.
230926: done.
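The memory note above could look roughly like this in a loop. This is only a sketch under assumptions: `run_batches` and its `score_batches` argument are hypothetical names, not code from this repository, and the mask is assumed to be needed for a single forward pass only.

```python
import torch

def run_batches(score_batches):
    """Apply a freshly built causal mask to each batch of attention scores."""
    results = []
    for scores in score_batches:              # scores: [batch, len, len]
        idx = torch.arange(scores.size(-1), device=scores.device)
        mask = idx[None, :] > idx[:, None]    # True where column > row (future position)
        attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        results.append(attn)
        del mask                              # drop the reference before the next iteration
    return results
```

Note that `del` only removes the name; the tensor is freed once nothing else (for example the autograd graph) still refers to it.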
1. Masking
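The code that currently generates the mask is as follows (model_utils.py):

    import numpy as np
    import torch

    def generate_attn_subsequent_mask(seq):
        """ Generate decoder's mask """
        attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
        # Entries above the diagonal (k=1) are 1 and mark future positions
        subsequent_mask = np.triu(np.ones(attn_shape), k=1)
        subsequent_mask = torch.from_numpy(subsequent_mask).byte()
        return subsequent_mask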
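For comparison, here is a torch-only sketch of the same mask. `np.ones` allocates a float64 array of shape [batch, len, len] in host memory on every call, which the variant below avoids by building one boolean [len, len] mask directly on the input's device. The name `generate_attn_subsequent_mask_torch` is hypothetical, and whether this matches what the rest of the model expects (dtype, device) is an assumption.

```python
import torch

def generate_attn_subsequent_mask_torch(seq):
    """Hypothetical torch-only variant of generate_attn_subsequent_mask.

    Returns a bool tensor of shape [batch, len, len] on seq's device,
    where True marks a future position (column index > row index).
    """
    batch, length = seq.size(0), seq.size(1)
    idx = torch.arange(length, device=seq.device)
    mask = idx[None, :] > idx[:, None]        # strictly upper-triangular, like np.triu(..., k=1)
    # expand() returns a view, so the batch dimension costs no extra memory
    return mask.unsqueeze(0).expand(batch, length, length)
```

A boolean mask of this kind is typically applied to the attention scores before the softmax, for example with `scores.masked_fill(mask, float("-inf"))`, rather than to the layer outputs.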
best model saved: step = 399 epoch = 0 dev RMSE = 9862.3994140625 dev MAE = 9727.30859375
[Validation Results] Global Steps: 399 Epoch: 0 Valid Loss: 100062520.00000 Valid RMSE: 9862.39941 Valid MAE: 9727.30859 time stamp: 20.265928745269775
[Validation Results] Global Steps: 799 Epoch: 0 Valid Loss: 100062544.00000 Valid RMSE: 9862.53613 Valid MAE: 9727.50098 time stamp: 39.78810000419617
Epoch 0 Finished (Average Loss: 100063112.0000)
[Validation Results] Global Steps: 399 Epoch: 1 Valid Loss: 100062528.00000 Valid RMSE: 9862.92578 Valid MAE: 9728.29199 time stamp: 62.101946115493774
best model saved: step = 799 epoch = 1 dev RMSE = 9862.3408203125 dev MAE = 9727.2392578125
[Validation Results] Global Steps: 799 Epoch: 1 Valid Loss: 100062512.00000 Valid RMSE: 9862.34082 Valid MAE: 9727.23926 time stamp: 82.99191665649414
Epoch 1 Finished (Average Loss: 100062744.0000)
best model saved: step = 399 epoch = 2 dev RMSE = 9862.1259765625 dev MAE = 9726.84375
[Validation Results] Global Steps: 399 Epoch: 2 Valid Loss: 100062552.00000 Valid RMSE: 9862.12598 Valid MAE: 9726.84375 time stamp: 105.23369526863098
[Validation Results] Global Steps: 799 Epoch: 2 Valid Loss: 100062528.00000 Valid RMSE: 9862.57031 Valid MAE: 9727.60156 time stamp: 125.98348832130432
Epoch 2 Finished (Average Loss: 100060856.0000)
[Validation Results] Global Steps: 399 Epoch: 3 Valid Loss: 100062528.00000 Valid RMSE: 9862.23828 Valid MAE: 9727.10840 time stamp: 148.43129086494446
[Validation Results] Global Steps: 799 Epoch: 3 Valid Loss: 100062544.00000 Valid RMSE: 9862.40625 Valid MAE: 9727.20801 time stamp: 169.21785831451416
Epoch 3 Finished (Average Loss: 100062816.0000)
best model saved: step = 399 epoch = 0 dev RMSE = 3.044887065887451 dev MAE = 2.874655246734619
[Validation Results] Global Steps: 399 Epoch: 0 Valid Loss: 1.85053 Valid RMSE: 3.04489 Valid MAE: 2.87466 time stamp: 13.912893056869507
best model saved: step = 799 epoch = 0 dev RMSE = 1.6691110134124756 dev MAE = 1.5905847549438477
[Validation Results] Global Steps: 799 Epoch: 0 Valid Loss: 3.05115 Valid RMSE: 1.66911 Valid MAE: 1.59058 time stamp: 27.654117107391357
Epoch 0 Finished (Average Loss: 3.6482)
[Validation Results] Global Steps: 399 Epoch: 1 Valid Loss: 1.59317 Valid RMSE: 2.51561 Valid MAE: 2.41958 time stamp: 42.38446760177612
[Validation Results] Global Steps: 799 Epoch: 1 Valid Loss: 2.18622 Valid RMSE: 2.14294 Valid MAE: 2.02721 time stamp: 55.99804997444153
Epoch 1 Finished (Average Loss: 2.8335)
[Validation Results] Global Steps: 399 Epoch: 2 Valid Loss: 2.68127 Valid RMSE: 2.21093 Valid MAE: 2.03893 time stamp: 70.67771553993225
[Validation Results] Global Steps: 799 Epoch: 2 Valid Loss: 3.33464 Valid RMSE: 4.01860 Valid MAE: 3.82281 time stamp: 84.29553318023682
Epoch 2 Finished (Average Loss: 2.5757)
[Validation Results] Global Steps: 399 Epoch: 3 Valid Loss: 1.68288 Valid RMSE: 2.70269 Valid MAE: 2.58354 time stamp: 98.95309257507324
[Validation Results] Global Steps: 799 Epoch: 3 Valid Loss: 1.96136 Valid RMSE: 2.66983 Valid MAE: 2.49517 time stamp: 112.45902109146118
Epoch 3 Finished (Average Loss: 2.4113)
In main.py the model is loaded onto the GPU correctly, and during training the batch data returned by the DataLoader is also loaded onto the GPU correctly. However, Killed occurs.
With the decoder-side masking removed, Killed at epoch 96/100.
With both encoder and decoder masking in use, Killed at epoch 11/100.
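A bare Killed message on Linux usually comes from the kernel's out-of-memory killer acting on host RAM rather than from a CUDA error, so one way to narrow this down is to log the process's peak resident memory once per epoch and check whether it keeps growing. The sketch below uses only the standard library. `peak_rss_mb` and `log_epoch_memory` are hypothetical helpers, not part of this repository, and the unit conversion assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process in MiB (Linux reports ru_maxrss in KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def log_epoch_memory(epoch, history):
    """Record the peak RSS after an epoch and report the growth since the previous one."""
    current = peak_rss_mb()
    growth = current - history[-1] if history else 0.0
    history.append(current)
    print(f"Epoch {epoch}: peak RSS {current:.1f} MiB (+{growth:.1f} MiB)")
    return growth
```

Calling `log_epoch_memory(epoch, history)` at the end of each epoch with a shared `history` list would show whether host memory grows faster when both masks are enabled, which would be consistent with the earlier failure at epoch 11.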