NaN Loss and Accuracy During Pre-training of the Chipformer Model

tomqingo commented 9 months ago

Hello, I have run into a critical problem while pre-training the Chipformer model. When I use adaptec1_small.pkl as the training set and run "python3 run_dt_place.py" to start training, the reported training loss and accuracy are both NaN, and the reward sum decreases. This is clearly abnormal. I have made no changes to the original code and use the data from https://drive.google.com/drive/folders/1F7075SvjccYk97i2UWhahN_9krBvDCmr. Could you help me identify the cause?

epoch 40 iter 15: train loss nan. lr 9.241834e-05. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.39it/s]
epoch 41 iter 15: train loss nan. lr 6.000000e-05. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.48it/s]
epoch 42 iter 15: train loss nan. lr 6.000000e-05. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.49it/s]
epoch 43 iter 15: train loss nan. lr 8.786797e-05. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.37it/s]
epoch 44 iter 15: train loss nan. lr 2.244066e-04. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.46it/s]
epoch 45 iter 15: train loss nan. lr 3.817385e-04. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.31it/s]
epoch 46 iter 15: train loss nan. lr 5.165868e-04. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.36it/s]
epoch 47 iter 15: train loss nan. lr 5.918590e-04. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.55it/s]
epoch 48 iter 15: train loss nan. lr 5.868500e-04. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.29it/s]
epoch 49 iter 15: train loss nan. lr 5.029378e-04. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.45it/s]
epoch 50 iter 15: train loss nan. lr 3.632038e-04. acc nan: 100%|██████████| 16/16 [00:06<00:00, 2.40it/s]
len self.net_min_max_ord 69 T_rewards [-29786.0] T_scores [-2.255142857142857]

Thank you!
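
To pin down where the NaN first appears, it can help to scan the full training log instead of watching the progress bar. The following is a minimal sketch, assuming the log keeps the "epoch N iter M: train loss X. lr Y. acc Z" format shown above. `first_nan_epoch` is a hypothetical helper, not part of the chipformer repo, and the values in `sample` are made up for illustration.

```python
import math
import re

# Matches entries such as
# "epoch 40 iter 15: train loss nan. lr 9.241834e-05. acc nan: 100%|..."
# The format is taken from the log in this issue; the helper is hypothetical.
LOG_ENTRY = re.compile(
    r"epoch (\d+) iter (\d+): train loss (\S+?)\. lr (\S+?)\. acc ([^\s:]+)"
)


def first_nan_epoch(log_text):
    """Return the first epoch whose loss or accuracy is NaN, or None."""
    for match in LOG_ENTRY.finditer(log_text):
        epoch = int(match.group(1))
        loss = float(match.group(3))  # float("nan") parses to NaN
        acc = float(match.group(5))
        if math.isnan(loss) or math.isnan(acc):
            return epoch
    return None


# Illustrative log text (made-up values, same format as the real log).
sample = (
    "epoch 1 iter 15: train loss 2.31250. lr 3.000000e-04. acc 0.41: 100%| 16/16\n"
    "epoch 2 iter 15: train loss nan. lr 6.000000e-05. acc nan: 100%| 16/16\n"
)
print(first_nan_epoch(sample))
```

Knowing whether the loss is NaN from the very first epoch or only after some number of updates narrows down whether the input data or the optimization is the more likely source.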

AmingWu commented 8 months ago

@tomqingo, I ran into the same problem. Have you resolved it? Also, how long do you need to train?

tomqingo commented 8 months ago

> @tomqingo, I ran into the same problem. Have you resolved it? Also, how long do you need to train?

I have not resolved this problem. Since other people are running into the same problem, I am sure this is a bug in the released code! Let's wait and see whether the author fixes it!

laiyao1 commented 8 months ago

Dear tomqingo,

Thank you very much for your question.

Actually, I re-ran the code many times and still cannot reproduce the error, even after more than 200 epochs.

However, I have updated the adaptec1_small.pkl dataset in the repo directly. You can unzip it and use it as is.

[Image: WechatIMG25]
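
One way to rule out the dataset as the source of the NaN values is to scan the unpickled object for non-finite numbers before training. This is only a sketch: the internal layout of adaptec1_small.pkl is not described in this thread, so `count_non_finite` (a hypothetical helper) simply walks whatever nested dicts, lists, tuples, and array-like objects it finds.

```python
import math
import pickle


def count_non_finite(obj):
    """Recursively count NaN/inf floats in nested containers."""
    # Arrays and tensors (e.g. NumPy, PyTorch) expose tolist(); convert them
    # to plain Python values so no third-party import is needed here.
    if hasattr(obj, "tolist"):
        obj = obj.tolist()
    if isinstance(obj, float):
        return 0 if math.isfinite(obj) else 1
    if isinstance(obj, dict):
        return sum(count_non_finite(v) for v in obj.values())
    if isinstance(obj, (list, tuple, set)):
        return sum(count_non_finite(v) for v in obj)
    # Ints, strings, None, and custom objects are not inspected.
    return 0


def check_pickle(path):
    """Load a pickle file and return how many non-finite floats it holds."""
    with open(path, "rb") as f:
        data = pickle.load(f)  # only unpickle files from a source you trust
    return count_non_finite(data)


# Example call (the path is an assumption; adjust it to your checkout):
# print(check_pickle("adaptec1_small.pkl"))
```

A result of 0 does not prove the data is fine, since custom objects are skipped, but a non-zero count would point at the dataset rather than the training code.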

tomqingo commented 8 months ago

Hello @laiyao1 ,

Thank you for helping investigate this issue.

I have tried unzipping and using the adaptec1_small.pkl dataset directly. However, the issue persists. I have seen two more people who encountered the same problem. I suspect it may be caused by a difference in the Python (PyTorch) environment. Could you please provide a requirements.txt file or a Docker image containing the necessary packages, so that the results can be reproduced more reliably?

Thanks!
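
To make environment differences easier to compare, each person hitting the problem could post the output of a small version report. Below is a minimal sketch using only the standard library. `environment_report` is a hypothetical helper, and the default package names ("torch", "numpy") are assumptions about which dependencies matter here.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("torch", "numpy")):
    """Collect interpreter, OS, and package versions for a bug report."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Comparing this output between a machine that trains normally and one that produces NaN would show whether the two environments actually differ.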

laiyao1 commented 8 months ago

Dear tomqingo,

Thank you for your question. I have updated the requirements.txt.

Zl1-1 commented 4 months ago

I ran into the same problem. Have you resolved it?

laiyao1 commented 4 months ago

Hi Zl1-1,

Thank you very much for your question. You can use the requirements.txt to install our environment. You can also check this issue: https://github.com/laiyao1/chipformer/issues/4.
