lanyunshi / ConversationalKBQA


About Train problems #12

Closed Bigchen8013 closed 2 years ago

Bigchen8013 commented 2 years ago

Hi

When I train following your instructions, all the evaluation metrics are 0, as shown below:

```
Test: 100%|██████████| 10/10 [00:00<00:00, 10024.63it/s]
06/03/2022 10:21:54 - INFO - __main__ - ***** Eval results (99) *****
06/03/2022 10:21:54 - INFO - __main__ - dev reward=(0.0, 0.0)
06/03/2022 10:21:54 - INFO - __main__ - dev te reward=(0.0, 0.0)
06/03/2022 10:21:54 - INFO - __main__ - test reward=(0.0, 0.0)
06/03/2022 10:21:54 - INFO - __main__ - training loss=0.0
06/03/2022 10:21:54 - INFO - __main__ - training reward=(0.0, 0.0)
06/03/2022 10:21:54 - INFO - __main__ - training te loss=0.0
06/03/2022 10:21:54 - INFO - __main__ - training te reward=(0.0, 0.0)
Epoch: 100%|██████████| 100/100 [03:53<00:00, 2.33s/it]
```

Training also finishes within a few minutes, and the files in the cache directory are almost empty. Do you have a solution?

HHL-445 commented 2 years ago

In the `retrieve_KB` function, `not_update` is set to `args.do_train`, and that value is passed along to `retrieve_via_frontier`, which `retrieve_KB` calls. During training, however, `args.do_train` is 1, while `retrieve_via_frontier` needs `not_update` to be `False`. After adding `not_update = False` at the top of `retrieve_via_frontier`, I can compute the loss successfully. You can try that.
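A minimal sketch of the flag-propagation issue described above. The function names (`retrieve_KB`, `retrieve_via_frontier`) come from `ConversationKBQA_Runner.py`, but the bodies here are simplified placeholders, not the repo's real implementation:

```python
# Sketch of the bug: during training, args.do_train == 1 flows into
# not_update, which silently disables subgraph retrieval downstream.

def retrieve_via_frontier(frontier, not_update):
    # Hypothetical fix from the comment above: force retrieval during
    # training by overriding the flag inherited from args.do_train.
    not_update = False
    if not_update:
        return []  # skipping retrieval leaves the subgraph empty -> reward 0
    return [f"path_from_{frontier}"]  # placeholder candidate paths

def retrieve_KB(frontier, do_train):
    # do_train is 1 during training, so not_update is truthy here,
    # which would disable retrieval without the override above.
    not_update = do_train
    return retrieve_via_frontier(frontier, not_update)

print(retrieve_KB("entity", do_train=1))  # non-empty with the override in place
```

With the override in place, candidate paths are retrieved even when `do_train` is 1, so the loss and rewards are no longer uniformly zero.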

Bigchen8013 commented 2 years ago

Thank you very much for your reply, I will try it first.

Bigchen8013 commented 2 years ago

Sorry to bother you again.

I followed your suggestion and added `not_update = False` in the `retrieve_KB` function, before the call to `retrieve_via_frontier`, but the following error occurs:

```
Traceback (most recent call last):
  File "G:/python_projects/ConversationalKBQA-master/mycode/ConversationKBQA_Runner.py", line 1011, in <module>
    main()
  File "G:/python_projects/ConversationalKBQA-master/mycode/ConversationKBQA_Runner.py", line 768, in main
    te_loss = update_policy_immediately(_te_loss, optimizer)
  File "G:/python_projects/ConversationalKBQA-master/mycode/ConversationKBQA_Runner.py", line 593, in update_policy_immediately
    adjust_loss.backward(retain_graph=True)
  File "F:\ananconda\envs\py37\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "F:\ananconda\envs\py37\lib\site-packages\torch\autograd\__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1200]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```

When I delete the `not_update = False` line, no error is reported, but all the metrics are 0. Do you know why? Looking forward to your reply, and thank you very much.
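For context, that `RuntimeError` is PyTorch's autograd version-counter check and is not specific to this repo: `backward()` needs a tensor that was saved during the forward pass, but something modified that tensor in place afterwards. A minimal, repo-independent reproduction (assuming PyTorch is installed):

```python
import torch

# Minimal reproduction of the "modified by an inplace operation" error.
# exp() saves its output for the backward pass; mutating that output in
# place bumps its version counter, so backward() detects the mismatch.
x = torch.ones(3, requires_grad=True)
y = x.exp()
y.add_(1)                      # in-place op on a tensor autograd saved
try:
    y.sum().backward()
    raised = False
except RuntimeError:
    raised = True
print("version-mismatch error raised:", raised)

# Out-of-place fix: build a new tensor instead of mutating the saved one.
z = x.exp()
z = z + 1                      # fresh tensor; the saved output is untouched
z.sum().backward()             # succeeds
print("grad after fix:", x.grad)
```

In the repo's case, the fix would mean finding which tensor used by `adjust_loss.backward(retain_graph=True)` is being mutated in place between backward calls and replacing that mutation with an out-of-place operation (or a `.clone()`).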

HHL-445 commented 2 years ago

@Bigchen8013 Add `not_update = False` inside the `retrieve_via_frontier` function, not in the `retrieve_KB` function. Try that; it works on my side.

HHL-445 commented 2 years ago

@Bigchen8013 The root problem is that `retrieve_via_frontier` does not retrieve any subgraph candidate paths, so the `SQL_1hop_interaction` method inside it is never executed. Try stepping through it in a debugger.