nxphi47 / tree_transformer

Submission to ICLR

lack of checkpoint files? #3

Open zxgx opened 4 years ago

zxgx commented 4 years ago

Hi, I'm trying to reproduce the results by following the instructions in README.md. After a lengthy data preprocessing step, I hit the following exception:

========== INFERENCE =================  
Traceback (most recent call last):  
  File "get_last_checkpoint.py", line 44, in <module>  
    args.dir, 1, False, upper_bound=None,  
  File "get_last_checkpoint.py", line 30, in last_n_checkpoint_index  
    raise Exception('Found {} checkpoint files but need at least {}', len(entries), n)  
Exception: ('Found {} checkpoint files but need at least {}', 0, 1)  
GEN_DIR = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000  
GEN_OUT = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/infer.avg10.b5.lp1  
AVG_NUM = 10  
LAST_EPOCH =   
AVG_CHECKPOINT_OUT = /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt  
---- Score by averaging last checkpoints 10 -> /home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt  
Generating average checkpoints...  
Namespace(checkpoint_upper_bound=100000000, ema='False', ema_decay=1.0, inputs=['/home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default'], num_epoch_checkpoints=10, num_update_checkpoints=None, output='/home/zhg/train_tree_transformer/nstack_merge_translate_ende_iwslt_32k/dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubsec_allcross_hier-transformer_base-b1024-gpu8-upfre1-0fp16-id50msp1024default/infer/test.tok.rmBpey.genout.de.b5.lenpen1.leftpadFalse.avg.avg10.e.u100000000/averaged_model.id1.avg10.e.u100000000.pt', user_dir='/home/zhg/tree_transformer')  
Traceback (most recent call last):  
  File "../scripts/average_checkpoints.py", line 186, in <module>  
    main()  
  File "../scripts/average_checkpoints.py", line 169, in main  
    args.inputs, num, is_update_based, upper_bound=args.checkpoint_upper_bound,  
  File "../scripts/average_checkpoints.py", line 117, in last_n_checkpoints  
    raise Exception('Found {} checkpoint files but need at least {}', len(entries), n)  
Exception: ('Found {} checkpoint files but need at least {}', 0, 10)

I suspect that the checkpoint files that should have been generated during training are missing. Could you please tell me how to resolve this?

liuqingpu commented 4 years ago

I have run into the same issue and have not been able to resolve it either. Did you manage to fix it?

zxgx commented 4 years ago

@liuqingpu no

liuqingpu commented 4 years ago

@liuqingpu no

It's difficult for me.

villmow commented 4 years ago

I am having trouble reproducing the results as well. Did anyone get it to work?

Exception: ('Found {} checkpoint files but need at least {}', 0, 1)

The error suggests that the script attempts inference even though there are no checkpoint files under the experiment directory. Have you managed to get training working?
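
To confirm whether training produced anything before inference runs, a check along these lines may help. This is a minimal sketch, not the repository's code: the function name `last_n_epoch_checkpoints` is hypothetical, and it assumes epoch checkpoints are saved as `checkpoint<N>.pt` in the experiment directory. It also formats the message, which the `raise Exception('Found {} ...', len(entries), n)` call in the traceback does not do (it passes the template and values as separate arguments, hence the tuple in the output).

```python
import os
import re

# Assumption: epoch checkpoints are named checkpoint<N>.pt (e.g. checkpoint12.pt).
EPOCH_CKPT = re.compile(r"checkpoint(\d+)\.pt")


def last_n_epoch_checkpoints(directory, n):
    """Return paths of the n most recent epoch checkpoints, newest first."""
    entries = []
    if os.path.isdir(directory):
        for name in os.listdir(directory):
            match = EPOCH_CKPT.fullmatch(name)
            if match:
                entries.append((int(match.group(1)), name))
    if len(entries) < n:
        # Format the message so the counts appear in the text itself.
        raise FileNotFoundError(
            "Found {} checkpoint files but need at least {} in {!r}; "
            "check that training ran and saved checkpoints".format(
                len(entries), n, directory
            )
        )
    entries.sort(reverse=True)  # highest epoch number first
    return [os.path.join(directory, name) for _, name in entries[:n]]
```

Calling this on the experiment directory before the averaging step would fail early with a readable message if training never wrote a checkpoint.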

Shikhar-S commented 3 years ago

I am also facing the same issue. The architecture dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier mentioned in README.md is not being registered as a fairseq architecture. Before the error reported by the OP, the code throws another error:


fairseq-train: error: argument --arch/-a: invalid choice: 'dwnstack_merge2seq_node_iwslt_onvalue_base_upmean_mean_mlesubenc_allcross_hier' (choose from 'fconv_lm', 
'fconv_lm_dauphin_wikitext103', 'fconv_lm_dauphin_gbw', 'fconv', 'fconv_iwslt_de_en', 'fconv_wmt_en_ro', 'fconv_wmt_en_de', 
'fconv_wmt_en_fr', 'fconv_self_att', 'fconv_self_att_wp', 'lightconv_lm', 'lightconv_lm_gbw', 'lightconv', 'lightconv_iwslt_de_en', 
'lightconv_wmt_en_de', 'lightconv_wmt_en_de_big', 'lightconv_wmt_en_fr_big', 'lightconv_wmt_zh_en_big', 'lstm', 
'lstm_wiseman_iwslt_de_en', 'lstm_luong_wmt_en_de', 'transformer_lm', 'transformer_lm_big', 'transformer_lm_wiki103', 
'transformer_lm_gbw', 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de', 
'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big', 
'transformer_wmt_en_de_big_t2t', 'multilingual_transformer', 'multilingual_transformer_iwslt_de_en')

The script then proceeds to inference even though no training took place and, with no valid checkpoint available, throws the OP's error. Looking at https://github.com/nxphi47/tree_transformer/blob/master/src/models/nstack_archs.py#L615 will probably help. @nxphi47 Could you please help with this?
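
The "invalid choice" error is consistent with how registry-based command-line options generally behave: the list of valid `--arch` values is built from whatever was registered at import time, so an architecture defined in a module that was never imported is not a valid choice. The sketch below illustrates that mechanism with the standard library only. It is a simplified stand-in, not fairseq's implementation; `ARCH_REGISTRY`, `register_arch`, `load_custom_archs`, and `my_tree_arch` are all hypothetical names. In fairseq itself, custom modules are normally pulled in with the `--user-dir` option, so it may be worth checking that the training command passes it and that the module imports without errors.

```python
import argparse
import contextlib
import io

# Hypothetical stand-in for a framework's architecture registry.
ARCH_REGISTRY = {}


def register_arch(name):
    """Decorator that records an architecture under the given name."""
    def decorator(fn):
        ARCH_REGISTRY[name] = fn
        return fn
    return decorator


@register_arch("transformer")
def transformer(args):
    args.layers = 6


def load_custom_archs():
    """Stand-in for importing a user module that registers extra architectures."""
    @register_arch("my_tree_arch")
    def my_tree_arch(args):
        args.layers = 6


def build_parser():
    # The valid choices are whatever has been registered when the parser is built.
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument("--arch", "-a", choices=sorted(ARCH_REGISTRY), required=True)
    return parser


def arch_is_valid(name):
    """Return True if the parser accepts the architecture name."""
    try:
        # argparse reports an invalid choice on stderr and raises SystemExit.
        with contextlib.redirect_stderr(io.StringIO()):
            build_parser().parse_args(["--arch", name])
        return True
    except SystemExit:
        return False
```

Here `arch_is_valid("my_tree_arch")` is false until `load_custom_archs()` has run, which mirrors a custom architecture being rejected when its defining module is never loaded.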

JieYang1020 commented 2 years ago

I have the same problem. Did anyone solve it?