Error when training model

CheungBH commented 3 years ago

Hello. Thanks for your work. I am trying to train a GCN model using the command below, but an error occurs.

python3 run_baseline.py --note pretrain --dropout 0 --lr 2e-2 --epochs 100 --posenet_name 'gcn' --checkpoint './checkpoint/pretrainbaseline' --keypoints gt

Traceback (most recent call last):
  File "run_baseline.py", line 102, in <module>
    main(args)
  File "run_baseline.py", line 65, in main
    glob_step, args.lr_decay, args.lr_gamma, max_norm=args.max_norm)
  File "/home/hkuit155/Documents/PoseAug/function_baseline/model_pos_train.py", line 41, in train
    outputs_3d = model_pos(inputs_2d)
  File "/home/hkuit155/anaconda3/envs/poseaug/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hkuit155/Documents/PoseAug/models_baseline/gcn/sem_gcn.py", line 104, in forward
    out = self.gconv_input(x)
  File "/home/hkuit155/anaconda3/envs/poseaug/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hkuit155/anaconda3/envs/poseaug/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/hkuit155/anaconda3/envs/poseaug/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hkuit155/Documents/PoseAug/models_baseline/gcn/sem_gcn.py", line 28, in forward
    x = self.gconv(x).transpose(1, 2)
  File "/home/hkuit155/anaconda3/envs/poseaug/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hkuit155/Documents/PoseAug/models_baseline/gcn/sem_graph_conv.py", line 43, in forward
    output = torch.matmul(adj * M, h0) + torch.matmul(adj * (1 - M), h1)
RuntimeError: invalid argument 6: wrong matrix size at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:492

The same error also occurs when training ST-GCN:

Traceback (most recent call last):
  File "run_baseline.py", line 102, in <module>
    main(args)
  File "run_baseline.py", line 65, in main
    glob_step, args.lr_decay, args.lr_gamma, max_norm=args.max_norm)
  File "/home/hkuit155/Documents/PoseAug/function_baseline/model_pos_train.py", line 41, in train
    outputs_3d = model_pos(inputs_2d)
  File "/home/hkuit155/anaconda3/envs/poseaug/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hkuit155/Documents/PoseAug/models_baseline/models_st_gcn/st_gcn_single_frame_test.py", line 461, in forward
    x = torch.matmul(x, C)  # nx2x17
RuntimeError: invalid argument 6: wrong matrix size at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:492
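One possible reading of the two "wrong matrix size" errors, consistent with the joint-count mismatch identified later in this thread, is that a graph built for 16 joints is being multiplied with features for 17 joints. The sketch below only illustrates the inner-dimension rule in plain Python. The shapes (a 16x16 adjacency matrix, batch size 64, 128 channels) and the helper name `matmul_compatible` are assumptions for illustration, not values taken from the PoseAug code.

```python
def matmul_compatible(left_shape, right_shape):
    """Return True if the inner dimensions of a matrix product agree.

    For operands with at least two dimensions, the last dimension of the
    left operand must equal the second-to-last dimension of the right one.
    """
    return left_shape[-1] == right_shape[-2]


# Assumed shapes, for illustration only.
adj_shape = (16, 16)       # adjacency matrix for a 16-joint skeleton
feat_17 = (64, 17, 128)    # (batch, joints, channels) with 17 joints
feat_16 = (64, 16, 128)    # the same features with 16 joints

print(matmul_compatible(adj_shape, feat_17))  # False: 16 != 17
print(matmul_compatible(adj_shape, feat_16))  # True
```

If the inner dimensions disagree like this, the fix is on the data side (matching joint counts) rather than in the model code.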

CheungBH commented 3 years ago

Also, when I tried to train VideoPose3D, I got a different error:

==> Using settings Namespace(action_wise=True, actions='*', batch_size=1024, checkpoint='./checkpoint/pretrain_baseline', dataset='h36m', downsample=1, dropout=0.25, epochs=50, evaluate='', keypoints='gt', lr=0.001, lr_decay=100000, lr_gamma=0.96, max_norm=True, note='pretrain', num_workers=2, posenet_name='videopose', pretrain=False, random_seed=0, s1only=False, snapshot=25, stages=4)
==> Loading dataset...
==> Preparing data...
==> Loading 2D detections...
Generating 1559752 poses...
Generating 543344 poses...
Generating 2929 poses...
==> Creating PoseNet model...
create model: videopose
==> Total parameters for model videopose: 8.49M
==> Prepare optimizer...
==> Making checkpoint dir: ./checkpoint/pretrain_baseline/videopose/gt/0701165527_pretrain

Epoch: 1 | LR: 0.00100000
Traceback (most recent call last):
  File "run_baseline.py", line 102, in <module>
    main(args)
  File "run_baseline.py", line 65, in main
    glob_step, args.lr_decay, args.lr_gamma, max_norm=args.max_norm)
  File "/home/hkuit155/Documents/PoseAug/function_baseline/model_pos_train.py", line 41, in train
    outputs_3d = model_pos(inputs_2d)
  File "/home/hkuit155/anaconda3/envs/poseaug/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hkuit155/Documents/PoseAug/models_baseline/videopose/model_VideoPose3D.py", line 81, in forward
    x = x.view(x.shape[0], 1, 16, 2)  # 0924
RuntimeError: shape '[1024, 1, 16, 2]' is invalid for input of size 34816

How can I fix this? Thank you.
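The size in the VideoPose3D error can be factored to see how many joints each input pose carries. The following is a minimal sketch of that arithmetic, assuming two coordinates (x, y) per joint; the helper name `joints_per_pose` is hypothetical and not part of PoseAug.

```python
def joints_per_pose(total_elements, batch_size, coords_per_joint=2):
    """Infer the joint count from a flat element count.

    Returns None if the element count does not divide evenly.
    """
    per_pose, remainder = divmod(total_elements, batch_size * coords_per_joint)
    return per_pose if remainder == 0 else None


observed = joints_per_pose(34816, 1024)  # size reported in the error message
expected_size = 1024 * 1 * 16 * 2        # size that view(1024, 1, 16, 2) requires

print(observed)       # 17
print(expected_size)  # 32768
```

So the batch holds 17 joints per pose, while the `view` call expects 16.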

CheungBH commented 3 years ago

My environment is: Ubuntu 16.04, Python 3.6.9, torch 1.0.1.post2, torchvision 0.2.2, cudatoolkit 10.1.243.

CheungBH commented 3 years ago

I found that the 2D and 3D datasets contain 17 and 16 joints respectively. Is that expected?
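A joint count like this can be checked directly on the 2D pose file. The sketch below assumes the VideoPose3D-style layout, in which the `.npz` file stores a pickled dict under the key `positions_2d`, mapping subject to action to a list of per-camera arrays of shape (frames, joints, 2). That key and layout are assumptions about the file, and both helper names are hypothetical.

```python
def load_positions_2d(path):
    """Load the nested 2D-pose dict from an .npz file (assumed layout, see above)."""
    import numpy as np  # imported lazily so joint_counts works without NumPy
    return np.load(path, allow_pickle=True)["positions_2d"].item()


def joint_counts(positions):
    """Collect the joint counts found in a subject -> action -> [clips] mapping.

    Each clip is indexed as clip[frame][joint], so len(clip[0]) is the
    number of joints in that clip.
    """
    counts = set()
    for actions in positions.values():
        for clips in actions.values():
            for clip in clips:
                if len(clip) > 0:
                    counts.add(len(clip[0]))
    return counts


# Tiny synthetic example: one subject, one action, one clip of 2 frames x 17 joints.
fake = {"S1": {"Walking": [[[[0.0, 0.0]] * 17] * 2]}}
print(joint_counts(fake))  # {17}
```

On a real file this would be called as `joint_counts(load_positions_2d("data_2d_h36m_gt.npz"))`; a result other than `{16}` would point to the mismatch described here.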

Garfield-kh commented 3 years ago

The 2D and 3D datasets should both use the 16-joint definition. May I ask whether you generated the 2D pose data data_2d_h36m_gt.npz with ./data/prepare_data_h36m.py or copied it from somewhere else?

CheungBH commented 3 years ago

Oh, I copied it from the VideoPose3D repo, which I had run before. Is the preprocessing code different?

CheungBH commented 3 years ago

Thank you. Problem solved. Maybe you could add a note about this, since the preprocessing code is slightly different from VideoPose3D's.

Garfield-kh commented 3 years ago

Sure, thank you for the suggestion.

jfzhang95 / PoseAug

Error when training model #6