hulianyuyy / CorrNet

Continuous Sign Language Recognition with Correlation Network (CVPR 2023)

Cannot reproduce results on phoenix2014 dataset #20

Closed. atonyo11 closed this issue 7 months ago.

atonyo11 commented 7 months ago

Hi. I tried running the default code to train on the phoenix2014 dataset. This is the log file: log.txt

However, I cannot get the same results as the paper (about 1% worse than reported).

[ Thu Nov 16 04:56:27 2023 ] Dev WER: 20.00%
[ Thu Nov 16 04:56:27 2023 ] Best_dev: 19.80, Epoch : 27

What parameters should I change to reproduce the reported results? Thanks!
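For context, the "Dev WER" in the log above is the standard word error rate: the word-level edit distance between the reference gloss sequence and the hypothesis, normalized by reference length. Below is a minimal sketch of that definition; the repo's own evaluation scripts may compute the score differently, so treat this as illustration only.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance between word lists, as a percentage
    of the reference length. A sketch of the standard definition only."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

# One deletion out of four reference words -> 25.0
print(wer("LIEBE ZUSCHAUER GUTEN ABEND", "LIEBE ZUSCHAUER ABEND"))
```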

atonyo11 commented 7 months ago

By the way, I want to ask about the baseline.yaml file:

    decode_mode: beam
    model_args:
      num_classes: 1296
      c2d_type: resnet18 # resnet18, mobilenet_v2, squeezenet1_1, shufflenet_v2_x1_0, efficientnet_b1, mnasnet1_0, regnet_y_800mf, vgg16_bn, vgg11_bn, regnet_x_800mf, regnet_x_400mf, densenet121, regnet_y_1_6gf
      conv_type: 2
      use_bn: 1

Is num_classes: 1296 correct as the default? I see in the paper that the phoenix2014 dataset has "a vocabulary of 1295 signs".

    feeder_args:
      mode: 'train'
      datatype: 'video'
      num_gloss: -1
      drop_ratio: 1.0
      frame_interval: 1
      image_scale: 1.0 # 0-1 represents ratio, >1 represents absolute value
      input_size: 224

Also, input_size is 224, so why do we have to convert images to 256x256? Thanks!

percise commented 7 months ago

I encountered the same issue; there is about a 1% shortfall in the final training results. How can I resolve this? Here is the log: log.txt

hulianyuyy commented 7 months ago

> Is num_classes: 1296 correct as the default? I see in the paper that the phoenix2014 dataset has "a vocabulary of 1295 signs".
>
> input_size is 224, so why do we have to convert images to 256x256?

num_classes is actually calculated adaptively at line 51 in main.py, so the value in the YAML has no effect. As for the input size, a 224×224 window is randomly cropped from the 256×256 input during training, as a standard data-augmentation step.
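To illustrate both answers, here is a minimal sketch using torchvision. The repo's actual dataloader may differ in details, and the `len(gloss_dict) + 1` line is my assumption about what main.py computes (1295 glosses plus one CTC blank would give exactly the 1296 in the YAML); the tiny gloss_dict here is hypothetical.

```python
import torch
from torchvision import transforms

# Assumption: main.py derives the class count from the gloss dictionary,
# e.g. 1295 glosses + 1 CTC blank = 1296. Hypothetical tiny vocabulary:
gloss_dict = {"ICH": 0, "HEUTE": 1}
num_classes = len(gloss_dict) + 1  # +1 for the CTC blank symbol

# Training: random 224x224 crop out of the 256x256 frames;
# testing: deterministic center crop, so evaluation is repeatable.
train_crop = transforms.RandomCrop(224)
test_crop = transforms.CenterCrop(224)

frame = torch.rand(3, 256, 256)   # one 256x256 RGB frame
print(train_crop(frame).shape)    # torch.Size([3, 224, 224])
print(test_crop(frame).shape)     # torch.Size([3, 224, 224])
```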

hulianyuyy commented 7 months ago

As for the training discrepancy, there is no guarantee that you will get exactly the same results across different platforms; results are affected by hardware, software versions, and so on. Usually, though, the performance gap is within 1%. In fact, you may even get different results across two runs on the same machine. I haven't found an effective way to solve this problem.
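For what it's worth, pinning the usual PyTorch randomness sources narrows (but does not eliminate) run-to-run variance. This is a generic sketch, not something the repo necessarily does by default:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin the common sources of randomness. Even with all of these set,
    some CUDA kernels (cuDNN CTC loss among them) are non-deterministic,
    so small run-to-run WER differences can still persist."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True  # pick deterministic kernels
    torch.backends.cudnn.benchmark = False     # disable autotuned kernels

seed_everything(42)
```

Note that setting cudnn.benchmark = False usually costs some training speed, which is the trade-off for more repeatable runs.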

atonyo11 commented 7 months ago

I got it. Thank you!