hulianyuyy / CorrNet

Continuous Sign Language Recognition with Correlation Network (CVPR 2023)

Question on Architecture Design #58

Closed ethio-artifical closed 2 weeks ago

ethio-artifical commented 3 weeks ago

Hello, how are you? First, thank you for the fast reply on my previous issue. You get me back on track every time.

My question today is about the architecture design. The paper says the model "first employs a feature extractor (2D CNN) to capture frame-wise features, and then adopts a 1D CNN and a BiLSTM to perform short-term and long-term temporal modeling, respectively, followed by a classifier to predict sentences."

But when I print the model, I see a 3D CNN in the feature extractor. As far as I understand, the paper and the code differ; correct me if I am wrong. Can you explain which architecture you actually use, and how you use or combine the 3D CNN with the 2D CNN?

hulianyuyy commented 3 weeks ago

We use a 3D ResNet as our backbone and always set the temporal kernel size to 1, so it always performs 2D convolutions to extract spatial features.
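To illustrate the point above, here is a minimal PyTorch sketch (not taken from the CorrNet code) showing that a 3D convolution with temporal kernel size 1 gives the same result as a 2D convolution applied to each frame separately. The tensor sizes and layer settings are made-up toy values, and the `(B, C, T, H, W)` input layout is an assumption for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes (hypothetical): batch, channels in/out, frames, height, width.
B, C_in, C_out, T, H, W = 2, 3, 4, 5, 8, 8

# 3D convolution whose kernel spans 1 frame in time and 3x3 pixels in space.
conv3d = nn.Conv3d(C_in, C_out, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False)
# Plain 2D convolution with the same spatial kernel.
conv2d = nn.Conv2d(C_in, C_out, kernel_size=3, padding=1, bias=False)

# Copy the 3D weights (C_out, C_in, 1, 3, 3) into the 2D layer
# by dropping the size-1 temporal axis.
with torch.no_grad():
    conv2d.weight.copy_(conv3d.weight.squeeze(2))

x = torch.randn(B, C_in, T, H, W)  # video clip, assumed (B, C, T, H, W) layout

with torch.no_grad():
    out3d = conv3d(x)  # (B, C_out, T, H, W)
    # Fold time into the batch axis, run the 2D convolution on every frame,
    # then restore the (B, C_out, T, H, W) layout.
    frames = x.permute(0, 2, 1, 3, 4).reshape(B * T, C_in, H, W)
    out2d = conv2d(frames).reshape(B, T, C_out, H, W).permute(0, 2, 1, 3, 4)

# True when the two outputs agree up to floating-point tolerance.
same = torch.allclose(out3d, out2d, atol=1e-4)
print(same)
```

Because the kernel covers only one frame along the time axis, no information is mixed between frames, which is why the 3D backbone can be described as a 2D feature extractor.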

ethio-artifical commented 3 weeks ago

What is the point of using a 3D ResNet as the backbone? Why not just use a 2D ResNet?

hulianyuyy commented 3 weeks ago

We previously tested temporal kernel sizes > 1 and saw worse results, so we kept the ResNet in its 3D form for convenience. We now set the temporal kernel size to 1 so that it behaves exactly like a 2D backbone.

ethio-artifical commented 3 weeks ago

Thanks a lot, my friend, for the effort you put into my question and for your fast reply. Many thanks from the bottom of my heart.

I got it, thank you. I also want to try GRU and BiGRU; how can I do this in your code?

hulianyuyy commented 3 weeks ago

You can directly replace the temporal_model in slr_network.py with a GRU or BiGRU, as you wish.
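As a rough sketch of what such a replacement could look like, the module below wraps PyTorch's `nn.GRU`. The class name `GRUTemporalModel`, its constructor arguments, the `(T, B, C)` input layout, and the dictionary it returns are all assumptions made for illustration; this thread does not show the interface of the existing temporal_model, so check slr_network.py and match whatever it actually expects and returns.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class GRUTemporalModel(nn.Module):
    """Hypothetical temporal module built on nn.GRU (GRU or BiGRU)."""

    def __init__(self, input_size, hidden_size, num_layers=2,
                 bidirectional=True, dropout=0.3):
        super().__init__()
        num_directions = 2 if bidirectional else 1
        # Split the hidden units across directions so the output width
        # is hidden_size for both GRU and BiGRU.
        self.rnn = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size // num_directions,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0.0,
            bidirectional=bidirectional,
        )

    def forward(self, feats, lengths):
        # feats: (T, B, C) padded frame features (assumed layout).
        # lengths: (B,) number of valid time steps per sequence.
        packed = pack_padded_sequence(feats, lengths.cpu(), enforce_sorted=False)
        packed_out, hidden = self.rnn(packed)
        # Back to a padded (T_max, B, hidden_size) tensor; padding is zero.
        outputs, _ = pad_packed_sequence(packed_out)
        # The dictionary keys are an assumption, not the repository's API.
        return {"predictions": outputs, "hidden": hidden}


# Usage with toy sizes: a BiGRU and a unidirectional GRU.
feats = torch.randn(20, 2, 512)      # 20 time steps, batch of 2, 512 features
lengths = torch.tensor([20, 15])     # second sequence has 5 padded steps

bigru = GRUTemporalModel(512, 512, bidirectional=True).eval()
gru = GRUTemporalModel(512, 512, bidirectional=False).eval()

with torch.no_grad():
    out_bi = bigru(feats, lengths)
    out_uni = gru(feats, lengths)

print(out_bi["predictions"].shape, out_uni["predictions"].shape)
```

Setting `bidirectional=False` gives a plain GRU and `bidirectional=True` gives a BiGRU; in both cases the output feature width stays at `hidden_size`, so a classifier that follows would not need to change in this sketch.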