hulianyuyy / CorrNet

Continuous Sign Language Recognition with Correlation Network (CVPR 2023)

Question on Architecture Design #58

Closed: ethio-artifical closed this issue 5 months ago

ethio-artifical commented 5 months ago

Hello, how are you? First, thank you very much for the fast replies on my issues. You keep me on track every time.

My question today is about the architecture design. The paper says the model "first employs a feature extractor (2D CNN) to capture frame-wise features, and then adopts a 1D CNN and a BiLSTM to perform short-term and long-term temporal modeling, respectively, followed by a classifier to predict sentences."

But when I print the model, I see a 3D CNN in the feature extractor. As I understand it, the paper and the code differ; please correct me if I am wrong. Can you explain which architecture you use, and how you use or combine the 3D CNN with the 2D CNN?

hulianyuyy commented 5 months ago

We use a 3D ResNet as our backbone and always set the temporal kernel size to 1, so it always performs 2D convolutions to extract spatial features.
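For readers who want to check this equivalence themselves, here is a minimal PyTorch sketch. It is not taken from the CorrNet code: the tensor sizes and variable names are made up for illustration, and it only compares a single `nn.Conv3d` layer with kernel size `(1, 3, 3)` against an `nn.Conv2d` layer that shares its weights and is applied frame by frame.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes for illustration only (not taken from the CorrNet code).
batch, c_in, c_out, frames, height, width = 2, 3, 8, 5, 16, 16

# 3D convolution whose temporal kernel size is 1: kernel = (T=1, H=3, W=3).
conv3d = nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1))

# 2D convolution that reuses the same weights (singleton temporal dim dropped).
conv2d = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
with torch.no_grad():
    conv2d.weight.copy_(conv3d.weight.squeeze(2))
    conv2d.bias.copy_(conv3d.bias)

video = torch.randn(batch, c_in, frames, height, width)  # (N, C, T, H, W)

with torch.no_grad():
    out3d = conv3d(video)  # (N, C_out, T, H, W)
    # Apply the 2D conv to each frame independently, then restack along time.
    out2d = torch.stack([conv2d(video[:, :, t]) for t in range(frames)], dim=2)

# The two outputs should agree up to floating-point tolerance.
same = torch.allclose(out3d, out2d, atol=1e-5)
```

With a temporal kernel size of 1 and no temporal padding, each output frame depends only on the matching input frame, which is why the 3D layer acts as a per-frame 2D convolution.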

ethio-artifical commented 5 months ago

What is the benefit of using a 3D ResNet as the backbone? Why not just use a 2D ResNet?

hulianyuyy commented 5 months ago

We previously tested temporal kernel sizes > 1 and saw worse results, so we kept the ResNet in its 3D form for convenience. We now set the temporal kernel sizes to 1 so that it behaves exactly like a 2D backbone.

ethio-artifical commented 5 months ago

Thanks a lot, my friend, for the effort you put into my question and for your fast reply. Many thanks from the bottom of my heart.

I got it, thank you. I also want to try out GRU and BiGRU. How can I do this in your code?

hulianyuyy commented 5 months ago

You can directly replace the temporal_model in slr_network.py with a GRU or BiGRU, as you prefer.
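As a rough starting point, the sketch below shows what such a replacement module could look like in PyTorch. The class name `GRUTemporalModel`, its constructor arguments, and its interface (time-major features of shape `(T, B, C)` plus a tensor of valid lengths) are assumptions made for illustration, not the repository's actual API. Check how `temporal_model` is constructed and called in slr_network.py, and adapt the inputs and return format to match.

```python
import torch
import torch.nn as nn


class GRUTemporalModel(nn.Module):
    """Hypothetical temporal model; name and interface are assumptions."""

    def __init__(self, input_size, hidden_size, num_layers=2,
                 bidirectional=True, dropout=0.3):
        super().__init__()
        directions = 2 if bidirectional else 1
        # Split hidden units across directions so the output width
        # stays equal to hidden_size for both GRU and BiGRU.
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size // directions,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0.0,
            bidirectional=bidirectional,
        )

    def forward(self, feats, lengths):
        # feats: (T, B, C) padded features; lengths: valid length per sequence.
        packed = nn.utils.rnn.pack_padded_sequence(
            feats, lengths.cpu(), enforce_sorted=False)
        packed_out, hidden = self.gru(packed)
        # Unpack to (T, B, hidden_size); padded time steps are zero-filled.
        out, _ = nn.utils.rnn.pad_packed_sequence(
            packed_out, total_length=feats.size(0))
        return out, hidden


# Usage with made-up sizes: 20 time steps, batch of 4, 64 channels.
feats = torch.randn(20, 4, 64)
lengths = torch.tensor([20, 18, 15, 9])

bigru = GRUTemporalModel(64, 64, bidirectional=True).eval()
gru = GRUTemporalModel(64, 64, bidirectional=False).eval()
with torch.no_grad():
    bigru_out, _ = bigru(feats, lengths)
    gru_out, _ = gru(feats, lengths)
```

Setting `bidirectional=False` gives a plain GRU and `bidirectional=True` gives a BiGRU. Keeping the output width at `hidden_size` in both cases means a downstream classifier sized for `hidden_size` would not need to change, assuming the original temporal model also outputs `hidden_size` features.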