jchenghu / ExpansionNet_v2

Implementation code of the work "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning"
https://arxiv.org/abs/2208.06551
MIT License

why is the path set when i=1? #7

Closed x921338983 closed 1 year ago

x921338983 commented 1 year ago

Good Work! But I have a puzzle about why the path is set when i=1? When i=1, its value and the value of B^bw are close to zero. What is its purpose or function?

jchenghu commented 1 year ago

Hi,

> Good Work!

Thank you!

> But I have a puzzle about why the path is set when i=1?

The operation paths are set for both i=1 and i=2. The only difference lies in the sign of (-1)^{i} in the forward formula (see image below), so the backward B^{bw}_{i} only uses the results of the forward F^{fw}_{i}. The operations are therefore replicated, once for i=1 and once for i=2. The paper uses the (-1)^{i} notation trick for the sake of conciseness, but understandably, it can be confusing.

> When i=1, its value and the value of B^bw are close to zero. What is its purpose or function?

I assume you noticed that B^bw is close to zero in the model I uploaded on the Drive or in one of your experiments. This behaviour should not be an issue in terms of information loss, because the i=2 path exists as well.

I hope to clarify this aspect further by answering the question: "why are there two operation paths, one for i=1 and one for i=2?" I've implemented two operation paths (one for i=1 and another for i=2) where the only difference is the sign at this point: [image]. This was done to prevent the remote possibility of all coefficients in the length transformation matrix being set to zero by the ReLU, which happens, for instance, if all coefficients are negative. If that happens, although the input signal is lost for i=1, it is fully considered by the case i=2.
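
To make the idea concrete, here is a minimal sketch in plain Python. It is not the code from the repository: the function names and the toy matrices are hypothetical, and the matrix is just a list of rows standing in for the length transformation matrix. One path applies ReLU to the coefficients and the other applies ReLU to their negation; which of the two corresponds to i=1 or i=2 depends on the sign convention and is not fixed here.

```python
def relu(x):
    """Standard ReLU on a single float."""
    return max(0.0, x)


def two_path_coefficients(matrix):
    """Return (ReLU(M), ReLU(-M)) for a matrix given as a list of rows.

    Hypothetical helper for illustration only: the two results play the
    role of the two operation paths that differ only in the sign.
    """
    plus_path = [[relu(v) for v in row] for row in matrix]
    minus_path = [[relu(-v) for v in row] for row in matrix]
    return plus_path, minus_path


def is_all_zero(matrix):
    """True if every coefficient in the matrix is zero."""
    return all(v == 0.0 for row in matrix for v in row)


# Worst case described above: every coefficient is negative.
all_negative = [[-0.5, -1.2], [-0.1, -2.0]]
plus_path, minus_path = two_path_coefficients(all_negative)

# The path without the sign flip is wiped out by the ReLU,
# while the sign-flipped path keeps every magnitude.
print(is_all_zero(plus_path))
print(minus_path)

# With mixed signs, each coefficient survives in exactly one path.
mixed_plus, mixed_minus = two_path_coefficients([[0.3, -0.4]])
print(mixed_plus, mixed_minus)
```

In the all-negative case one path carries no signal at all, yet no coefficient is lost overall, because its magnitude survives in the other path.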

So basically, the existence of both i=1 and i=2 instead of only i=1 (or only i=2) is merely a "safety measure". In practice, I didn't observe much difference in performance, which suggests that the chance of all coefficients in the length transformation matrix being negative is very low. The extra path therefore only adds a little computational overhead.

So, regarding the first question: if you observed B^bw close to zero for i=1, that should not be a problem.

Let me know if it helps! Best regards, Jia Cheng

x921338983 commented 1 year ago

Thank you immensely for your thorough explanation and invaluable insights, which have significantly clarified my understanding of the operation paths and their implementation.

jchenghu commented 1 year ago

I'm glad it helped :-)

Feel free to open a new issue in case there are more questions!

Best regards, Jia Cheng