jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
MIT License

Questions on BiACM #23

Closed kforcodeai closed 1 year ago

kforcodeai commented 1 year ago

@jpWang, first of all, congratulations to all the authors on this great paper and milestone work; it truly justifies the title SIMPLE yet EFFECTIVE.

Questions

  1. From the paper I learned that LiLT uses BiACM to introduce cross-modality interaction, but I could not find that part in the code. If I understood correctly, the code uses separate Query, Key, and Value linear layers to compute attention scores for the text flow and the layout flow, then adds relative_distance_embeddings to each of them, and finally adds the two tmp attention scores to get the final attention scores:

    tmp_attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    tmp_layout_attention_scores = layout_attention_scores / math.sqrt(self.attention_head_size//self.channel_shrink_ratio)  
    attention_scores = tmp_attention_scores + tmp_layout_attention_scores
    layout_attention_scores = tmp_layout_attention_scores + tmp_attention_scores 

    Is this addition (tmp_layout_attention_scores + tmp_attention_scores) what performs the cross-modality interaction learning? Please share some thoughts on this.

  2. I understood that during pretraining LiLT stops gradient backpropagation through the text flow model, so during pretraining:

    # here `tmp_layout_attention_scores` won't be added, since we don't want the
    # layout flow to update attention_scores for the text flow
    # (alternatively, we can keep this line unchanged and just stop the gradient flow)
    attention_scores = tmp_attention_scores + tmp_layout_attention_scores
    # this addition will change, and `tmp_attention_scores` won't be added
    layout_attention_scores = tmp_layout_attention_scores

    Can you please confirm whether my understanding is correct?

jpWang commented 1 year ago

Hi,

  1. Yes.
  2. During pretraining,
    layout_attention_scores = tmp_layout_attention_scores + tmp_attention_scores

    is changed to

    layout_attention_scores = tmp_layout_attention_scores + tmp_attention_scores.detach()  
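
To make the overall picture concrete, here is a minimal sketch of the score sharing together with the pretraining-time detach (the function name, shapes, and `pretraining` flag are illustrative, not the exact code from this repo):

    import math
    import torch

    def share_attention_scores(text_scores, layout_scores, head_size, shrink_ratio, pretraining):
        # Scale each flow's raw scores, as in standard scaled dot-product attention.
        tmp_text = text_scores / math.sqrt(head_size)
        tmp_layout = layout_scores / math.sqrt(head_size // shrink_ratio)

        # The text flow always receives the layout scores in the forward pass.
        text_out = tmp_text + tmp_layout

        if pretraining:
            # detach(): the layout flow still sees the text scores in the forward
            # pass, but no gradient can flow back from the layout flow into them.
            layout_out = tmp_layout + tmp_text.detach()
        else:
            layout_out = tmp_layout + tmp_text
        return text_out, layout_out

    # Toy usage with (batch, heads, seq, seq) score tensors.
    t = torch.randn(1, 12, 4, 4, requires_grad=True)
    l = torch.randn(1, 12, 4, 4, requires_grad=True)
    text_out, layout_out = share_attention_scores(t, l, head_size=64, shrink_ratio=4, pretraining=True)
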
kforcodeai commented 1 year ago

Thanks @jpWang,

Just one confusion though; the paper says:

In order to maintain the ability of LiLT to cooperate with different off-the-shelf text models in fine-tuning as much as possible, we heuristically adopt the detached α^T_ij for α̃^L_ij, so that the textual stream will not be affected by the gradient of the non-textual one during pre-training, and its overall consistency can be preserved.

So how does detaching the textual attention scores only in the calculation of layout_attention_scores preserve the textual stream? If we wanted to preserve the textual stream, should we not change this line instead:

attention_scores = tmp_attention_scores + tmp_layout_attention_scores

because, in my understanding, these attention scores are responsible for the text stream, and

layout_attention_scores = tmp_layout_attention_scores + tmp_attention_scores.detach()  

is responsible for the layout stream.

If we keep attention_scores as it is, it has some contribution from tmp_layout_attention_scores, which is non-textual, so how do we preserve the textual stream?

Is my understanding correct? Please bear with me, I am new to this.

jpWang commented 1 year ago

The "overall consistency" mentioned in the paper means the optimization consistency of the text flow. We don't want gradients to back-propagate from layout_attention_scores to tmp_attention_scores and influence the optimization of the text part. On the contrary, attention_scores should guide the optimization of the layout flow (tmp_layout_attention_scores) as well as of itself (tmp_attention_scores). So the text flow is only influenced by itself during optimization; that's what we call "consistency".
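
You can verify this with a tiny toy example (plain tensors standing in for the score tensors, not the repo code): the gradient of a loss taken through layout_attention_scores never reaches the text scores, while a loss taken through attention_scores reaches both flows.

    import torch

    text = torch.randn(3, 3, requires_grad=True)    # stands in for tmp_attention_scores
    layout = torch.randn(3, 3, requires_grad=True)  # stands in for tmp_layout_attention_scores

    attention_scores = text + layout                  # text flow output
    layout_attention_scores = layout + text.detach()  # layout flow output (pretraining)

    # Back-propagate a loss from each flow's output.
    (attention_scores.sum() + layout_attention_scores.sum()).backward()

    print(text.grad)    # all ones: only the text-flow path contributed
    print(layout.grad)  # all twos: both flows contributed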

kforcodeai commented 1 year ago

Understood the concept, Sir, but in the code LiLT is adding tmp_layout_attention_scores to attention_scores:

attention_scores = tmp_attention_scores + tmp_layout_attention_scores

Does this not mean that tmp_layout_attention_scores (layout attention) has influence over attention_scores (text attention)?

Why only stop gradients from back-propagating from layout_attention_scores to tmp_attention_scores, and not to attention_scores as well?

Because, as per my understanding, attention_scores becomes attention_probs and then the context_layer, and for the next transformer layer these are the hidden_states, which will in turn be used to calculate tmp_attention_scores. Thus tmp_layout_attention_scores from the previous layer has made a contribution to the text flow in this layer.

What's wrong with my understanding?

jpWang commented 1 year ago

The layout flow needs to influence the text flow in inference but not in optimization; the text flow needs to influence the layout flow in both inference and optimization.

kforcodeai commented 1 year ago

I am sorry, but I feel too dumb now :) and am unable to get the picture. So when you say:

  1. The text flow needs to influence the layout flow in both inference and optimization: then why, during pretraining, detach the text attention scores? `layout_attention_scores = tmp_layout_attention_scores + tmp_attention_scores.detach()`

  2. The layout flow needs to influence the text flow in inference but not in optimization: then why not detach tmp_layout_attention_scores in the line below during pretraining? `attention_scores = tmp_attention_scores + tmp_layout_attention_scores`

What am I missing?

jpWang commented 1 year ago

`layout_attention_scores = tmp_layout_attention_scores + tmp_attention_scores.detach()` means 1) text influences layout in the inference calculation and 2) layout does not influence text in back-propagation. `attention_scores = tmp_attention_scores + tmp_layout_attention_scores` means 1) layout influences text in the inference calculation and 2) text influences layout in back-propagation.
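
A toy check of these four statements (two Linear modules standing in for the two flows; illustrative, not the repo code):

    import torch
    import torch.nn as nn

    text_proj = nn.Linear(8, 8)    # stands in for the text flow producing tmp_attention_scores
    layout_proj = nn.Linear(8, 8)  # stands in for the layout flow producing tmp_layout_attention_scores
    x = torch.randn(4, 8)

    tmp_text = text_proj(x)
    tmp_layout = layout_proj(x)

    attention_scores = tmp_text + tmp_layout                  # layout influences text in the forward pass
    layout_attention_scores = tmp_layout + tmp_text.detach()  # text influences layout in the forward pass

    # A loss on the text flow's output sends gradients into BOTH flows' parameters.
    attention_scores.sum().backward(retain_graph=True)
    print(text_proj.weight.grad is not None, layout_proj.weight.grad is not None)  # True True

    # A loss on the layout flow's output reaches only the layout flow's parameters.
    text_proj.weight.grad, layout_proj.weight.grad = None, None
    layout_attention_scores.sum().backward()
    print(text_proj.weight.grad is not None, layout_proj.weight.grad is not None)  # False True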

kforcodeai commented 1 year ago

You said `layout_attention_scores = tmp_layout_attention_scores + tmp_attention_scores.detach()` means 2) layout does not influence text in back-propagation; but we are detaching only tmp_attention_scores (the text part) from back-propagation, right? How does that stop tmp_layout_attention_scores or layout_attention_scores from participating in back-propagation?

kforcodeai commented 1 year ago

Did I ask something very dumb? @jpWang, can you please point out the mistake, or a resource that would help me understand?