syuoni / eznlp

Easy Natural Language Processing
Apache License 2.0

A question about the second item of biaffine formula. #48

Closed EeyoreLee closed 4 months ago

EeyoreLee commented 4 months ago

@syuoni - Hi, thanks in advance for your work on the special regularization technique for span-based NER. I'm confused about why the biaffine formula is xWy + xyU + b rather than xWy + b or [x;1]W[y;1]. What is the purpose of the second term? I think it may model an interaction between the tokens in a pair. Do you have any insights about it? Any reply will be appreciated.

EeyoreLee commented 4 months ago

Oops, I misread it. The second term is (x⊕y)U, and I found that you also add a relative position encoding w_{j-i}, right? It still confuses me why not just use [x;1]W[y;1].

syuoni commented 4 months ago

In [x;1]W[y;1], you should also expand two of the W's dimensions by 1, right? If so, the extended part is equivalent to U and b in the former formula.

I think the two implementations should be equivalent.

This is similar to how most people implement a linear layer with an explicit bias: y = Wx + b. You may alternatively implement it as y = W[x;1] with W extended by one column.
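This bias-absorption trick is easy to check numerically; here is a minimal NumPy sketch (the variable names and sizes are my own, not from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 5, 3

W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
x = rng.normal(size=d_in)

# Explicit bias: y = Wx + b
y1 = W @ x + b

# Absorbed bias: append b as an extra column of W and a constant 1 to x
W_ext = np.concatenate([W, b[:, None]], axis=1)  # shape (d_out, d_in + 1)
x_ext = np.concatenate([x, [1.0]])               # shape (d_in + 1,)
y2 = W_ext @ x_ext

print(np.allclose(y1, y2))  # True
```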

EeyoreLee commented 4 months ago

I think [x;1]W[y;1] is equivalent to xWy + xUy + b, but in your paper, xWy + U(x⊕y⊕w) + b is mentioned in Section 3. Regarding dimensions: if x and y are both of shape (1,5), then [x;1]W[y;1] gives (1,6)(6,6)(6,1) -> 1, but the dimensions of the terms in the second formula don't seem to match. With x and y still of shape (1,5), xWy gives (1,5)(5,5)(5,1) -> 1, but U matmul (x⊕y⊕w) will obviously yield a vector. So I don't know how to understand this. BTW, ⊕ means concat, am I right?

syuoni commented 4 months ago

Let's use the original symbols in the paper (W and U are swapped in our above conversations).

The biaffine score is computed as:

r = x U y + W [x; y; w] + b

where U is of shape (d, c, d), W is of shape (c, 2d+d_w), and b is of shape (c). Thus, the output r is of shape (c). x and y are of shape (d), and w is of shape (d_w). Note that c is the number of entity categories.

In your example, U is of shape (5, c, 5), W is of shape (c, 10+d_w).
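A minimal NumPy sketch of this score computation with the shapes above (the einsum spelling and variable names are my own, not the repo's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_w, c = 5, 4, 3  # hidden size, position-encoding size, number of entity categories

x = rng.normal(size=d)
y = rng.normal(size=d)
w = rng.normal(size=d_w)

U = rng.normal(size=(d, c, d))
W = rng.normal(size=(c, 2 * d + d_w))
b = rng.normal(size=c)

# Bilinear term: contract x and y against the 3D tensor U -> shape (c,)
bilinear = np.einsum('i,icj,j->c', x, U, y)

# Linear term on the concatenation [x; y; w] -> shape (c,)
linear = W @ np.concatenate([x, y, w])

r = bilinear + linear + b
print(r.shape)  # (3,)
```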


And if we alternatively implement biaffine as:

r = [x;1] U [y;1]

The U will be of shape (d+1, c, d+1), namely (6, c, 6) in your example. The output r is still of shape (c).

syuoni commented 4 months ago

And yes, ⊕ means concat.

EeyoreLee commented 4 months ago

@syuoni - Thanks for clarifying. I have no doubts about the dimensions part. But [x;1] U [y;1] must contain a term like x_i * ∑ u_j * y_i, so if we expand [x;1] U [y;1], it seems to be xUy + W(x*y) + b rather than xUy + W(x⊕y) + b, where * means element-wise multiplication. Is there something wrong above?

syuoni commented 4 months ago

Note that:

[x;1] U [y;1] = x U[:-1, :, :-1] y + x U[:-1, :, -1] + U[-1, :, :-1] y + U[-1, :, -1]

So, U[:-1, :, -1] and U[-1, :, :-1] can be combined as W, and U[-1, :, -1] is the bias.
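This expansion can be verified numerically; a quick NumPy check with the shapes from above (the index names are my own). Note that no product between x and y appears in the extra terms:

```python
import numpy as np

rng = np.random.default_rng(2)
d, c = 5, 3

x = rng.normal(size=d)
y = rng.normal(size=d)
U = rng.normal(size=(d + 1, c, d + 1))  # extended tensor, (6, c, 6) for d = 5

x1 = np.concatenate([x, [1.0]])
y1 = np.concatenate([y, [1.0]])

# Full extended form: [x;1] U [y;1]
full = np.einsum('i,icj,j->c', x1, U, y1)

# Four-term expansion: bilinear part + two linear parts + bias
expanded = (np.einsum('i,icj,j->c', x, U[:-1, :, :-1], y)
            + x @ U[:-1, :, -1]      # (d,) @ (d, c) -> (c,)
            + U[-1, :, :-1] @ y      # (c, d) @ (d,) -> (c,)
            + U[-1, :, -1])          # bias, shape (c,)

print(np.allclose(full, expanded))  # True
```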

syuoni commented 4 months ago

You may carefully check the last dimensions of U. In the extended part, there is no multiplication between x and y.

EeyoreLee commented 4 months ago

> Note that:
>
> [x;1] U [y;1] = x U[:-1, :, :-1] y + x U[:-1, :, -1] + U[-1, :, :-1] y + U[-1, :, -1]
>
> So, U[:-1, :, -1] and U[-1, :, :-1] can be combined as W, and U[-1, :, -1] is the bias.

That makes sense. Thanks so much, mate. The mistake I made was that I kept computing with a 2D matrix U, but it's actually a 3D tensor.