microsoft / Graphormer

Graphormer is a general-purpose deep learning backbone for molecular modeling.

Padding in case of different number of nodes in batch #87

Closed: ChantalMP closed this issue 2 years ago

ChantalMP commented 2 years ago

Hi,

I have a few questions about the node padding.

Firstly, is my assumption correct that adding the -inf values in "pad_attn_bias_unsqueeze" serves the same purpose as the attention_mask in BERT, i.e. it ensures there is no attention to padded nodes?

If this is correct, why do you add +1 to x in the padding functions? Since attention is restricted from attending to the padded nodes anyway, they could hold arbitrary values, so 0 could still be used as a regular feature value.

I am referring to the padding as in

def pad_2d_unsqueeze(x, padlen):
    x = x + 1  # pad id = 0 -> THIS LINE
    xlen, xdim = x.size()
    if xlen < padlen:
        # zero-fill up to padlen and copy the real node features in
        new_x = x.new_zeros([padlen, xdim], dtype=x.dtype)
        new_x[:xlen, :] = x
        x = new_x
    return x.unsqueeze(0)  # add a batch dimension

which is used to pad x.
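
To illustrate what I mean by the first question, here is a minimal sketch (not Graphormer's actual code; the names n_real, n_pad and attn_bias are illustrative) of why an additive -inf bias behaves like BERT's attention_mask, assuming the bias is added to the raw attention scores before the softmax as in standard Transformer attention. After the softmax, the weights on padded key positions are exactly zero, so whatever values the padded nodes hold can never reach the output.

import torch

n_real, n_pad = 2, 4                    # 2 real nodes padded up to length 4
scores = torch.randn(n_pad, n_pad)      # raw attention scores (queries x keys)

attn_bias = torch.zeros(n_pad, n_pad)
attn_bias[:, n_real:] = float("-inf")   # forbid attending TO padded keys

weights = torch.softmax(scores + attn_bias, dim=-1)
print(weights[:, n_real:])              # columns for the padded keys are all zero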

zhengsx commented 2 years ago

Thanks for using Graphormer. Yes, by convention we use zero to represent padding nodes.
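
To make the convention concrete, here is a rough sketch, under the assumption that the shifted feature values index into an embedding table whose entry 0 is reserved for padding (padding_idx=0); the variable names are illustrative, not taken from the repository. Shifting the raw features by +1 keeps the value 0 free to act purely as the pad id.

import torch
import torch.nn as nn

num_categories = 10                                        # raw feature values 0..9
emb = nn.Embedding(num_categories + 1, 8, padding_idx=0)   # row 0 reserved for padding

raw = torch.tensor([[0, 3], [5, 9]])             # two real nodes; 0 is a legal value
shifted = raw + 1                                # now 1..10, so 0 never collides

padded = torch.zeros(3, 2, dtype=torch.long)     # pad to 3 nodes with pad id 0
padded[:2] = shifted

vectors = emb(padded)
print(vectors[2])                                # pad rows map to the all-zero embedding

One side effect of keeping the pad id distinct is that padding_idx pins the pad row to the zero vector and excludes it from gradient updates, although, as discussed below, the attention mask alone already stops padded nodes from influencing the real nodes.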

ChantalMP commented 2 years ago

Hi, yes, my question is whether it is actually necessary to have a padding token distinct from the input tokens, given that the padded nodes are not attended to.

zhengsx commented 2 years ago

Yes, the padded nodes are not attended to, so you can assign them to an arbitrary category.
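
As a quick toy check of this (illustrative code, not the Graphormer pipeline): with the -inf bias in place, the outputs for the real nodes are identical no matter what is stored in the padded positions.

import torch

n_real, n_pad, d = 2, 4, 8
q, k, v = (torch.randn(n_pad, d) for _ in range(3))

bias = torch.zeros(n_pad, n_pad)
bias[:, n_real:] = float("-inf")             # mask attention to padded keys

def attend(keys, values):
    w = torch.softmax(q @ keys.t() / d ** 0.5 + bias, dim=-1)
    return w @ values

k2, v2 = k.clone(), v.clone()
k2[n_real:] = 123.0                          # overwrite padded nodes with garbage
v2[n_real:] = -7.0

same = torch.allclose(attend(k, v)[:n_real], attend(k2, v2)[:n_real])
print(same)                                  # True: real-node outputs are unchanged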

ChantalMP commented 2 years ago

Thanks