Hi,
Sure, the performance should be similar. You can also check that given the same weights the two implementations actually return exactly the same results. We have a test for this at https://github.com/idiap/fast-transformers/blob/f22c13716fc748bb21a7b226ada7f7b5f87f867f/tests/test_weight_mapper.py#L58 .
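For illustration, a minimal sketch of that kind of check, where hf_encoder and ft_encoder are placeholder names (not from either library) for the Huggingface BertEncoder and the fast-transformers TransformerEncoder, assumed to already share weights:

import torch

# Feed the same dummy batch through both encoders and compare elementwise.
x = torch.randn(4, 32, 768)  # (batch, sequence, hidden) dummy input
with torch.no_grad():
    y_hf = hf_encoder(x)[0]  # Huggingface encoders return a tuple / ModelOutput
    y_ft = ft_encoder(x)
print(torch.allclose(y_hf, y_ft, atol=1e-4))  # True if the two are equivalent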
Let me know if you are still experiencing problems.
Best, Angelos
Hi Angelos, thank you for your answer. Unfortunately I still face the problem. I copied the source of the huggingface bert implementation and only replaced the encoder with your encoder version set to 'full' attention, like this:
# Old
self.encoder = BertEncoder(config)
# New (TransformerEncoderBuilder imported from fast_transformers.builders)
self.encoder = TransformerEncoderBuilder.from_kwargs(
n_layers=12,
n_heads=12,
query_dimensions=64,
value_dimensions=64,
feed_forward_dimensions=3072,
attention_type="full",
final_normalization=False,
activation="gelu"
).get()
The encoder output is then used to do multiclass classification for relation extraction. What happens is that the evaluation F1 score is significantly worse with the fast-transformers full attention encoder. I have no explanation for this. Do you have any clue about it?
Just to make sure: if you run the test from my previous comment, does it pass? If it does, then there is probably a misconfiguration somewhere, because that test shows the models are exactly equivalent, so they should train exactly the same.
If I had to guess, judging by the names in the plot, the lighter one may be using linear attention instead of full?
Cheers, Angelos
Sorry for the late reply. The test passes on the machine I use for training.
About the plot: the lighter one is the fast-transformers implementation set to full attention and the darker one is the huggingface bert implementation without pretrained weights. From my understanding, the fast-transformers curve should be much closer to the huggingface bert one.
Well, if the test passes then that means that the networks are identical! This means that they should also train in exactly the same way. So I assume there is probably a bug in your configuration somewhere and the two networks are actually not the same.
In order to check, you should be able to copy the weights from the BertEncoder using the same code as in the test, and you should get exactly the same evaluation scores.
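If it helps, the copy is roughly along these lines; the attribute names below are from my reading of the two codebases, so treat this as a sketch and use the mapping in the test file above as the authoritative reference:

import torch

def copy_bert_encoder_weights(bert_encoder, fast_encoder):
    # Copy layer-by-layer from a Huggingface BertEncoder into a
    # fast-transformers TransformerEncoder with the same dimensions.
    for hf_layer, ft_layer in zip(bert_encoder.layer, fast_encoder.layers):
        attn = hf_layer.attention
        pairs = [
            (ft_layer.attention.query_projection, attn.self.query),
            (ft_layer.attention.key_projection,   attn.self.key),
            (ft_layer.attention.value_projection, attn.self.value),
            (ft_layer.attention.out_projection,   attn.output.dense),
            (ft_layer.linear1,                    hf_layer.intermediate.dense),
            (ft_layer.linear2,                    hf_layer.output.dense),
            (ft_layer.norm1,                      attn.output.LayerNorm),
            (ft_layer.norm2,                      hf_layer.output.LayerNorm),
        ]
        with torch.no_grad():
            for dst, src in pairs:
                dst.weight.copy_(src.weight)
                dst.bias.copy_(src.bias)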
Is it the same as this issue: https://github.com/idiap/fast-transformers/issues/103
I replicated the test with my classification network. First I trained the variant with the huggingface bert. Then I copied all the weights to the variant with your full attention. Additionally I set norm1.eps and norm2.eps to 1e-12.
I passed a batch from my dataset into the huggingface variant and compared it to the output of the full attention variant, and the outputs are exactly the same (your unit test is only close because the norm eps values differ).
Despite the fact that the two networks produce the same output with the same weights, the full attention variant learns considerably worse (F1=16.54) than the huggingface bert (F1=43.18).
I am still clueless as to why it behaves like this.
Sorry to bump this issue, but are there any updates on this?
I resolved the issue by finding a mistake in my configuration. Thanks for the help here.
First of all thank you for this amazing work!
In my research I am comparing different encoders for relation extraction. What I noticed is that the transformer implementation of this repo with full attention performs worse (regarding F1 score) than the huggingface bert implementation. I use an unpretrained huggingface bert. My expectation is that this setup should perform the same as an untrained bert from huggingface.
Is my expectation correct? Why does it perform worse?