huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Simple questions about EncoderDecoderModel #11424

Closed qute012 closed 3 years ago

qute012 commented 3 years ago

First, thank you for the great work!

  1. Does the tie function share the encoder's pretrained embedding weights with the decoder's embedding weights? https://github.com/huggingface/transformers/blob/52166f672ed337934d90cc525c226d93209e0923/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L185

If I want to use different tokenizers for the encoder and decoder inputs, does the tie function skip sharing the embeddings, similar to a strict option?

patil-suraj commented 3 years ago

Hi @qute012

The tie_weights method ties all the weights of the encoder and decoder, including the embeddings. For this to work, the encoder and decoder need to be the same model (same class), i.e. either BERT2BERT or ROBERTA2ROBERTA, and of the same size.
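For reference, a minimal sketch (not from the thread) of the tied setup described above, assuming the plain `tie_encoder_decoder` kwarg is forwarded into the combined `EncoderDecoderConfig` that the linked `tie_weights` checks:

```python
from transformers import EncoderDecoderModel

# Warm-start a BERT2BERT model where encoder and decoder are the same class and
# size, then request full weight tying (including the embeddings).
# Assumption: the un-prefixed `tie_encoder_decoder` kwarg ends up on the combined
# EncoderDecoderConfig, which tie_weights() checks before sharing parameters.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased",
    "bert-base-uncased",
    tie_encoder_decoder=True,
)

# When tying succeeded, the encoder and decoder word embeddings share storage.
enc_emb = model.encoder.get_input_embeddings()
dec_emb = model.decoder.get_input_embeddings()
print(enc_emb.weight.data_ptr() == dec_emb.weight.data_ptr())  # expected: True
```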

If I want to use different tokenizers for the encoder and decoder inputs, does the tie function skip sharing the embeddings, similar to a strict option?

No, it does not skip sharing the embeddings in that case because, as I wrote above, it expects both the encoder and decoder to be the same model, so it implicitly assumes that the tokenizer will also be the same.

But if that's what you want to do, you could manually untie the embeddings, or just re-initialize both of them so they won't be shared/tied.
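A minimal sketch of that manual approach (not from the thread; the model names and the separate decoder tokenizer are placeholders), resizing the decoder embeddings to its own vocabulary and re-initializing them so nothing is carried over from the pretrained weights:

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Placeholders: a BERT encoder and a BERT-style decoder that uses its own vocabulary.
encoder_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
decoder_tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")  # hypothetical choice

# tie_encoder_decoder is left at its default (False), so encoder and decoder
# weights are not tied; the decoder still starts from pretrained embeddings.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Resize the decoder's input embeddings (and its tied LM head) to the decoder vocabulary.
model.decoder.resize_token_embeddings(len(decoder_tokenizer))

# Re-initialize the decoder embeddings in place so they no longer carry the
# pretrained values; updating the tensor in place keeps the decoder's own
# input/output embedding tie intact.
model.decoder.get_input_embeddings().weight.data.normal_(
    mean=0.0, std=model.decoder.config.initializer_range
)
```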

qute012 commented 3 years ago

Thanks for the reply, @patil-suraj.

For example, is it right that the encoder's embedding weights can be updated through the decoder's inputs?

Then, if I want to untie them, should I manually remove or comment out the code below? https://github.com/huggingface/transformers/blob/52166f672ed337934d90cc525c226d93209e0923/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L183

Do you have any plans to add a parameter for choosing whether to tie in the EncoderDecoderModel class? I know bert2bert performs better than randomly initialized decoder embedding weights, but it requires new extensions for experiments where the encoder and decoder use different vocabularies. If that's okay with you, I will open a PR.
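For what it's worth, a minimal sketch (not from the thread) of how such a switch could look at the config level rather than by editing the source, assuming the linked tie_weights block only runs when config.tie_encoder_decoder is True:

```python
from transformers import EncoderDecoderModel

# Untied variant: with tie_encoder_decoder left at its default of False, the
# linked tie_weights block is skipped and the decoder keeps its own weights.
untied = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Tied variant: opt in through the config flag instead of editing
# modeling_encoder_decoder.py.
tied = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased", tie_encoder_decoder=True
)

print(untied.config.tie_encoder_decoder, tied.config.tie_encoder_decoder)  # False True
```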

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.