huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

about encoder and decoder input when using seq2seq model #6487

Closed jungwhank closed 4 years ago

jungwhank commented 4 years ago

❓ Questions & Help

Details

Hello, I'm trying to use a seq2seq model (such as BART or EncoderDecoderModel (bert2bert)), and I'm a little confused about input_ids, decoder_input_ids, and tgt in the model inputs.

As I understand it, in a seq2seq model the decoder input should have a special token (<s> or similar) before the sentence, and the target should have a special token (</s> or similar) after the sentence. For example: decoder_input = <s> A B C D E, target = A B C D E </s>

So my questions are:

  1. Should I put these special tokens in decoder_input_ids and tgt_ids when using a seq2seq model in this library, or can I just pass decoder_input_ids and tgt_ids without any special token ids?

  2. Also, should I use add_special_tokens=True for the encoder input_ids and put a <s> or </s> token around the target ids? For example: input = a b c d e, decoder_input = <s> A B C D E, target = A B C D E </s>
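The shift-by-one relationship described in the question (decoder input starts with a start token, labels end with an end token) can be sketched in plain Python. The token strings <s>/</s> and the helper name are illustrative; real models work on integer ids with model-specific special tokens:

```python
# Teacher-forcing pair for a seq2seq decoder: the decoder input is the
# target shifted right with a start token; the labels end with an end token.
BOS, EOS = "<s>", "</s>"

def make_decoder_pair(target_tokens):
    """Return (decoder_input, labels) for teacher forcing."""
    decoder_input = [BOS] + list(target_tokens)
    labels = list(target_tokens) + [EOS]
    return decoder_input, labels

dec_in, labels = make_decoder_pair(["A", "B", "C", "D", "E"])
print(dec_in)   # ['<s>', 'A', 'B', 'C', 'D', 'E']
print(labels)   # ['A', 'B', 'C', 'D', 'E', '</s>']
```

At each position the decoder sees the tokens up to step t and is trained to predict the label at step t, which is why the two sequences are offset by one.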

patil-suraj commented 4 years ago

Hi @jungwhank, for Bert2Bert the pad_token is used as the decoder_start_token_id, and the input_ids and labels begin with cls_token_id ([CLS] for BERT) and end with sep_token_id ([SEP] for BERT).

For training, all you need to do is:

input_text = "some input text"
target_text = "some target text"
input_ids = tokenizer(input_text,  add_special_tokens=True, return_tensors="pt")["input_ids"]
target_ids = tokenizer(target_text, add_special_tokens=True, return_tensors="pt")["input_ids"]
model(input_ids=input_ids, decoder_input_ids=target_ids, labels=target_ids)

The EncoderDecoderModel class takes care of adding the pad_token to the decoder_input_ids.
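That internal step can be sketched roughly as follows. This is a simplified illustration of shifting the labels right to build decoder_input_ids; the library's actual implementation also handles padding and -100 label masking, and the ids below are illustrative (101 = [CLS], 102 = [SEP], 0 = [PAD] in the bert-base-uncased vocab):

```python
def shift_tokens_right(label_ids, decoder_start_token_id):
    """Simplified sketch: build decoder_input_ids from labels by
    prepending the start token and dropping the final token."""
    return [decoder_start_token_id] + label_ids[:-1]

# Illustrative BERT-style ids: 101 = [CLS], 102 = [SEP], 0 = [PAD]
labels = [101, 2023, 2003, 102]
print(shift_tokens_right(labels, 0))  # [0, 101, 2023, 2003]
```

So even though decoder_input_ids=target_ids is passed above, the model internally shifts them, which is why you don't need to build the offset sequence yourself.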

For inference:

model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)

Hope this clarifies your question. Also pinging @patrickvonplaten for more info.

jungwhank commented 4 years ago

Hi @patil-suraj, thanks for answering. Is it the same for BartForConditionalGeneration? Actually, I want to do a kind of translation task; are decoder_input_ids and labels handled the same way there?

patrickvonplaten commented 4 years ago

@patil-suraj's answer is correct! For the EncoderDecoder framework, one should set model.config.decoder_start_token_id to the BOS token (which in BERT's case does not exist, so we simply use the CLS token).

Bart is a bit different:

model(input_ids, decoder_input_ids=decoder_input_ids)
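Bart performs the same kind of shift internally when building decoder inputs from labels, but its decoder start token is the eos token (</s>) rather than a pad token. A minimal sketch of the idea, assuming illustrative ids (0 = <s>, 2 = </s> as in the facebook/bart-base vocab); the real logic lives in the library's Bart modeling code and also handles padding:

```python
def bart_shift_tokens_right(label_ids, decoder_start_token_id):
    """Simplified sketch of building Bart decoder_input_ids from labels:
    prepend the decoder start token (</s> for Bart) and drop the last token."""
    return [decoder_start_token_id] + label_ids[:-1]

# Illustrative Bart-style ids: 0 = <s>, 2 = </s>
labels = [0, 713, 16, 2]
print(bart_shift_tokens_right(labels, 2))  # [2, 0, 713, 16]
```

The decoder input therefore starts with </s> followed by <s>, which matches Bart's pretraining convention.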

jungwhank commented 4 years ago

@patrickvonplaten thanks for answering! But I have a question: is there a decoder_start_token_id in BartConfig? Should I just make my decoder_input_ids start with Bart's model.config.bos_token_id, or set model.config.decoder_start_token_id = token_id?

jungwhank commented 4 years ago

I think I solved the problem. Thanks.

patil-suraj commented 4 years ago

@jungwhank Great! Consider joining the awesome HF forum if you haven't already :) It's the best place to ask such questions; the whole community is there to help you, and your questions will also help the community.