huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How can I do conditional fine-tuning with GPT-2? #3715

Closed: toriving closed this issue 4 years ago

toriving commented 4 years ago

I can use run_generation.py to generate text conditioned on a context.

But is there a way to fine-tune based on a condition (context)? For example, when an input of the form "context [SEP] sentence" is given, the "context" would only be used to obtain hidden states, without contributing to learning, while the language model would be trained on the "sentence" alone.

patrickvonplaten commented 4 years ago

To me, this sounds more like a case where encoder-decoder models like T5 or Bart should be fine-tuned. The encoder would encode the "context" and the decoder would be teacher-forced on the "sentence".

toriving commented 4 years ago

To me, this sounds more like a case where encoder-decoder models like T5 or Bart should be fine-tuned. The encoder would encode the "context" and the decoder would be teacher-forced on the "sentence".

Thx very much :)

toriving commented 4 years ago

Is such logic already applied in the training code now?

enzoampil commented 4 years ago

@toriving I've successfully done "conditional" fine-tuning by adding a new token that indicates which portion of the sequence refers to the "context", similar to the [SEP] token used in the multi-sequence version of BERT.

E.g. here's how I apply this to prepare a dataset for training GPT2 to generate answers to riddle jokes:

<soq> Why did the chicken cross the road? <eoq> To go to the other side <|endoftext|>

The effect is that the answer (after <eoq>) is conditioned on the question that precedes it.
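
For readers following along, here is a minimal sketch (editorial illustration, not code from the thread) of how this data format could be fed to GPT-2 with the transformers library; the <soq>/<eoq> markers are the ones from the example above, registered as new special tokens:

```python
# Minimal sketch (illustrative, not taken from the thread) of preparing the
# <soq>/<eoq> format above for GPT-2 fine-tuning with transformers.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register the separator tokens and resize the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<soq>", "<eoq>"]})
model.resize_token_embeddings(len(tokenizer))

text = ("<soq> Why did the chicken cross the road? <eoq> "
        "To go to the other side <|endoftext|>")
inputs = tokenizer(text, return_tensors="pt")

# With labels == input_ids, the LM loss covers the whole sequence,
# question and answer alike (which is what the comment above describes).
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```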

toriving commented 4 years ago

@enzoampil When training on such data, is the "condition" also used in the loss function? I mean, I am wondering whether the language-model loss is also computed on the "condition" tokens.

enzoampil commented 4 years ago

Yes, if you specify it like above, it should be.

toriving commented 4 years ago

Okay. Thanks

manzar96 commented 4 years ago

To me, this sounds more like a case where encoder-decoder models like T5 or Bart should be fine-tuned. The encoder would encode the "context" and the decoder would be teacher-forced on the "sentence".

I would like to ask whether you think that using the encoder-decoder model with GPT-2 wrapped as both encoder and decoder would give reasonable results, or whether wrapping GPT-2 as the encoder is not a good idea (maybe use BERT as the encoder instead?).

patrickvonplaten commented 4 years ago

Currently only bert2bert is supported with the EncoderDecoder structure.
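
As a rough illustration (editorial sketch, not from the thread), the bert2bert setup mentioned here looks roughly like this with EncoderDecoderModel: the context goes through the encoder and the decoder is teacher-forced on the sentence.

```python
# Rough sketch (illustrative) of a bert2bert EncoderDecoder setup.
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
# These are typically required when training with labels.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# The encoder reads the context; the decoder is teacher-forced on the sentence.
context = tokenizer("Why did the chicken cross the road?", return_tensors="pt")
sentence = tokenizer("To go to the other side", return_tensors="pt")

outputs = model(
    input_ids=context["input_ids"],
    attention_mask=context["attention_mask"],
    decoder_input_ids=sentence["input_ids"],
    labels=sentence["input_ids"],
)
print(outputs.loss)
```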

manzar96 commented 4 years ago

@toriving I've successfully done "conditional" fine-tuning by adding a new token that indicates which portion of the sequence refers to the "context", similar to the [SEP] token used in the multi-sequence version of BERT.

E.g. here's how I apply this to prepare a dataset for training GPT2 to generate answers to riddle jokes:

<soq> Why did the chicken cross the road? <eoq> To go to the other side <|endoftext|>

The effect is that the answer (after <eoq>) is conditioned on the question that precedes it.

I would like to ask whether you mask the input part of the labels in the forward function. What I mean is: you presumably pass labels=input_ids to the forward function, so do you set only the padding tokens as masked (value -100), or do you mask the input tokens as well? Since we are trying to perform conditional generation, I think the loss should only count the reply(?).
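
For what it's worth, here is a sketch (editorial illustration, under the same <soq>/<eoq> assumptions as the example above, not a confirmed answer from the thread) of masking the question tokens in the labels with -100 so that the language-model loss is computed on the reply only:

```python
# Sketch of masking the question tokens in the labels with -100 so that the
# LM loss is computed on the reply only (illustrative, not from the thread).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<soq>", "<eoq>"]})
model.resize_token_embeddings(len(tokenizer))

question = "<soq> Why did the chicken cross the road? <eoq>"
reply = " To go to the other side <|endoftext|>"

q_ids = tokenizer(question, return_tensors="pt")["input_ids"]
r_ids = tokenizer(reply, return_tensors="pt")["input_ids"]

input_ids = torch.cat([q_ids, r_ids], dim=-1)
labels = input_ids.clone()
# Positions set to -100 are ignored by the cross-entropy loss.
labels[:, : q_ids.shape[-1]] = -100

outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)  # loss over the reply tokens only
```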