huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Pretrain PEGASUS from scratch #8536

Closed. EKebriaei closed this issue 3 years ago.

EKebriaei commented 3 years ago

I want to pre-train a PEGASUS model from scratch on a language other than English. Is there any way to do this using the huggingface APIs? The source code released by the authors is complicated to use for pre-training, and there is little documentation available on how to do this.

LysandreJik commented 3 years ago

@patil-suraj or @patrickvonplaten can chime in if I'm wrong, but I believe we currently only have fine-tuning & distillation schemes for the BART-family models, no pre-training.

patrickvonplaten commented 3 years ago

Hey @EKebriaei - yeah we sadly don't have any pre-training notebooks for pegasus yet. Are you looking for the summary specific pre-training of pegasus or just the BART-like denoising pre-training?

EKebriaei commented 3 years ago

> Hey @EKebriaei - yeah we sadly don't have any pre-training notebooks for pegasus yet. Are you looking for the summary specific pre-training of pegasus or just the BART-like denoising pre-training?

I want to pre-train pegasus on a language other than English.

patrickvonplaten commented 3 years ago

Yeah, we don't have a script or good documentation for this yet.

cc https://github.com/huggingface/transformers/issues/8594#issuecomment-731248819

EKebriaei commented 3 years ago

> Yeah, we don't have a script or good documentation for this yet.
>
> cc #8594 (comment)

I have some dependency problems when compiling this: https://github.com/google-research/pegasus/blob/master/pegasus/ops/pretrain_parsing_ops.cc Do you have any suggestions that might help?

patrickvonplaten commented 3 years ago

This PR will enable a pretraining script: #8731

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.

Skylixia commented 3 years ago

> Yeah, we don't have a script or good documentation for this yet.
>
> cc #8594 (comment)

Could we follow the same approach you (@patrickvonplaten) provided here for pretraining BART, but for PEGASUS? PEGASUS also has a GSG (gap-sentence generation) pre-training objective on top of the BART-like denoising, as detailed in the original paper.
GSG works by masking the most important sentences, selected according to ROUGE; the target is then the missing sentences. So my attempt at adapting your code would be:

from transformers import PegasusTokenizer, PegasusForConditionalGeneration, PegasusConfig

tok = PegasusTokenizer.from_pretrained("google/pegasus-large")  # reuse the tokenizer from a released checkpoint
model = PegasusForConditionalGeneration(PegasusConfig())

input_string = ["Pegasus is <mask_2> . <mask_1> it <mask_2> the model ."
decoder_input_string = "<s> It is pure white ."
labels_string = "It is pure white . <eos>"

input_ids = tok(input_string, add_special_tokens=False, return_tensors="pt").input_ids
decoder_input_ids = tok(decoder_input_string, add_special_tokens=False, return_tensors="pt").input_ids
labels = tok(labels_string, add_special_tokens=False, return_tensors="pt").input_ids

loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]

Does this look reasonable (the selection strategy of masked sentences will naturally need to be implemented)? @patrickvonplaten
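As a rough illustration of that selection strategy, the sketch below scores each sentence by its ROUGE-1 overlap with the rest of the document and masks the top-scoring ones, in the spirit of the independent gap-sentence selection variant described in the PEGASUS paper. The select_gap_sentences and build_gsg_example helpers, the rouge_score dependency, the gap_ratio value, and the "<mask_1>" placeholder are illustrative assumptions, not part of the original PEGASUS code:

# Rough sketch of ROUGE-based gap-sentence selection: score every sentence
# against the rest of the document and mask the top-scoring ones.
from rouge_score import rouge_scorer

def select_gap_sentences(sentences, gap_ratio=0.3):
    """Return the indices of sentences to mask, ranked by ROUGE-1 F1
    against the remainder of the document (scored independently)."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scores = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append((scorer.score(rest, sent)["rouge1"].fmeasure, i))
    n_gap = max(1, int(len(sentences) * gap_ratio))
    return {i for _, i in sorted(scores, reverse=True)[:n_gap]}

def build_gsg_example(sentences, mask_token="<mask_1>"):
    """Replace the selected sentences with the sentence-mask token; the
    concatenation of the removed sentences becomes the target."""
    gap_idx = select_gap_sentences(sentences)
    source = " ".join(mask_token if i in gap_idx else s for i, s in enumerate(sentences))
    target = " ".join(s for i, s in enumerate(sentences) if i in gap_idx)
    return source, target

sentences = [
    "Pegasus is mythical .",
    "It is pure white .",
    "It names the model .",
]
src, tgt = build_gsg_example(sentences)
print(src)  # e.g. "Pegasus is mythical . <mask_1> It names the model ."
print(tgt)  # e.g. "It is pure white ."

The resulting (source, target) pairs could then be run through the tokenizer as in the snippet above.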

patrickvonplaten commented 3 years ago

@Skylixia - yes, this looks reasonable to me! In the original PEGASUS paper an additional masked-language-modeling loss was added on top of the encoder to predict the masked tokens, which would be harder to reproduce here (but should also be feasible). Overall this looks like the right approach to me!
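For completeness, here is a minimal sketch of what such an additional token-level loss on top of the encoder could look like, assuming a separate linear projection of the encoder hidden states to the vocabulary and an mlm_labels tensor that holds the original token ids at masked positions and -100 everywhere else. The encoder_mlm_head and gsg_plus_mlm_loss names are hypothetical, not part of the transformers API:

import torch.nn as nn
from transformers import PegasusConfig, PegasusForConditionalGeneration

config = PegasusConfig()
model = PegasusForConditionalGeneration(config)  # randomly initialized

# Hypothetical extra head: project encoder hidden states to the vocabulary so
# that <mask_2>-style token masking can be supervised with an MLM loss.
encoder_mlm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)

def gsg_plus_mlm_loss(input_ids, decoder_input_ids, labels, mlm_labels, alpha=1.0):
    """Combine the seq2seq (GSG) loss with an encoder-side MLM loss.
    mlm_labels has the original token id at masked positions, -100 elsewhere."""
    outputs = model(
        input_ids=input_ids,
        decoder_input_ids=decoder_input_ids,
        labels=labels,
    )
    gsg_loss = outputs.loss
    encoder_hidden = outputs.encoder_last_hidden_state       # (batch, src_len, d_model)
    mlm_logits = encoder_mlm_head(encoder_hidden)             # (batch, src_len, vocab)
    mlm_loss = nn.functional.cross_entropy(
        mlm_logits.view(-1, config.vocab_size),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    return gsg_loss + alpha * mlm_loss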

adivekar-utexas commented 3 years ago

Hi. I've been struggling with a fairly simple issue trying to get the above code to work.

Essentially, the Pegasus tokenizer's EOS token is </s> (not <eos> as used above) and it does not seem to have a BOS token. So no matter what combination I try, I keep getting a ValueError because the lengths of the labels and the decoder inputs don't match.

I tried to follow what happens in BART, but the following does not work:

from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'google/pegasus-xsum'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

input_string = ["Pegasus is mythical . <mask_1> it names the model ."]
decoder_input_string = ["<s>It is pure white . "]
labels_string = ["It is pure white .</s>"]

input_ids = tokenizer(input_string, add_special_tokens=False, return_tensors="pt").input_ids
decoder_input_ids = tokenizer(decoder_input_string, add_special_tokens=False, return_tensors="pt").input_ids
labels = tokenizer(labels_string, add_special_tokens=False, return_tensors="pt").input_ids
loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]

If I try to run this, I get Expected input batch_size (10) to match target batch_size (7). Complete stack trace:

---> 15 loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]
     16 # for _ in range(1_000):
     17 #     loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/transformers/models/pegasus/modeling_pegasus.py in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1285         if labels is not None:
   1286             loss_fct = CrossEntropyLoss()
-> 1287             masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))
   1288 
   1289         if not return_dict:

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
    959 
    960     def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 961         return F.cross_entropy(input, target, weight=self.weight,
    962                                ignore_index=self.ignore_index, reduction=self.reduction)
    963 

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   2466     if size_average is not None or reduce is not None:
   2467         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2468     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   2469 
   2470 

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2259 
   2260     if input.size(0) != target.size(0):
-> 2261         raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
   2262                          .format(input.size(0), target.size(0)))
   2263     if dim == 2:

ValueError: Expected input batch_size (10) to match target batch_size (7).
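
One way to sidestep the length mismatch, as a sketch assuming the behaviour of recent transformers versions for BART-family models, is to omit decoder_input_ids entirely and pass only labels: the model then builds the decoder inputs internally by shifting the labels to the right and prepending the decoder start token (which for Pegasus is the pad token, since there is no <s>/bos token):

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

input_string = ["Pegasus is mythical . <mask_1> it names the model ."]
labels_string = ["It is pure white ."]

# Let the tokenizer append </s> itself; Pegasus has no <s>/bos token.
input_ids = tokenizer(input_string, return_tensors="pt").input_ids
labels = tokenizer(labels_string, return_tensors="pt").input_ids

# With labels given and decoder_input_ids omitted, recent transformers versions
# derive decoder_input_ids internally by shifting the labels right and prepending
# the decoder start token (the pad token for Pegasus), so the lengths line up.
loss = model(input_ids=input_ids, labels=labels).loss
print(float(loss))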
adivekar-utexas commented 3 years ago

I have opened a new issue with complete detail (and a corrected example) here: https://github.com/huggingface/transformers/issues/11541

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ParthParikh04 commented 2 years ago

> Yeah, we don't have a script or good documentation for this yet.
>
> cc #8594 (comment)

@patrickvonplaten Any update on this? I am planning on researching abstractive summarization in a non-English language and the PEGASUS model seems to be a worthwhile model to pursue. It would be great if you could either direct me to any resources or suggest another model to pursue in my project. Thanks!

joHussien commented 3 months ago

> Yeah, we don't have a script or good documentation for this yet. cc #8594 (comment)
>
> @patrickvonplaten Any update on this? I am planning on researching abstractive summarization in a non-English language and the PEGASUS model seems to be a worthwhile model to pursue. It would be great if you could either direct me to any resources or suggest another model to pursue in my project. Thanks!

@ParthParikh04 Did you figure out a solution to this?

ParthParikh04 commented 2 months ago

> Yeah, we don't have a script or good documentation for this yet. cc #8594 (comment)
>
> @patrickvonplaten Any update on this? I am planning on researching abstractive summarization in a non-English language and the PEGASUS model seems to be a worthwhile model to pursue. It would be great if you could either direct me to any resources or suggest another model to pursue in my project. Thanks!
>
> @ParthParikh04 Did you figure out a solution to this?

Nope, unfortunately not. Please let me know if you end up finding a solution though!