huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

fine tune patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 using LED updates #9745

Closed · mmoya01 closed this issue 3 years ago

mmoya01 commented 3 years ago

I was wondering if there is any way to fine-tune the patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 model instead of patrickvonplaten/led-large-16384-pubmed. When I tried fine-tuning it in the past, I ran into the TypeError: forward() got an unexpected keyword argument 'head_mask' issue, since EncoderDecoderModel wasn't intended for Longformer. So I'm now trying to see if I can use LEDForConditionalGeneration for it, but I noticed that when I run the following:

from transformers import LEDTokenizer, LEDForConditionalGeneration

article = """(CNN)James Holmes made his introduction to the world in a Colorado cinema filled with spectators watching a midnight showing of the new Batman movie, "The Dark Knight Rises," in June 2012. The moment became one of the deadliest shootings in U.S. history. Holmes is accused of opening fire on the crowd, killing 12 people and injuring or maiming 70 others in Aurora, a suburb of Denver. Holmes appeared like a comic book character: He resembled the Joker, with red-orange hair, similar to the late actor Heath Ledger\'s portrayal of the villain in an earlier Batman movie, authorities said. But Holmes was hardly a cartoon. Authorities said he wore body armor and carried several guns, including an AR-15 rifle, with lots of ammo. He also wore a gas mask. Holmes says he was insane at the time of the shootings, and that is his legal defense and court plea: not guilty by reason of insanity. Prosecutors aren\'t swayed and will seek the death penalty. Opening statements in his trial are scheduled to begin Monday. Holmes admits to the shootings but says he was suffering "a psychotic episode" at the time,  according to court papers filed in July 2013 by the state public defenders, Daniel King and Tamara A. Brady. Evidence "revealed thus far in the case supports the defense\'s position that Mr. Holmes suffers from a severe mental illness and was in the throes of a psychotic episode when he committed the acts that resulted in the tragic loss of life and injuries sustained by moviegoers on July 20, 2012," the public defenders wrote. Holmes no longer looks like a dazed Joker, as he did in his first appearance before a judge in 2012. He appeared dramatically different in January when jury selection began for his trial: 9,000 potential jurors were summoned for duty, described as one of the nation\'s largest jury calls. Holmes now has a cleaner look, with a mustache, button-down shirt and khaki pants. In January, he had a beard and eyeglasses. If this new image sounds like one of an academician, it may be because Holmes, now 27, once was one. Just before the shooting, Holmes was a doctoral student in neuroscience, and he was studying how the brain works, with his schooling funded by a U.S. government grant. Yet for all his learning, Holmes apparently lacked the capacity to command his own mind, according to the case against him. A jury will ultimately decide Holmes\' fate. That panel is made up of 12 jurors and 12 alternates. They are 19 women and five men, and almost all are white and middle-aged. The trial could last until autumn. When jury summonses were issued in January, each potential juror stood a 0.2% chance of being selected, District Attorney George Brauchler told the final jury this month. He described the approaching trial as "four to five months of a horrible roller coaster through the worst haunted house you can imagine." The jury will have to render verdicts on each of the 165 counts against Holmes, including murder and attempted murder charges. Meanwhile, victims and their relatives are challenging all media outlets "to stop the gratuitous use of the name and likeness of mass killers, thereby depriving violent individuals the media celebrity and media spotlight they so crave," the No Notoriety group says. They are joined by victims from eight other mass shootings in recent U.S. history. 
Raised in central coastal California and in San Diego, James Eagan Holmes is the son of a mathematician father noted for his work at the FICO firm that provides credit scores and a registered nurse mother, according to the U-T San Diego newspaper. Holmes also has a sister, Chris, a musician, who\'s five years younger, the newspaper said. His childhood classmates remember him as a clean-cut, bespectacled boy with an "exemplary" character who "never gave any trouble, and never got in trouble himself," The Salinas Californian reported. His family then moved down the California coast, where Holmes grew up in the San Diego-area neighborhood of Rancho Peñasquitos, which a neighbor described as "kind of like Mayberry," the San Diego newspaper said. Holmes attended Westview High School, which says its school district sits in "a primarily middle- to upper-middle-income residential community." There, Holmes ran cross-country, played soccer and later worked at a biotechnology internship at the Salk Institute and Miramar College, which attracts academically talented students. By then, his peers described him as standoffish and a bit of a wiseacre, the San Diego newspaper said. Holmes attended college fairly close to home, in a neighboring area known as Southern California\'s "inland empire" because it\'s more than an hour\'s drive from the coast, in a warm, low-desert climate. He entered the University of California, Riverside, in 2006 as a scholarship student. In 2008 he was a summer camp counselor for disadvantaged children, age 7 to 14, at Camp Max Straus, run by Jewish Big Brothers Big Sisters of Los Angeles. He graduated from UC Riverside in 2010 with the highest honors and a bachelor\'s degree in neuroscience. "Academically, he was at the top of the top," Chancellor Timothy P. White said. He seemed destined for even higher achievement. By 2011, he had enrolled as a doctoral student in the neuroscience program at the University of Colorado Anschutz Medical Campus in Aurora, the largest academic health center in the Rocky Mountain region. The doctoral in neuroscience program attended by Holmes focuses on how the brain works, with an emphasis on processing of information, behavior, learning and memory. Holmes was one of six pre-thesis Ph.D. students in the program who were awarded a neuroscience training grant from the National Institutes of Health. The grant rewards outstanding neuroscientists who will make major contributions to neurobiology. A syllabus that listed Holmes as a student at the medical school shows he was to have delivered a presentation about microRNA biomarkers. But Holmes struggled, and his own mental health took an ominous turn. In March 2012, he told a classmate he wanted to kill people, and that he would do so "when his life was over," court documents said. Holmes was "denied access to the school after June 12, 2012, after he made threats to a professor," according to court documents. About that time, Holmes was a patient of University of Colorado psychiatrist Lynne Fenton. Fenton was so concerned about Holmes\' behavior that she mentioned it to her colleagues, saying he could be a danger to others, CNN affiliate KMGH-TV reported, citing sources with knowledge of the investigation. Fenton\'s concerns surfaced in early June, sources told the Denver station. Holmes began to fantasize about killing "a lot of people" in early June, nearly six weeks before the shootings, the station reported, citing unidentified sources familiar with the investigation. 
Holmes\' psychiatrist contacted several members of a "behavioral evaluation and threat assessment" team to say Holmes could be a danger to others, the station reported. At issue was whether to order Holmes held for 72 hours to be evaluated by mental health professionals, the station reported. "Fenton made initial phone calls about engaging the BETA team" in "the first 10 days" of June, but it "never came together" because in the period Fenton was having conversations with team members, Holmes began the process of dropping out of school, a source told KMGH. Defense attorneys have rejected the prosecution\'s assertions that Holmes was barred from campus. Citing statements from the university, Holmes\' attorneys have argued that his access was revoked because that\'s normal procedure when a student drops enrollment. What caused this turn for the worse for Holmes has yet to be clearly detailed. In the months before the shooting, he bought four weapons and more than 6,000 rounds of ammunition, authorities said. Police said he also booby-trapped his third-floor apartment with explosives, but police weren\'t fooled. After Holmes was caught in the cinema parking lot immediately after the shooting, bomb technicians went to the apartment and neutralized the explosives. No one was injured at the apartment building. Nine minutes before Holmes went into the movie theater, he called a University of Colorado switchboard, public defender Brady has said in court. The number he called can be used to get in contact with faculty members during off hours, Brady said. Court documents have also revealed that investigators have obtained text messages that Holmes exchanged with someone before the shooting. That person was not named, and the content of the texts has not been made public. According to The New York Times, Holmes sent a text message to a fellow graduate student, a woman, about two weeks before the shooting. She asked if he had left Aurora yet, reported the newspaper, which didn\'t identify her. No, he had two months left on his lease, Holmes wrote back, according to the Times. He asked if she had heard of "dysphoric mania," a form of bipolar disorder marked by the highs of mania and the dark and sometimes paranoid delusions of major depression. The woman asked if the disorder could be managed with treatment. "It was," Holmes wrote her, according to the Times. But he warned she should stay away from him "because I am bad news," the newspaper reported. It was her last contact with Holmes. After the shooting, Holmes\' family issued a brief statement: "Our hearts go out to those who were involved in this tragedy and to the families and friends of those involved," they said, without giving any information about their son. Since then, prosecutors have refused to offer a plea deal to Holmes. For Holmes, "justice is death," said Brauchler, the district attorney. In December, Holmes\' parents, who will be attending the trial, issued another statement: They asked that their son\'s life be spared and that he be sent to an institution for mentally ill people for the rest of his life, if he\'s found not guilty by reason of insanity. "He is not a monster," Robert and Arlene Holmes wrote, saying the death penalty is "morally wrong, especially when the condemned is mentally ill." "He is a human being gripped by a severe mental illness," the parents said. The matter will be settled by the jury. CNN\'s Ana Cabrera and Sara Weisfeldt contributed to this report from Denver."""

tokenizer = LEDTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LEDForConditionalGeneration.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")#.to("cuda").half()

input_ids = tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

I get strange results from that pretrained model:

considerations considerations considerations lag lag lag Sith Sith Sith miracle miracle miracle Sith Sith Metropolitan Metropolitan Metropolitan Sith SithHERHERHER miracle miracle Hurt Hurt Hurt miracle miracle Joey Joey Joey Sith Sith ticking ticking ticking memorial memorial memorial tee tee tee miracle miracle Holder Holder Holder miracle miracle raspberry raspberry raspberry Sith Sithamoamoamo Sith Sith dominate dominate dominate miracle miracleDashDashDash miracle miracle scored scored scored dominate dominate Sith Sith (* (* (* dominate dominate Joey Joey miracle miracle hide hide hide miracle miracle characteristics characteristics characteristics miracle miracletighttighttight raspberry raspberry hal hal halomeveromeveromever miracle miracle ticking ticking dominate dominate Metropolitan Metropolitan dominate dominate Dek dominate dominate AWS AWS AWS sentencing sentencing sentencingCasCasCas customer customer customer Joey Joey dominate dominatetighttight miracle miracle AWS

when I load patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 with LEDForConditionalGeneration instead of EncoderDecoderModel. Is there something I'm missing? I'd greatly appreciate any feedback or help with this.

patil-suraj commented 3 years ago

Hi @mmoya01

You can fine-tune a longformer2roberta model using the EncoderDecoderModel class. patrickvonplaten/longformer2roberta-cnn_dailymail-fp16 is already fine-tuned on cnn_dailymail, but as the model card says it was fine-tuned only as a demo, so you should fine-tune a new longformer2roberta. You can follow the training script given in the model card or refer to this notebook.

Also, in your example you are loading the longformer2roberta model with LEDForConditionalGeneration, which doesn't seem right. It should be loaded with EncoderDecoderModel.
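Roughly, something like the following untested sketch is what I mean, reusing the article string from your snippet and the Longformer tokenizer from the model card:

from transformers import LongformerTokenizer, EncoderDecoderModel

# load the checkpoint with the encoder-decoder class it was trained with
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")

input_ids = tokenizer(article, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))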

mmoya01 commented 3 years ago

hi @patil-suraj, thank you for the reply! So if I'm understanding this correctly, I would have to train a new longformer2roberta from scratch? I was trying to avoid that because the model card mentions that it took 90 hours to fine-tune roberta on CNN/DailyMail news.

The reason I was trying to use LEDForConditionalGeneration is that I wanted to fine-tune starting from longformer2roberta rather than from allenai/longformer-base-4096.

So, to fine-tune the longformer2roberta model in the past, I tried pip-installing the more_general_trainer_metric branch (given the note about the trainer) and then running:

#!/usr/bin/env python3
import logging
from io import BytesIO

import nlp
import pyarrow.parquet as pq
from nlp import arrow_dataset
from transformers import LongformerTokenizer, EncoderDecoderModel, Trainer, TrainingArguments

logging.basicConfig(level=logging.INFO)

model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096") 

# load dataset from S3 (s3_client, train_uri and test_uri are defined elsewhere in my environment)
train_bytes = s3_client.get_object(train_uri)
train = pq.read_table(BytesIO(train_bytes),columns=['reference_summary','extractive_summary'])

test_bytes = s3_client.get_object(test_uri)
test = pq.read_table(BytesIO(test_bytes),columns=['reference_summary','extractive_summary'])

train_dataset = arrow_dataset.Dataset(train)
val_dataset = arrow_dataset.Dataset(test)

# enable gradient checkpointing for longformer encoder
model.encoder.config.gradient_checkpointing = True

# set decoding params
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.max_length = 142
model.config.min_length = 56
model.config.no_repeat_ngram_size = 3
model.early_stopping = True
model.length_penalty = 2.0
model.num_beams = 4

encoder_length = 2048
decoder_length = 128*2
batch_size = 16

# map data correctly
def map_to_encoder_decoder_inputs(batch):
    # tokenizer will automatically add [BOS] <text> [EOS]
    # cut off Longformer inputs at 2048 tokens
    inputs = tokenizer(batch["extractive_summary"], padding="max_length", truncation=True, max_length=encoder_length)
    # force summarization <= 256
    outputs = tokenizer(batch["reference_summary"], padding="max_length", truncation=True, max_length=decoder_length)

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # set global attention on the first 256 tokens
    batch["global_attention_mask"] = [
        [1 if i < 128 * 2 else 0 for i in range(encoder_length)] for _ in range(len(inputs.input_ids))
    ]
    batch["decoder_input_ids"] = outputs.input_ids
    batch["labels"] = outputs.input_ids.copy()
    # mask loss for padding
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
    ]
    batch["decoder_attention_mask"] = outputs.attention_mask

    assert all([len(x) == encoder_length for x in inputs.input_ids])
    assert all([len(x) == decoder_length for x in outputs.input_ids])

    return batch

# rouge metric used in compute_metrics below
rouge = nlp.load_metric("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.eos_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

# make train dataset ready
train_dataset = train_dataset.map(
    map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["extractive_summary", "reference_summary"],
)
train_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "global_attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

# same for validation dataset
val_dataset = val_dataset.map(
    map_to_encoder_decoder_inputs, batched=True, batch_size=batch_size, remove_columns=["extractive_summary", "reference_summary"],
)
val_dataset.set_format(
    type="torch", columns=["input_ids", "global_attention_mask", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

# set training arguments - these params are not really tuned, feel free to change
training_args = TrainingArguments(
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_from_generate=True,
    evaluate_during_training=True,
    do_train=True,
    do_eval=True,
    logging_steps=100,
    save_steps=100,
    eval_steps=100,
    overwrite_output_dir=True,
    warmup_steps=200,
    save_total_limit=3,
    fp16=False,
)

# instantiate trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# start training
trainer.train()

but that gave me TypeError: forward() got an unexpected keyword argument 'head_mask', because EncoderDecoderModel did not work with Longformer, whereas LEDForConditionalGeneration does.

But I'm gathering that it is not possible to fine-tune longformer2roberta the way I can with patrickvonplaten/led-large-16384-pubmed here, right? I would have to fine-tune/create my own longformer2roberta trained on CNN/DailyMail, then fine-tune it further on my train data listed above, right? If so, should I stay away from a tokenizer/model based on roberta-base and instead use "allenai/led-base-16384" (which I think uses BART as the base model)?
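(For reference, if I did go the LED route, I assume I would start roughly like this, mirroring the led-large-16384-pubmed example; the choice of base checkpoint here is just my guess:)

from transformers import LEDTokenizer, LEDForConditionalGeneration

# hypothetical starting point if I switch from longformer2roberta to LED
tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")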

Thank you for your feedback either way, I greatly appreciate it

patrickvonplaten commented 3 years ago

Hey @mmoya01, you don't have to train it from scratch - you can "warm-start" the model from pretrained checkpoints. This blog post gives an in-depth explanation of how to do so: https://huggingface.co/blog/warm-starting-encoder-decoder
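As a rough, untested sketch of what warm-starting looks like for your case (the special-token settings below follow the blog post and are worth double-checking against your tokenizer):

from transformers import EncoderDecoderModel, LongformerTokenizer

# warm-start: pretrained Longformer encoder + pretrained RoBERTa decoder;
# the cross-attention weights are randomly initialized and learned during fine-tuning
model = EncoderDecoderModel.from_encoder_decoder_pretrained("allenai/longformer-base-4096", "roberta-base")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

# the warm-started model still needs its special tokens and vocab size configured
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size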

mmoya01 commented 3 years ago

Hi @patrickvonplaten, thank you for your reply and the blog post. I was following your notebook and trying to adapt it to the longformer2roberta-cnn_dailymail-fp16 setup using my own train_data and val_data. I was wondering: how could I warm-start from patrickvonplaten/longformer2roberta-cnn_dailymail-fp16?

I noticed I was able to do

roberta2roberta = EncoderDecoderModel.from_encoder_decoder_pretrained("allenai/longformer-base-4096", "roberta-base")

But I would love to do something like

roberta2roberta = EncoderDecoderModel.from_encoder_decoder_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")

or otherwise warm-start from the longformer2roberta-cnn_dailymail-fp16 checkpoint, if possible, rather than from allenai/longformer-base-4096. I'd greatly appreciate your feedback.
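I suspect what I actually want might simply be to load the existing seq2seq checkpoint with from_pretrained (as in my training script above) and continue training it on my own data, rather than going through from_encoder_decoder_pretrained, but I'm not sure that's the intended workflow:

from transformers import EncoderDecoderModel

# load the already fine-tuned longformer2roberta checkpoint as a whole seq2seq model,
# then keep fine-tuning it on my own data with the Trainer setup above
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")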

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.