❓ Questions and Help
Hi there,
just wanted to ask if you used `bert-large-cased` or `roberta-large` to initialize the weights of the tagger (both options are in the training script). Thanks!
Hi, we used `bert-large-cased` in the experiments.
Do you plan to release the weights for the tagger as well?
I tried to train my own tagger using your code to reproduce the results for uncontrolled summarization on cnndm, but somehow the automatically extracted keywords from the tagger look very different from the oracle keywords.
An official with France 's accident investigation | Cell phones have been collected at the site , he said but that they " had n't exploited yet .
The Palestinian Authority officially became the 123rd member of International Criminal Court on Wednesday , a step that gives court jurisdiction over alleged crimes in territories .
The organization found " positive developments worldwide , with most regions seeming to show reductions in the number of executions . | Across board exception
*I used the cnndm hyperparams from the paper for inference: `--threshold 0.25 --maximum-word 30 --summary-size 10` (see the sketch below for how I apply them).
Marseille far videos crash | video Paris clip | school
jurisdiction crimes | war crimes Israelis | Israel United opposed
world annual report death penalty | number executions worldwide | 28 compared 2013
**I preprocessed cnndm with the script provided in this repo, but the `test.oracleword` results are different from the examples in `example_dataset`.
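For reference, this is roughly how I apply those hyperparameters during extraction (a minimal sketch of my understanding, not the repo's actual code; it assumes the tagger returns a per-token keyword probability):

```python
# Minimal sketch of threshold-based keyword extraction (not the repo's
# actual implementation). `sentences` is a list of source sentences,
# each a list of (token, keyword_probability) pairs from the tagger.

def extract_keywords(sentences, threshold=0.25, max_words=30, summary_size=10):
    """Keep tokens scoring above `threshold`, from at most
    `summary_size` sentences, up to `max_words` keywords in total."""
    # Rank sentences by their best-scoring token and keep the top ones.
    ranked = sorted(sentences, key=lambda s: max(p for _, p in s), reverse=True)
    keywords = []
    for sent in ranked[:summary_size]:
        for token, prob in sent:
            if prob > threshold:
                keywords.append(token)
                if len(keywords) >= max_words:
                    return keywords
    return keywords
```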
I would greatly appreciate your input in this matter. Thank you :)
There was a problem during training where the loss made a sudden jump. I have now trained a tagger with `roberta-large` and it looks better.
French investigation crash Germanwings 9525 video board | German
Palestinian Authority 123rd member Criminal Court step court jurisdiction alleged crimes territories | Palestinians ICC Rome Statute January
world terrorism executions Amnesty International alleges annual report death penalty | worldwide 22
However, my results for uncontrolled summarization using this RoBERTa tagger + the official CTRLsum (BART) checkpoint are significantly lower than the results reported in the paper: 44.62 R1 ...
I will take a look at the preprocessing again since my results were already different for the oracle tags.
Hi, first, the BERT tagger examples you gave are very different from ours, and it seems your tagger training is somewhat problematic. For example, the training labels for the tagger do not contain any stop words, but your examples contain a lot of them -- these stop words should receive a very low score from the tagger. Can you post the training log of your BERT tagger for debugging?
Second, the updated RoBERTa tagger examples look more reasonable to me, but note that our hyperparameters for extracting the keywords were tuned on our BERT tagger, which may not be suitable for the RoBERTa model.
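If you re-tune for the RoBERTa tagger, a simple grid search over the threshold against the validation oracle keywords should work. A hypothetical sketch (reusing the `extract_keywords` sketch from earlier in this thread; none of this is code from the repo):

```python
# Hypothetical threshold tuning against oracle keywords. `tagged_docs`
# holds per-document tagger outputs in the format `extract_keywords`
# expects; `oracle_keywords` holds the gold keyword lists per document.

def keyword_f1(preds, golds):
    """Micro-averaged F1 between predicted and oracle keyword lists."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(preds, golds))
    n_pred = sum(len(set(p)) for p in preds)
    n_gold = sum(len(set(g)) for g in golds)
    prec, rec = tp / max(n_pred, 1), tp / max(n_gold, 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)

def tune_threshold(tagged_docs, oracle_keywords,
                   thresholds=(0.15, 0.20, 0.25, 0.30, 0.35)):
    # Pick the threshold whose extracted keywords best match the oracle.
    return max(
        thresholds,
        key=lambda t: keyword_f1(
            [extract_keywords(doc, threshold=t) for doc in tagged_docs],
            oracle_keywords,
        ),
    )
```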
For legal reasons I cannot release the tagger weights here; please contact me directly if you want our pretrained tagger to make reproduction easier: junxianh@cs.cmu.edu
Hi,
sure, I will attach the log output here.
The loss makes a sudden jump around step 3700 and then the model seems to be stuck in some local minimum. The probability distribution over tokens is basically uniform in the end, which probably explains why the output resembles full sentences.
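To quantify the collapse, I looked at the mean per-token entropy of the tagger's output (a small sketch, assuming the tagger emits a binary keyword probability per token; a collapsed model sits near the uniform value log 2 ≈ 0.693):

```python
import math

def mean_binary_entropy(probs, eps=1e-12):
    """Mean entropy of per-token P(keyword) values; a result near
    log(2) means the tagger's output is essentially uniform."""
    ents = [-(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))
            for p in probs]
    return sum(ents) / len(ents)

# e.g. mean_binary_entropy([0.49, 0.51, 0.50]) ~= 0.693 -> collapsed
```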
Training for the RoBERTa tagger looks fine, but you are right that I would need to tune the hyperparams myself in this case.
I will try to retrain the BERT tagger and contact you directly for the tagger weights. Thanks so much!
I just noticed that there is a random sampling step (without a defined seed) for keyword dropout 🙈: https://github.com/salesforce/ctrl-sum/blob/master/scripts/preprocess.py#L501
That explains why my `test.oracleword` looks a bit different from the example given in this repo. I guess this might influence the training + final performance of the tagger to some extent. (For my BERT tagger it was probably just an unfortunate training run.)
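A sketch of the fix for anyone who wants deterministic preprocessing (assuming the linked line draws from Python's `random` module):

```python
import random

# Sketch: seed the RNG once before the keyword-dropout sampling so that
# repeated preprocessing runs yield identical .oracleword files.
# 42 is an arbitrary example value.
random.seed(42)
```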
Thank you for sharing this! A few things I noticed:
I have an update on the mismatch in the number of validation examples. By running the script today I got 22k validation examples; it turns out that I used `valeval.seqlabel.jsonl` for validation during tagger training, which contains 31k examples. The difference between the two is whether segment spans that contain no keyword are included. It is hard to say which validation choice is actually better, but all spans should be included at prediction time.
Thanks for the detailed response and for following up on this!
> Your batch size seems to be 84 while we used 128, following the script in the repo.
Yes, I didn't have enough GPUs and had to reduce the batch size to fit into memory. I forgot to adjust `update_freq` accordingly.
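For reference, in fairseq the effective batch size is (sentences per GPU) × (number of GPUs) × `--update-freq`, so gradient accumulation can make up for fewer GPUs. A quick sanity-check sketch (the GPU counts below are just example values):

```python
# Effective batch size under fairseq-style gradient accumulation:
# effective = sentences_per_gpu * num_gpus * update_freq

def required_update_freq(target_batch, sentences_per_gpu, num_gpus):
    per_step = sentences_per_gpu * num_gpus
    assert target_batch % per_step == 0, "target not divisible by per-step batch"
    return target_batch // per_step

# Example: matching an effective batch of 128 with 4 GPUs holding
# 8 sentences each -> pass --update-freq 4.
print(required_update_freq(128, sentences_per_gpu=8, num_gpus=4))  # 4
```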
> The keyword dropout only influences BART training -- the tagger is trained without keyword dropout, so the random seed you referred to should not influence the tagger.
Right, that was my mistake.
To (4): I will test with the checkpoint with the best validation loss and check again whether the results improve. I still have to optimize the hyperparams for my tagger as well (my score is around 45 R1 now).
I have shared our pretrained BERT tagger weights with you over email, and you can refer to `scripts/test_bart.sh` as in the README for how we compute ROUGE scores. With the pretrained tagger I hope you can reproduce the results easily. I will close this issue for now, but feel free to reopen it if you still have trouble with this.
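For a quick sanity check outside that script, something like the `rouge-score` pip package gives ballpark numbers; note this is not the scoring pipeline used for the paper, so expect small differences:

```python
# Quick ROUGE sanity check with the `rouge-score` package
# (pip install rouge-score). NOT the official scoring pipeline;
# use scripts/test_bart.sh for paper-comparable numbers.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the palestinian authority joined the icc on wednesday .",
    prediction="palestinian authority becomes 123rd member of the icc .",
)
print({name: round(s.fmeasure, 4) for name, s in scores.items()})
```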
I could reproduce the results with your BERT tagger weights.
Thanks for the help 😀