nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/
The Unlicense

See how replacing random weights with pretrained and fine-tuned weights in MATCH affects performance #72

Open bruffridge opened 2 years ago

bruffridge commented 2 years ago

"one of my other major bottlenecks is pretraining weights – I’ve been training MATCH from random weight initializations every time, whereas with models like GPT-2 people just take the pretrained weights and finetune them to get state-of-the-art results. So I’ll look into either finding a way to start from pretrained MATCH weights, or finetuning GPT-2 or some such model." - Eric

pjuangph commented 2 years ago

According to @elkong the weights were empty for the original authors. This probably didn't affect them that much since their dataset was large. For us, I think we need to take a BERT or GPT model, expand its pretraining data with our dataset and new vocabulary so that it understands the relationships between these words. Then we surgically add it to MATCH.
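A rough sketch of what that continued-pretraining step could look like, using the Hugging Face `transformers` and `datasets` APIs. The checkpoint, file name, and hyperparameters here are placeholders, not settings anyone has validated on PeTaL:

```python
# Continued masked-language-model pretraining of a BERT-style encoder on
# domain text, before splicing it into MATCH.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "allenai/scibert_scivocab_uncased"  # or any other BERT-style checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

# "petal_abstracts.txt" is a placeholder: one paper title + abstract per line.
dataset = load_dataset("text", data_files={"train": "petal_abstracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens so the model keeps learning to fill in the blanks
# on biomimicry vocabulary.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scibert-petal",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

# Save the adapted encoder; this is the piece that would later be grafted into MATCH.
model.save_pretrained("scibert-petal")
tokenizer.save_pretrained("scibert-petal")
```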

bruffridge commented 2 years ago

This discussion I had with the author might be relevant:

On Friday, June 11, 2021 at 11:26 AM Yu Zhang wrote:

(1) We've also noticed the SPECTER paper recently, and we are trying to generalize its idea from citation links to different types of metadata. Sorry that we haven't made a direct comparison between MATCH and SPECTER. This will need some modification on the SPECTER code because SPECTER uses metadata for general LM fine-tuning and then performs single-label classification with text only as a downstream task. We will consider doing that comparison later. Thank you for mentioning this!

(2) One problem of applying BERT-based models here is their ability to deal with metadata. Because of the limited vocabulary, the tokenizer of SciBERT (or other BERT-based models) will split author names and reference paper IDs into meaningless subwords in most cases. I feel if one uses a model that can deal with metadata input (e.g., OAG-BERT, https://arxiv.org/abs/2103.02410, https://github.com/thudm/oag-bert), it might be helpful.

Best, Yu

On Fri, Jun 11, 2021 at 8:07 AM Ruffridge, Brandon wrote:

Hello,

Just curious if you’ve compared the performance of MATCH with SPECTER (pdf, github) for multi-label text classification. Also do you think adding SciBERT which has been trained on scientific literature (pdf, github) to MATCH would improve performance?
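As a quick illustration of the subword problem Yu describes, here is a small snippet showing how a SciBERT tokenizer handles metadata tokens such as author names or paper IDs (the example strings are made up, and the exact subword pieces depend on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

# Metadata such as author names and reference IDs is out-of-vocabulary, so the
# WordPiece tokenizer shatters it into fragments that carry little meaning.
for text in ["photosynthesis", "Ruffridge", "MAG_2156372801"]:
    print(text, "->", tokenizer.tokenize(text))
```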

bruffridge commented 2 years ago

related to #24

bruffridge commented 2 years ago
  1. Transfer Learning

A conventional way to train a high-performance language model when domain-specific data is scarce is transfer learning. In this approach, we begin with a version of the language model, such as BERT or GPT-2, already trained on a vast amount of general-purpose text. We freeze the first several transformer layers so that their weights are not updated when gradients are backpropagated through the network, and we train the model on our domain-specific data. The intuition behind transfer learning is that the parameters of the first several transformer layers already encode a general understanding of language. Transfer learning has allowed models to achieve state-of-the-art performance in specialized domains.

MATCH is a language model whose architecture includes a series of transformer layers. The authors note that they achieved high scores on MATCH's metrics despite a cold-start initialization (i.e., without pre-training MATCH's weights) because of the vast size of their training dataset. Because MATCH is less widely used, few sets of weights pre-trained on large datasets are available for off-the-shelf transfer learning. However, it has been trained on two datasets: MAG-CS in the computer science domain and MeSH in the biomedical domain. Although biomimicry and biomedicine are far from the same field, their vocabularies overlap to a much greater extent than those of biomimicry and computer science. To train MATCH on PeTaL using transfer learning, then, it may be possible to load the weights of MATCH pre-trained on MeSH, freeze the parameters of the first several transformer layers, and fine-tune the remaining parameters. A sketch of that freezing strategy is shown below.
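A minimal PyTorch sketch of the freezing strategy follows. It assumes a MATCH-like model object that exposes its transformer stack as an iterable attribute (here called `model.transformers`) and a MeSH-pretrained checkpoint on disk; the builder function, attribute, and file names are hypothetical, since the real ones depend on the MATCH codebase.

```python
import torch

N_FROZEN = 2  # how many of the earliest transformer layers to freeze

# Placeholders: build_match_model/config and "match_mesh_checkpoint.pt" stand in
# for however the MATCH codebase constructs the model and stores its weights.
model = build_match_model(config)
state = torch.load("match_mesh_checkpoint.pt", map_location="cpu")
model.load_state_dict(state, strict=False)  # reuse the MeSH-pretrained weights

# Freeze the first N transformer layers so backpropagation leaves them untouched.
for layer in list(model.transformers)[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False

# Hand only the still-trainable parameters to the optimizer for fine-tuning on PeTaL.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```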