ivalexander13 opened this issue 3 years ago
Sina and I finally got our data reformatted after a couple of hours, in mar12_NER/20210326_set_up_NER_runs_with_dividers.ipynb -- the data was saved to data/ner/chemprot_sub_enzyme/clean/{dev, train, test}.txt.
We ran it yesterday but keep getting low F1s, so I'm going to start looking into whether we can reuse bits and pieces of the SciBERT model to include class_weights - more coming
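If we do go the class_weights route, one simple way to get the weights is inverse tag frequency over the training split. A minimal sketch, assuming a CoNLL-style token-per-line format with the BIO tag as the last whitespace-separated column (the path and format are assumptions, not necessarily what the notebook produces):

```python
# Rough sketch (not the final implementation): inverse-frequency class weights
# from the BIO tags in the reformatted training file. Assumes one token per line,
# tag in the last column, blank lines between sentences -- adjust if our format differs.
from collections import Counter

def tag_weights(path="data/ner/chemprot_sub_enzyme/clean/train.txt"):
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # sentence boundary
            counts[line.split()[-1]] += 1  # last column = BIO tag
    total = sum(counts.values())
    weights = {tag: total / n for tag, n in counts.items()}
    # normalize so the most common tag (usually "O") gets weight ~1
    base = weights.get("O", max(weights.values()))
    return {tag: w / base for tag, w in weights.items()}

print(tag_weights())
```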
how we ran it for testing (didn't want to use compute hours):
source activate /global/home/groups/fc_igemcomp/software/scibert_env_NER
cd fc_igemcomp/2020_nlp/scibert
rm -R scripts/NER_output_26mar/
./scripts/train_allennlp_local_v3_NER_trial.sh ./scripts/NER_output_26mar/
creating new kernel:
source activate ~/fc_igemcomp/software/scibert_env_NER
# very important! # can also use conda
(had to install ipykernel: conda install -p /global/home/groups/fc_igemcomp/software/scibert_env_NER ipykernel)
python -m ipykernel install --user --name python3.6.13_ner_scibert --display-name "Python 3.6.13 (scibert_env_NER)"
# display name is what will show
6:20pm: issue with the IProgress module (ipywidgets), so ran these:
conda install -c conda-forge ipywidgets
jupyter nbextension enable --py widgetsnbextension
(cool! now TQDM works in-notebook)
okay, here's my plan / what I want to do:
ugh, we need to modify the loss if we want the model to actually LEARN these weights, though
train: scripts/0403_train_allennlp_local_NER_few_epochs.sh scripts/NER_output_3apr/
oof, okay, switching to local to make changes to AllenNLP - will try to set up a similar file structure on Savio and sync via GitHub
ah sike - we realized it's not bert_text_classifier that's used for the NER task, but rather the bert_crf_tagger.py file - will see if we can modify that to use class weights instead!
https://github.com/kmkurn/pytorch-crf/issues/47 is helpful; files to modify include ner_finetune.json, the AllenNLP CRF class, and the bert_crf_tagger.py file.
did some more digging into how people have handled imbalanced-data issues in AllenNLP before. Seems like there is no generalized solution, according to this thread.
Mrunali's and my experiments with directly modifying the weights haven't made a big difference to performance so far; we might be missing something, though.
looking into modifying CRFs to be weighted: a mathy paper that basically says we should compute a double sum for the loss so we can weight the classes (https://perso.uclouvain.be/michel.verleysen/papers/ieeetbe12gdl.pdf). Seems to have kind of decent results? Hadn't thought about L1 regularization.
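For reference, the general shape of the class-weighted loss being discussed (my paraphrase of the idea, not necessarily the paper's exact formulation): each token's negative log-likelihood gets multiplied by a weight for its gold tag, summed over sentences and positions:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N}\sum_{t=1}^{T_i} w_{y_{i,t}} \log p_\theta\left(y_{i,t} \mid x_i\right)$$

where $N$ is the number of sentences, $T_i$ the length of sentence $i$, and $w_{y_{i,t}}$ the class weight of the gold tag at position $t$.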
from https://github.com/allenai/allennlp/issues/4619: someone said "I mean, I believe it can work in practice, but their theoretical motivation is not correct. If this is the case, we could do it with a much simpler approach (like weighted emission scores)." which is what we did...: https://github.com/tensorflow/addons/issues/817
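A minimal sketch of the "weighted emission scores" idea using the pytorch-crf package (illustrative only: the tag set, weights, and shapes are placeholders, and our actual change lives in AllenNLP's bert_crf_tagger.py rather than in this package):

```python
# Sketch of class-weighted emission scores feeding a CRF (pytorch-crf).
# Illustrative only -- our real edit is to AllenNLP's bert_crf_tagger.py.
import torch
from torchcrf import CRF

num_tags = 5                      # e.g. O, B-SUB, I-SUB, B-ENZ, I-ENZ (assumed tag set)
batch, seq_len = 2, 10

crf = CRF(num_tags, batch_first=True)
emissions = torch.randn(batch, seq_len, num_tags)     # stand-in for BERT logits
tags = torch.randint(num_tags, (batch, seq_len))
mask = torch.ones(batch, seq_len, dtype=torch.bool)

# Per-class weights (rare entity tags up-weighted relative to "O").
class_weights = torch.tensor([1.0, 5.0, 5.0, 5.0, 5.0])

# Scale each tag's emission score before the CRF computes its log-likelihood.
weighted_emissions = emissions * class_weights         # broadcast over (batch, seq, tags)

loss = -crf(weighted_emissions, tags, mask=mask, reduction="mean")
loss.backward()
```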
okay, I'm just going to keep a running list of updates in this comment on other comments/potential implementations
{in any case can you tell how much fun I'm having with GitHub issues lmao}
This textbook chapter from my NLP class actually goes over what we have concluded as being a good approach to solving this problem which I thought was validating (i.e. NER/Relation Extraction + semi-supervised approach) https://web.stanford.edu/~jurafsky/slp3/17.pdf
Is the semi-supervised approach the approach you're/they're thinking of? It does seem really cool and it seems to have a decent track record, though we'd probably need to rewrite a lot of code. Do you think this is something worth pursuing?
Yeah take a look at 17.2.4 in there (distant supervision for relation extraction). It sounds very similar to the pattern recognition technique we've been talking about, except it learns non-regex patterns for features (or aggregates data to be fed into NN directly without extracting features beforehand). Problem is that it generally has low precision, which is similar to the other paper we read using pattern matching, so not sure what the best solution is for us.
Trying to rebalance the data (with the 12apr/20210412 notebook + script) so as to remove any sentences without entities/labels of interest, but the F1 does not change considerably :(
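For reference, the kind of filter this boils down to (a sketch, not the actual 20210412 script; assumes the same token-per-line format with the BIO tag in the last column, and the output path is made up):

```python
# Sketch of the rebalancing filter: drop sentences whose tags are all "O".
# Not the actual 20210412 script; assumes token-per-line CoNLL-style input
# with blank lines between sentences and the BIO tag in the last column.
def filter_sentences(in_path, out_path):
    sentences, current = [], []
    with open(in_path) as f:
        for line in f:
            if line.strip():
                current.append(line)
            elif current:
                sentences.append(current)
                current = []
        if current:
            sentences.append(current)

    kept = dropped = 0
    with open(out_path, "w") as out:
        for sent in sentences:
            tags = [l.split()[-1] for l in sent]
            if any(t != "O" for t in tags):   # keep only sentences with an entity of interest
                out.writelines(sent + ["\n"])
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept}, dropped {dropped}")

filter_sentences("data/ner/chemprot_sub_enzyme/clean/train.txt",
                 "data/ner/chemprot_sub_enzyme/rebalanced/train.txt")
```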
Praise Ivan, who modified a Hugging Face implementation (in his scratch folder, /global/scratch/ivalexander13/NLPChemExtractor/scibert-text-classification/main.ipynb, but also in /global/home/groups/fc_igemcomp/2020_nlp/scibert/apr16_huggingface_NER)
revised TODOs:
HuggingFace NER:
- try further regularization: dropout + early stopping - don't use loss for model selection; use F1 + AUC/ROC (https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5) to pick the desired threshold (Ivan) - maybe this is already supported in the Hugging Face library?
- look into playing with the loss: weights? https://huggingface.co/transformers/training.html, huggingface/transformers#7024 (rough sketch below)
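A minimal sketch of what the weighted-loss idea could look like on the Hugging Face side, combined with F1-based early stopping (the weight values, tag count, model name, and output dir are placeholders, not settled choices):

```python
# Sketch: class-weighted loss + F1-based early stopping with Hugging Face Trainer.
# Weights, model name, datasets, and output_dir are placeholders.
import torch
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments, EarlyStoppingCallback)

class_weights = torch.tensor([1.0, 5.0, 5.0, 5.0, 5.0])  # per-tag weights (assumed tag set)

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # CrossEntropyLoss ignores label -100 by default (HF's padding convention)
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=len(class_weights))

args = TrainingArguments(
    output_dir="ner_weighted",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # requires a compute_metrics that reports "f1"
)

trainer = WeightedTrainer(
    model=model, args=args,
    # train_dataset=..., eval_dataset=..., compute_metrics=...,  # to be filled in
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```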
HuggingFace QA:
- there's no good RE implementation in Hugging Face, so maybe QA is better? https://huggingface.co/transformers/usage.html - a brief test suggests it's maybe not the best with a normal BERT model, but we could integrate SciBERT (quick sketch below)
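For context, the kind of quick probe this refers to (a sketch; the question phrasing and example sentence are made up, and swapping in SciBERT would need a QA fine-tuning step since the base checkpoint has no QA head):

```python
# Sketch of a quick QA-style probe with the default Hugging Face QA pipeline.
# The question/context are made up; using SciBERT here would require first
# fine-tuning it with a QA head (the base checkpoint doesn't have one).
from transformers import pipeline

qa = pipeline("question-answering")  # default SQuAD-fine-tuned model

context = ("Alcohol dehydrogenase catalyzes the oxidation of ethanol "
           "to acetaldehyde.")
print(qa(question="Which enzyme acts on ethanol?", context=context))
print(qa(question="What is the substrate of alcohol dehydrogenase?", context=context))
```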
I'm working on this at #26
Overview
We are doing this to compare SciBERT's performance on NER relative to text classification. SciBERT didn't provide a chemprot dataset for NER, so we are using the chemprot dataset straight from its source (link here?) and formatting it to fit the model's NER task.
Attempt (ongoing)
We are in the middle of converting the source chemprot dataset: doing part-of-speech tagging on each word and connecting the relevant entities (substrate, product, and enzyme). A rough sketch of the conversion step is below.
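A minimal sketch of the shape of that conversion (illustrative only; the notebook's actual extraction of substrate/product/enzyme spans from ChemProt is more involved, and the labels, spans, and output format here are assumptions):

```python
# Sketch of the conversion step: tokenize, POS-tag, and emit CoNLL-style
# "token<TAB>POS<TAB>BIO-tag" lines. Illustrative only.
import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def to_conll(sentence, entity_spans):
    """entity_spans: list of (token_start, token_end, label), e.g. (0, 1, "ENZ") -- assumed shape."""
    tokens = word_tokenize(sentence)
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    lines = [f"{tok}\t{pos}\t{tag}"
             for (tok, pos), tag in zip(pos_tag(tokens), tags)]
    return "\n".join(lines) + "\n"   # blank line separates sentences

print(to_conll("Hexokinase phosphorylates glucose.", [(0, 1, "ENZ"), (2, 3, "SUB")]))
```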
Plans
We will do the full 75-epoch training on this dataset and see how it performs.