igematberkeley / NLPChemExtractor


SciBERT - Text Classification #7

Open ivalexander13 opened 3 years ago

ivalexander13 commented 3 years ago

Overview

We found that SciBERT works better than BioBERT on text classification tasks. The SciBERT repo ships a ready-made text classification task on the ChemProt dataset, which pairs sentences with chemical-protein relationship labels (product-of, antagonist, etc.).

Attempt

We generated an alternative dataset derived from the given ChemProt dataset: we masked out all the non-relevant relationships and kept only the relevant ones (substrate, product-of, substrate-product-of). We then trained the model on both datasets for a full 75 epochs. A sketch of the masking step is below.
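For reference, a minimal sketch of the masking step, assuming the ChemProt split files are JSONL with `text` and `label` fields. The file paths and label strings here are placeholders, not necessarily the exact ones in the repo:

```python
import json

# Labels we want to keep; everything else is masked out.
# (These label strings are assumptions -- match them to the actual files.)
KEEP_LABELS = {"SUBSTRATE", "PRODUCT-OF", "SUBSTRATE_PRODUCT-OF"}

def filter_chemprot(in_path: str, out_path: str) -> None:
    """Copy only the examples whose relation label is in KEEP_LABELS."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            example = json.loads(line)
            if example["label"] in KEEP_LABELS:
                fout.write(json.dumps(example) + "\n")

# Hypothetical paths for the train split; repeat for dev/test.
filter_chemprot("chemprot/train.jsonl", "chemprot_masked/train.jsonl")
```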

Results

We trained on both ChemProt datasets, and the model performed poorly on each. Precision, recall, and F1 plateaued at low values around epoch 5 and never improved over the remaining 70 epochs. The scores for some labels were also zero, which suggests those labels were never predicted (zero true positives).
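One quick way to confirm which labels are never predicted is a per-label report from scikit-learn. A minimal sketch, with toy `y_true`/`y_pred` lists standing in for real validation outputs:

```python
from sklearn.metrics import classification_report

# Placeholder predictions illustrating a model that collapses to one class;
# in practice these come from a validation pass over the trained model.
y_true = ["SUBSTRATE", "PRODUCT-OF", "SUBSTRATE", "PRODUCT-OF"]
y_pred = ["SUBSTRATE", "SUBSTRATE", "SUBSTRATE", "SUBSTRATE"]

# Per-label precision/recall/F1; an all-zero row means the model never
# predicted that label (zero true positives), matching what we observed.
# zero_division=0 suppresses the divide-by-zero warnings for those rows.
print(classification_report(y_true, y_pred, zero_division=0))
```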

The Issue

We're trying to figure out exactly why this performance problem exists and how to fix it. The options are either to dig into the AllenNLP model code and debug it from there, or to reverse-engineer a text classifier with Hugging Face's transformers.
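If we go the Hugging Face route, a minimal sketch of loading SciBERT as a sequence classifier. The checkpoint name is the public SciBERT release; `num_labels=3` assumes the masked three-relation dataset, so treat this as a starting point rather than our current setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public SciBERT checkpoint with a fresh classification head on top.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=3
)

# One forward pass on a toy sentence to sanity-check shapes before
# wiring up a Trainer or a manual training loop.
inputs = tokenizer("Compound X is a substrate of enzyme Y.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 3])
```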

mrunalimanj commented 3 years ago

Some updates:

We tried adding weights to the CrossEntropyLoss, but realized there's no negative label in this input dataset... so we may need to reformat the dataset to include some number of negative entries, and then we can try optimizing with different weights.
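For once we do have negative entries, a minimal sketch of weighted `CrossEntropyLoss` in PyTorch; the fourth class and its weight are hypothetical placeholders for the added negative label:

```python
import torch
import torch.nn as nn

# Hypothetical 4-class setup: the 3 relevant relations plus an added
# NEGATIVE class. The weights are placeholders -- e.g. something like
# inverse class frequency, down-weighting the (abundant) negative class.
class_weights = torch.tensor([1.0, 1.0, 1.0, 0.25])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# Dummy batch: logits of shape (batch, num_classes) and integer targets.
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
print(loss_fn(logits, targets))
```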