nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/
The Unlicense

Look into using MATCH to improve the labeler #42

Closed by bruffridge 3 years ago

bruffridge commented 3 years ago

https://github.com/yuzhimanhua/MATCH

https://arxiv.org/pdf/2102.07349.pdf

bruffridge commented 3 years ago

We may be able to include MAG and MeSH labels as metadata to improve performance: https://github.com/yuzhimanhua/MATCH/issues/3

bruffridge commented 3 years ago

Subtask: https://github.com/nasa-petal/PeTaL-labeller/issues/43

bruffridge commented 3 years ago

We'll have to verify that adding author and venue metadata helps (not hurts) performance.

MATCH uses citations, venue, and authors as metadata.

SPECTER uses only citations. Here's the SPECTER authors' explanation for leaving out authors and venues:

"More surprisingly, adding authors as an input (along with title and abstract) hurts performance. One possible explanation is that author names are sparse in the corpus, making it difficult for the model to infer document-level relatedness from them. As another possible reason of this behavior, tokenization using Wordpieces might be suboptimal for author names. Many author names are out-of-vocabulary for SciBERT and thus, they might be split into sub-words and shared across names that are not semantically related, leading to noisy correlation. Finally, we find that adding venues slightly decreases performance, except on document classification (which makes sense, as we would expect venues to have high correlation with paper topics)."
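One way to check whether each kind of metadata helps or hurts is a small ablation: train and evaluate once per subset of metadata fields and compare the scores. The sketch below only generates the dataset variants. The field names (`author`, `venue`, `reference`, `text`, `label`) and the sample record are assumptions for illustration and may not match the actual MATCH input format or our dataset.

```python
from itertools import combinations

# Hypothetical metadata field names; adjust to the real dataset schema.
METADATA_FIELDS = ("author", "venue", "reference")


def metadata_subsets(fields=METADATA_FIELDS):
    """Yield every subset of the metadata fields, from none to all."""
    for size in range(len(fields) + 1):
        for combo in combinations(fields, size):
            yield combo


def ablate(record, keep):
    """Return a copy of record with every metadata field not in `keep` removed.

    Non-metadata fields (for example text and labels) are always kept.
    """
    return {
        key: value
        for key, value in record.items()
        if key not in METADATA_FIELDS or key in keep
    }


# Made-up record, for illustration only.
sample = {
    "text": "gecko adhesion on rough surfaces",
    "label": ["attach"],
    "author": ["A1"],
    "venue": "V1",
    "reference": ["R1", "R2"],
}

# One dataset variant per metadata subset (8 variants for 3 fields).
variants = {keep: ablate(sample, keep) for keep in metadata_subsets()}
```

Each variant would then be run through the same training and evaluation pipeline, so that any change in the scores can be attributed to the metadata that was included.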

bruffridge commented 3 years ago

Log metrics such as precision, recall, F-score, and a confusion matrix to compare the effects of different algorithms, parameters, datasets, and metadata.
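As a rough sketch of what that logging could compute, the snippet below derives micro-averaged precision, recall, and F1 for multi-label predictions, plus per-label confusion counts (one 2x2 matrix per label). It uses only the standard library. The label names and example predictions are made up for illustration, and micro-averaging is just one possible choice of averaging.

```python
from collections import Counter


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over multi-label predictions."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        true_set, pred_set = set(true_labels), set(pred_labels)
        tp += len(true_set & pred_set)   # predicted and correct
        fp += len(pred_set - true_set)   # predicted but wrong
        fn += len(true_set - pred_set)   # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def per_label_confusion(y_true, y_pred):
    """Count tp/fp/fn/tn per label, treating each label as a binary task."""
    labels = sorted({l for labs in list(y_true) + list(y_pred) for l in labs})
    counts = {label: Counter() for label in labels}
    for true_labels, pred_labels in zip(y_true, y_pred):
        true_set, pred_set = set(true_labels), set(pred_labels)
        for label in labels:
            in_true, in_pred = label in true_set, label in pred_set
            if in_true and in_pred:
                counts[label]["tp"] += 1
            elif in_pred:
                counts[label]["fp"] += 1
            elif in_true:
                counts[label]["fn"] += 1
            else:
                counts[label]["tn"] += 1
    return counts


# Made-up labels and predictions, for illustration only.
y_true = [["attach", "move"], ["protect"], ["move"]]
y_pred = [["attach"], ["protect", "move"], ["move"]]

scores = micro_prf(y_true, y_pred)
confusion = per_label_confusion(y_true, y_pred)
```

Logging `scores` and `confusion` alongside the algorithm, parameters, dataset, and metadata settings for each run would make the runs directly comparable.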

elkong commented 3 years ago

Okay, will look into this! This is new to me, so what else do we know already about MATCH vs. SPECTER vs. other approaches?

bruffridge commented 3 years ago

@dsmith111 finished reformatting our labelled dataset for MATCH training/testing. The JSON file is here: https://github.com/nasa-petal/PeTaL-labeller/blob/main/scripts/lens-cleaner/cleaned_lens_output.json