nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/

Look into how we might use SPECTER to improve our labeller #41

Open bruffridge opened 3 years ago

bruffridge commented 3 years ago

They talked a lot about using the SPECTER embeddings and doing nearest-neighbor search. It involved using https://allenai.org/data/s2orc, which has 100,000,000 papers; if they only used the ones with PubMed identifiers, it would be "only" 20,000,000. With the resulting embeddings, they were thinking of using simple classification methods like logistic regression (a sketch of that pipeline follows the links below).

https://arxiv.org/pdf/2004.07180.pdf

https://github.com/allenai/specter

https://huggingface.co/allenai/specter
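A minimal sketch of that idea, assuming a PyTorch + transformers + scikit-learn stack. The example papers and labels below are made-up placeholders, not PeTaL data: embed each title + abstract with the `allenai/specter` model from the Hugging Face hub, then fit a simple logistic-regression classifier on the frozen embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for our labelled corpus: title/abstract pairs and
# their biomimicry-function labels (placeholders, not the real PeTaL taxonomy).
papers = [
    {"title": "Gecko-inspired adhesives", "abstract": "We study setae-based adhesion ..."},
    {"title": "Lotus-effect coatings", "abstract": "Superhydrophobic surfaces that self-clean ..."},
]
labels = ["attach", "protect from liquids"]

# Load SPECTER as published on the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

# SPECTER takes "title [SEP] abstract" as input; the paper embedding is the
# [CLS] token of the last hidden layer.
title_abs = [p["title"] + tokenizer.sep_token + (p.get("abstract") or "") for p in papers]
inputs = tokenizer(title_abs, padding=True, truncation=True,
                   return_tensors="pt", max_length=512)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0, :]  # (n_papers, 768)

# Simple downstream classifier on top of the frozen embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(embeddings.numpy(), labels)
```

With real data this would of course need many more papers per label; the point is just that SPECTER stays frozen and the only trained piece is the lightweight classifier.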

bruffridge commented 3 years ago

According to the SPECTER paper, SPECTER outperforms SciBERT (even fine-tuned SciBERT) at text classification.

A paper’s title and abstract provide rich semantic content about the paper, but, as we show in this work, simply passing these textual fields to an “off-the-shelf” pretrained language model—even a state-of-the-art model tailored to scientific text like the recent SciBERT (Beltagy et al., 2019)—does not result in accurate paper representations. The language modeling objectives used to pretrain the model do not lead it to output representations that are helpful for document-level tasks such as topic classification or recommendation.

We specifically use citations as a naturally occurring, inter-document incidental supervision signal indicating which documents are most related and formulate the signal into a triplet-loss pretraining objective. Unlike many prior works, at inference time, our model does not require any citation information.

SPECTER still outperforms a SciBERT model fine-tuned on the end tasks as well as their multitask combination, further demonstrating the effectiveness and versatility of SPECTER.

SPECTER embeddings are based on only the title and abstract of the paper. Adding the full text of the paper would provide a more complete picture of the paper’s content and could improve accuracy (Cohen et al., 2010; Lin, 2008; Schuemie et al., 2004). However, the full text of many academic papers is not freely available. Further, modern language models have strict memory limits on input size, which means new techniques would be required in order to leverage the entirety of the paper within the models. Exploring how to use the full paper text within SPECTER is an item of future work.
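For reference, the citation-based triplet objective quoted above boils down to a margin loss over embedding distances. This is a generic sketch, not the authors' training code; it assumes L2 distance, a margin of 1, and 768-d SPECTER-style [CLS] embeddings for a query paper, a paper it cites (positive), and an unrelated paper (negative).

```python
import torch

def triplet_loss(query, positive, negative, margin=1.0):
    """Citation-based triplet objective in the spirit of the SPECTER paper:
    pull a paper's embedding toward a paper it cites (positive) and push it
    away from a non-cited paper (negative), up to a margin."""
    d_pos = torch.norm(query - positive, p=2, dim=-1)  # distance to cited paper
    d_neg = torch.norm(query - negative, p=2, dim=-1)  # distance to negative sample
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# Toy usage with random stand-ins for 768-d paper embeddings.
q, p, n = torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768)
loss = triplet_loss(q, p, n)
```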

pjuangph commented 3 years ago

Thanks, I'm looking into this and trying out their code. I'll let you know.

pjuangph commented 3 years ago

I looked into SPECTER and it seems buggy. I filed an issue on their GitHub: https://github.com/allenai/specter/issues/27

@hschilling @bruffridge @ARalevski @CkUnsworth To use SPECTER with our dataset, we need to pull the citations for every document we are using. Instead of just the title and abstract, we need another file that maps each document's UUID to the UUIDs of the documents it cites. See the example here: https://github.com/allenai/specter#How-to-reproduce-our-results
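A rough sketch of what assembling those files could look like, assuming we already have each document's UUID, title, abstract, and list of cited UUIDs. The `our_docs` structure and the `count` convention below are my assumptions; the linked README section is the authoritative description of the expected `data.json` / `metadata.json` format.

```python
import json

# Hypothetical view of our corpus: each document's UUID, title/abstract, and
# the UUIDs of the papers it cites. Pulling these citation lists (e.g. from
# S2ORC) is the extra work described above.
our_docs = {
    "doc-001": {"title": "Gecko adhesion", "abstract": "...", "cites": ["doc-002"]},
    "doc-002": {"title": "Setae mechanics", "abstract": "...", "cites": []},
}

# metadata.json: document id -> title/abstract.
metadata = {doc_id: {"title": d["title"], "abstract": d["abstract"]}
            for doc_id, d in our_docs.items()}

# data.json: document id -> {cited document id: {"count": ...}}. A direct
# citation is marked with count 5 here; check their README for the exact
# convention (e.g. how citations-of-citations are encoded).
data = {doc_id: {cited: {"count": 5} for cited in d["cites"]}
        for doc_id, d in our_docs.items()}

with open("metadata.json", "w") as f:
    json.dump(metadata, f)
with open("data.json", "w") as f:
    json.dump(data, f)
```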

There are some bugs with SPECTER. I'll try to get training to work, but we may need to contact the authors to find out how to fix the issue above.

hschilling commented 3 years ago

Thanks for looking into that.