nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/

Explore tokenizer optimization strategies to improve precision/recall. #84

Open bruffridge opened 3 years ago

bruffridge commented 3 years ago

Optimization Strategies:

  1. Paht's research indicates there may be a 25,000-token limit in SciBERT/Hugging Face. Is this limitation present in MATCH's tokenizer? Are 25,000 tokens enough to cover the vocabulary we are dealing with? Would increasing this number improve precision/recall? (See the vocabulary check sketched after this list.)

  2. See whether using stemming/lemmatization to reduce the number of distinct tokens improves precision/recall (see the lemmatization sketch after this list).

  3. Input sequences are truncated to 500 tokens in MATCH. This may result in truncated abstracts (not an issue in our current dataset; see Eric's comment below).
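
A quick way to ground item 1 is to inspect the tokenizer's vocabulary directly. Here is a minimal sketch using Hugging Face `transformers`; the `allenai/scibert_scivocab_uncased` checkpoint is an assumption about which SciBERT variant is in use, and MATCH builds its own vocabulary during preprocessing, so its limit would have to be read from MATCH's generated vocab file rather than from this tokenizer.

```python
# Minimal sketch (assumes the allenai/scibert_scivocab_uncased checkpoint):
# report the SciBERT WordPiece vocabulary size and spot-check how much of a
# sample sentence falls back to [UNK], to see whether a ~25,000-token cap
# is actually binding for our corpus.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

print("base vocabulary size:", tokenizer.vocab_size)
print("with added special tokens:", len(tokenizer))

sample = "Biomimetic riblet microstructures reduce hydrodynamic drag on shark skin."
tokens = tokenizer.tokenize(sample)
unk_rate = tokens.count(tokenizer.unk_token) / max(len(tokens), 1)
print(tokens)
print("UNK rate:", unk_rate)
```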
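
For item 2, a minimal sketch of the preprocessing step, using NLTK (an assumed dependency; spaCy would work just as well). It only shows how stemming/lemmatization collapses surface forms into fewer distinct tokens; whether that actually improves precision/recall would have to be measured end to end.

```python
# Minimal sketch: collapse inflected surface forms before tokenization so that
# "adheres" / "adhesive" / "adhesion"-style variants map to fewer distinct
# tokens. NLTK is an assumed dependency (pip install nltk).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lemmatizer lookup data
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "Gecko feet adhere to smooth surfaces using hierarchical adhesive structures"
words = text.lower().split()

print("stemmed:   ", [stemmer.stem(w) for w in words])
print("lemmatized:", [lemmatizer.lemmatize(w) for w in words])
```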

Eric: MATCH pads/truncates its input token sequences to a default of 500 tokens (see truncate_text in auto-labeler/MATCH/src/MATCH/deepxml/data_utils.py); my scripts don't change that. My intuition is that we don't need the full abstract, and that the first 500 tokens (minus the metadata tokens) should be enough.

Eric: Yes, the full abstract would be strictly more useful than only part of it. However, I think the difference is negligible in our dataset: only 2 of the 1,149 papers in my training and test sets have a token sequence (metadata + title + abstract) longer than 500 tokens.
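
A minimal sketch of the length audit Eric describes, under assumptions about field names and whitespace tokenization (MATCH's truncate_text may count and cut tokens differently): count how many papers' metadata + title + abstract exceed the 500-token budget, and show the truncation that would otherwise happen silently.

```python
# Minimal sketch of the length audit described above. Field names and
# whitespace tokenization are assumptions; MATCH's own truncate_text may
# count and cut tokens differently.
MAX_LEN = 500

def to_tokens(paper: dict) -> list:
    """Concatenate metadata + title + abstract and split on whitespace."""
    text = " ".join([paper.get("metadata", ""),
                     paper.get("title", ""),
                     paper.get("abstract", "")])
    return text.split()

def truncate(tokens: list, max_len: int = MAX_LEN) -> list:
    """What the 500-token cap does: keep the first max_len tokens, drop the rest."""
    return tokens[:max_len]

papers = [
    {"metadata": "venue terms ...", "title": "Gecko adhesion", "abstract": "..."},
]  # in practice, load the training/test set here

too_long = [p for p in papers if len(to_tokens(p)) > MAX_LEN]
print(f"{len(too_long)} of {len(papers)} papers exceed {MAX_LEN} tokens")
```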

Paht: There are several ways of handling this.

  • We break the abstract down into 500-word chunks and give all chunks the same labels (a minimal sketch of this follows below).
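
A minimal sketch of that chunking idea, with hypothetical names and non-overlapping windows as assumptions (overlapping, strided windows would be a small variation): every chunk of an over-long paper inherits the paper's full label set, so no abstract text is discarded at training time.

```python
# Minimal sketch of the chunking idea: split an over-long token sequence into
# 500-token chunks and give every chunk the paper's full label set, so nothing
# is cut off. Names are illustrative; non-overlapping windows are an assumption.
from typing import Iterator, List, Tuple

def chunk_tokens(tokens: List[str], chunk_size: int = 500) -> Iterator[List[str]]:
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

def expand_example(tokens: List[str], labels: List[str],
                   chunk_size: int = 500) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (chunk, labels) training pairs; every chunk keeps the same labels."""
    for chunk in chunk_tokens(tokens, chunk_size):
        yield chunk, labels

# Example: a 1,200-token paper becomes three training examples with identical labels.
tokens = ["tok"] * 1200
labels = ["attach", "protect_from_harm"]  # illustrative biomimicry function labels
print(len(list(expand_example(tokens, labels))))  # -> 3
```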