nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/

Explore tokenizer optimization strategies to improve precision/recall. #84

Open bruffridge opened 3 years ago

bruffridge commented 3 years ago

Optimization Strategies:

  1. Paht's research indicates there may be a 25,000-token limit in SciBERT/Hugging Face. Is this limitation present in MATCH's tokenizer? Are 25,000 tokens enough to cover the vocabulary we are dealing with? Would increasing this number improve precision/recall? (See the vocabulary check sketched after this list.)

  2. See whether using stemming/lemmatization to reduce the number of distinct tokens improves precision/recall (see the lemmatization sketch after this list).

  3. Input sequences are truncated to 500 tokens in MATCH. This may result in truncated abstracts (not an issue in our current dataset; see Eric's comment below).
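
A quick way to ground item 1 is to inspect the tokenizer's vocabulary directly. Here is a minimal sketch using Hugging Face `transformers`; the `allenai/scibert_scivocab_uncased` checkpoint is an assumption about which SciBERT variant is in use, and MATCH builds its own vocabulary during preprocessing, so its limit would have to be read from MATCH's generated vocab file rather than from this tokenizer.

```python
# Minimal sketch (assumes the allenai/scibert_scivocab_uncased checkpoint):
# report the SciBERT WordPiece vocabulary size and spot-check how much of a
# sample sentence falls back to [UNK], to see whether a ~25,000-token cap
# is actually binding for our corpus.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

print("base vocabulary size:", tokenizer.vocab_size)
print("with added special tokens:", len(tokenizer))

sample = "Biomimetic riblet microstructures reduce hydrodynamic drag on shark skin."
tokens = tokenizer.tokenize(sample)
unk_rate = tokens.count(tokenizer.unk_token) / max(len(tokens), 1)
print(tokens)
print("UNK rate:", unk_rate)
```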
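
For item 2, a minimal sketch of the preprocessing step, using NLTK (an assumed dependency; spaCy would work just as well). It only shows how stemming/lemmatization collapses surface forms into fewer distinct tokens; whether that actually improves precision/recall would have to be measured end to end.

```python
# Minimal sketch: collapse inflected surface forms before tokenization so that
# "adheres" / "adhesive" / "adhesion"-style variants map to fewer distinct
# tokens. NLTK is an assumed dependency (pip install nltk).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lemmatizer lookup data
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "Gecko feet adhere to smooth surfaces using hierarchical adhesive structures"
words = text.lower().split()

print("stemmed:   ", [stemmer.stem(w) for w in words])
print("lemmatized:", [lemmatizer.lemmatize(w) for w in words])
```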

Eric: MATCH pads/truncates its input token sequences to a default of 500 tokens (see truncate_text in auto-labeler/MATCH/src/MATCH/deepxml/data_utils.py); my scripts don't change that. My intuition is that we don't need the full abstract, and that the first 500 tokens (minus the metadata tokens) should be enough.

Eric: Yes, the full abstract would be strictly more useful than only part of it. However, I think the difference is negligible in our dataset: only 2 of the 1,149 papers in my training and test sets have a token sequence (metadata + title + abstract) longer than 500 tokens.
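
A minimal sketch of the length audit Eric describes, under assumptions about field names and whitespace tokenization (MATCH's truncate_text may count and cut tokens differently): count how many papers' metadata + title + abstract exceed the 500-token budget, and show the truncation that would otherwise happen silently.

```python
# Minimal sketch of the length audit described above. Field names and
# whitespace tokenization are assumptions; MATCH's own truncate_text may
# count and cut tokens differently.
MAX_LEN = 500

def to_tokens(paper: dict) -> list:
    """Concatenate metadata + title + abstract and split on whitespace."""
    text = " ".join([paper.get("metadata", ""),
                     paper.get("title", ""),
                     paper.get("abstract", "")])
    return text.split()

def truncate(tokens: list, max_len: int = MAX_LEN) -> list:
    """What the 500-token cap does: keep the first max_len tokens, drop the rest."""
    return tokens[:max_len]

papers = [
    {"metadata": "venue terms ...", "title": "Gecko adhesion", "abstract": "..."},
]  # in practice, load the training/test set here

too_long = [p for p in papers if len(to_tokens(p)) > MAX_LEN]
print(f"{len(too_long)} of {len(papers)} papers exceed {MAX_LEN} tokens")
```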

Paht: There are several ways of handling this.

  • We break the abstract down into 500-word chunks and give all chunks the same labels (a minimal sketch of this follows below).
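
A minimal sketch of that chunking idea, with hypothetical names and non-overlapping windows as assumptions (overlapping, strided windows would be a small variation): every chunk of an over-long paper inherits the paper's full label set, so no abstract text is discarded at training time.

```python
# Minimal sketch of the chunking idea: split an over-long token sequence into
# 500-token chunks and give every chunk the paper's full label set, so nothing
# is cut off. Names are illustrative; non-overlapping windows are an assumption.
from typing import Iterator, List, Tuple

def chunk_tokens(tokens: List[str], chunk_size: int = 500) -> Iterator[List[str]]:
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]

def expand_example(tokens: List[str], labels: List[str],
                   chunk_size: int = 500) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (chunk, labels) training pairs; every chunk keeps the same labels."""
    for chunk in chunk_tokens(tokens, chunk_size):
        yield chunk, labels

# Example: a 1,200-token paper becomes three training examples with identical labels.
tokens = ["tok"] * 1200
labels = ["attach", "protect_from_harm"]  # illustrative biomimicry function labels
print(len(list(expand_example(tokens, labels))))  # -> 3
```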