Made during Google Summer of Code (GSoC) 2021 for WormBase.
pip3 install wheel
pip3 install -r requirements.txt
Run the notebooks in order.
Of the 100 papers tested (93 of which were in the manually curated ground truth file), gene-mutation matches were found in 53.
A total of 2433 matches were present in those 53 papers, and the pipeline found 977 matches.
TP: 472, FP: 505
Precision: 48.3%
After manually checking the false positives and updating the ground truth file:
TP: 807, FP: 170
Precision: 82.59%
Not every reported FP is a genuine false positive. Manual verification of the final output showed that some were true positives that had been missed during the original manual curation.
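The precision figures reported above follow from the usual definition, TP / (TP + FP). A minimal sketch of that arithmetic, using only the counts listed here (the function name `precision` is illustrative and is not taken from the project notebooks):

```python
def precision(tp: int, fp: int) -> float:
    """Return the fraction of pipeline matches that are true positives."""
    total = tp + fp
    if total == 0:
        raise ValueError("precision is undefined when there are no matches")
    return tp / total


# Counts reported above. In both cases TP + FP equals the 977 matches
# produced by the pipeline.
before = precision(tp=472, fp=505)  # original ground truth file
after = precision(tp=807, fp=170)   # after updating the ground truth file

print(f"before update: {before:.4f}")
print(f"after update:  {after:.4f}")
```

Only the split between TP and FP changes between the two evaluations. The 977 pipeline matches are the same in both.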
These ideas, while interesting, could not be implemented during the two-month coding period. If pursued, they might improve recall substantially.
The majority of the mutations extracted in notebook 2 are being ignored (almost 73% of outputs during development). This is partly due to limitations in the mutation normalization block in notebook 3 and partly due to subpar NER predictions, since the model was trained on limited data for mutations described in natural language.
Additional data that will help curators in final verification.
Faster and leaner.
This project would not have been possible without their support.
Magdalena Zarowiecki
Paul Davis
Valerio Arnaboldi
@article{https://doi.org/10.17912/micropub.biology.000578,
doi = {10.17912/MICROPUB.BIOLOGY.000578},
url = {https://www.micropublication.org/journals/biology/micropub-biology-000578},
author = {Mallick, Rishab and Arnaboldi, Valerio and Davis, Paul and Diamantakis, Stavros and Zarowiecki, Magdalena and Howe, Kevin},
title = {Accelerated variant curation from scientific literature using biomedical text mining},
publisher = {microPublication Biology},
year = {2022}
}