Speed expectation -- what is "fast"?

serenalotreck commented 1 year ago

I'm using TaxoNERD on a large number of academic paper abstracts, and I'm curious what the baseline for "fast" is, as mentioned in the docs. I'm finding that it takes between 1 and 30 seconds to classify the entities in a single abstract with prefer_gpu=True, and the time appears to scale with the number of entities identified in the text. That seems quite slow to me -- is that on par with your observations of TaxoNERD's speed, or is there something I'm doing wrong?

EDIT: Here's a code snippet of exactly what I'm loading:

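from taxonerd import TaxoNerd

# Load the BioBERT-based model with entity linking against the NCBI taxonomy
taxonerd = TaxoNerd(prefer_gpu=True)
nlp = taxonerd.load(model='en_core_eco_biobert', linked='ncbi_taxonomy', threshold=0.7)

# mypapers is an iterable of abstract strings; each call handles one abstract
for abstract in mypapers:
    ent_df = taxonerd.find_in_text(abstract)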

Thanks!

nleguillarme commented 1 year ago

Hi @serenalotreck, thank you for using TaxoNERD.

Actually, TaxoNERD is quite fast at recognising taxonomic entities in text (especially if you have some GPUs available to speed up the execution of larger models), but linking entities to reference taxonomies is considerably slower, and I think this explains your observations.
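One way to check where the time goes is to time the same abstracts through two pipelines, one that only recognises entities and one that also links them. The following is a minimal sketch, not part of TaxoNERD: `time_per_text` and `summarize` are hypothetical helper names, and the commented usage assumes that `load()` can be called without a linker argument to get a recognition-only pipeline.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable, List


def time_per_text(process: Callable[[str], object], texts: Iterable[str]) -> List[float]:
    """Return the wall-clock seconds `process` spends on each text."""
    durations = []
    for text in texts:
        start = time.perf_counter()
        process(text)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Collapse per-text timings into min / mean / max seconds."""
    return {"min": min(durations), "mean": mean(durations), "max": max(durations)}


# Hypothetical usage (assumes two separately loaded TaxoNerd objects,
# `ner_only` loaded without a linker and `ner_and_link` loaded with one):
#
#   sample = mypapers[:20]
#   print(summarize(time_per_text(ner_only.find_in_text, sample)))
#   print(summarize(time_per_text(ner_and_link.find_in_text, sample)))
```

If the second summary is much larger than the first, the difference is an estimate of the linking cost per abstract.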

At the moment I have not found a faster method for entity linking that can handle large taxonomies like the NCBI or GBIF taxonomies.

Either way, I'm always interested in hearing about use cases for information extraction in biology/ecology, so feel free to contact me by email if you'd like to chat.

serenalotreck commented 1 year ago

Thanks so much! Do you think there's a way to extract all the entities in one pass, and then later do the linking in batches to speed it up? Will email so we can chat further!
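The two-pass idea could look roughly like the sketch below: recognise entities in every abstract first, collect the distinct mention strings, link each distinct mention once, and then map the results back. This is only an illustration under stated assumptions. `extract_mentions` and `link_mention` are hypothetical callables standing in for a recognition-only step and a single-mention linking step; whether TaxoNERD exposes its linker separately in this way is not confirmed in this thread. Any saving comes from the same taxon names recurring across abstracts.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def extract_then_link(
    texts: Iterable[str],
    extract_mentions: Callable[[str], Iterable[str]],  # hypothetical: recognition only
    link_mention: Callable[[str], Optional[str]],      # hypothetical: link one mention
) -> List[List[Tuple[str, Optional[str]]]]:
    """Link each distinct mention once, then reattach links to every text."""
    # Pass 1: run the fast recognition step on every text.
    mentions_per_text = [list(extract_mentions(text)) for text in texts]

    # Collect the distinct mention strings across the whole corpus.
    unique_mentions = {m for mentions in mentions_per_text for m in mentions}

    # Pass 2: run the slow linking step once per distinct mention.
    links = {m: link_mention(m) for m in sorted(unique_mentions)}

    # Map the cached links back onto each text's mentions, preserving order.
    return [[(m, links[m]) for m in mentions] for mentions in mentions_per_text]
```

With this layout the linking step runs once per distinct mention rather than once per occurrence, so the benefit grows with how often names repeat in the corpus.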
