scan-bugs-org / scan

The Symbiota2 project is an open source software project with the central goal of developing online tools that aid in the generation, exploration and management of biodiversity data (collection specimens, observations, images, checklists, keys, etc.). See also: http://bdj.pensoft.net/articles.php?id=1114 and http://symbiota.org/
GNU General Public License v2.0

BOM?? #38

Open neilcobb opened 4 years ago

neilcobb commented 4 years ago

Jorrit, have you used BOM (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6344444/)?

Is there a tool for non-programmers to help extract occurrences and associations from publications?

@seltmann @jhpoelen

jhpoelen commented 4 years ago

I have not used Biodiversity Observations Miner (BOM) personally, but others, such as the first author of the paper, @fgabriel1891, have used it in studies like:

Muñoz, G., Trøjelsgaard, K., & W.D. Kissling. 2019. A synthesis of animal-mediated seed dispersal of palms reveals distinct biogeographic differences in species interactions. The Journal of Biogeography. 10.1111/jbi.13493

Gabriel Muñoz. (2018, February 15). fgabriel1891/Plant-Frugivore-Interactions-SouthEastAsia v1.0 (Version v1.0). Zenodo. http://doi.org/10.5281/zenodo.1173745

In an effort to keep the scope of GloBI narrow, I imagine leaving the transcription of interactions from literature to specialists like @fgabriel1891. These specialists might benefit from using tools like the Biodiversity Observations Miner to help (partly) automate their transcription workflows.

@neilcobb perhaps you can contact Gabriel @fgabriel1891 and learn about his (and others') experiences using the tool he helped build.

neilcobb commented 4 years ago

Thanks Jorrit. Gabriel @fgabriel1891, I am happy to move this to BiodiversityObservationsMiner. The SCAN database is impressive, but for some groups it would be great to supplement the occurrence data with biotic association data, and a lot of association data (e.g., bees and plants) occurs in the literature.

jhpoelen commented 4 years ago

@neilcobb Also, note that many other projects have already undertaken the task of transcribing interaction records from literature, and GloBI has indexed many of them (see, e.g., https://www.globalbioticinteractions.org/references.html?interactionType=pollinates&sourceTaxon=Apis for a list of bee pollination references). So, I'd like to encourage doing a gap analysis of existing transcribed records and considering partnerships before taking on yet another expedition into the literature.
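A gap analysis like the one suggested above can start as a set difference between the references a group plans to mine and the references already indexed. The sketch below assumes both lists are available as DOIs; the function names `normalize_doi` and `find_gaps` are hypothetical, and how the indexed list is exported from GloBI is left out.

```python
# Hypothetical helper names; a sketch of a reference-level gap analysis.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver prefix, if any."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def find_gaps(candidate_dois, indexed_dois):
    """Return the candidate DOIs that are not yet in the indexed list."""
    indexed = {normalize_doi(d) for d in indexed_dois}
    candidates = {normalize_doi(d) for d in candidate_dois}
    return sorted(candidates - indexed)


# Example using the two DOIs cited earlier in this thread.
candidates = ["https://doi.org/10.1111/JBI.13493", "doi:10.5281/zenodo.1173745"]
already_indexed = ["10.1111/jbi.13493"]
print(find_gaps(candidates, already_indexed))
```

Only references without a match in the indexed list come back, so the remaining list is the part of the literature that still needs transcription.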

neilcobb commented 4 years ago

@jhpoelen My first option was to see if Harold Ikerd was interested in applying this to the literature database for the USDA-ARS lab; they have digitized most of their pubs: https://digitalcommons.usu.edu/piru/

neilcobb commented 4 years ago

@jhpoelen I forgot to ask, but I typed Apidae into GloBI and got nothing. Is it indexed such that any species in a queried family would show up?

jhpoelen commented 4 years ago

@neilcobb I was unable to reproduce your issue: I was able to get records related to Apidae (see attached screenshot, "Screenshot from 2020-06-15 10-58-48"). Please provide detailed steps to help me reproduce it.

> My first option was to see if Harold Ikerd was interested in applying this to the literature database for the USDA-ARS lab; they have digitized most of their pubs: https://digitalcommons.usu.edu/piru/

Great to hear that you are thinking of partnering with Harold Ikerd of USDA-ARS.

neilcobb commented 4 years ago

Sorry, I do not know why but I expected Apidae to autofill or show up in the drop-down list. When I hit return I received results.

jhpoelen commented 4 years ago

@neilcobb please provide detailed steps, so I can reproduce.

neilcobb commented 4 years ago

image

jhpoelen commented 4 years ago

@neilcobb thanks for sharing your specific example. This is a known issue (https://github.com/globalbioticinteractions/globalbioticinteractions.github.io/issues/58), and I am hoping that it will get resolved at some point. You probably already noticed that the suggested names are all Apidae.

neilcobb commented 4 years ago

@jhpoelen no problem and I did notice they were all Apidae............only ~5,000 more Apids to add!

jhpoelen commented 4 years ago

@neilcobb I made a change so that if an exact match is available, it will be listed first. See attached screenshot ("Screenshot from 2020-06-15 16-28-16").
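The "exact match first" behaviour described above can be illustrated with a stable sort. This is only a sketch of the idea, not the change actually made to the GloBI site; the function name `rank_suggestions` and the suggestion strings are made up for illustration.

```python
def rank_suggestions(query, names):
    """Move names that exactly match the query (ignoring case) to the front.

    sorted() is stable and False sorts before True, so exact matches come
    first while all other suggestions keep their original relative order.
    """
    wanted = query.strip().casefold()
    return sorted(names, key=lambda name: name.casefold() != wanted)


# Hypothetical suggestion list; only the ordering matters here.
suggestions = ["Apidae sp. 1", "Apidae", "Apidae sp. 2"]
print(rank_suggestions("apidae", suggestions))
```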

neilcobb commented 4 years ago

@jhpoelen cool, thanks Jorrit

seltmann commented 4 years ago

@neilcobb @jhpoelen we are finding with https://github.com/seltmann/bee-interaction-database that it is not so hard to pull out interactions, but many papers do not include the data on specific interactions, only the synopsis or conclusions drawn from the observed interactions.

neilcobb commented 4 years ago

@seltmann @jhpoelen so you do not use BOM, you just manually copy and paste?

Would be good to have a list/database of all pubs that have been searched

If you have searched USDA-ARS pubs Harold would probably like to know

fgabriel1891 commented 4 years ago

Hi @neilcobb,

Thanks for your interest in BOM. I built the tool to reduce the time spent manually searching for species interactions in the literature. BOM works in a straightforward way: it reads PDFs in batch and outputs the text snippets where scientific names have been detected. You can further filter those snippets with a custom dictionary of terms, which is geared in particular towards plant-frugivore interactions. In my experience, the benefit of using BOM lies in the time gained by reading PDFs in batch, and it is most useful if you search with a particular taxon in mind. At the very least, it helps prune articles with non-relevant information. I coded the BOM GUI a while ago as part of my MSc literature thesis to make it accessible to non-programmers. However, I now have an unpublished approach, geared towards a BOM-like framework, that runs in parallel on a multicore server. It makes the retrieval of species-targeted text more efficient and avoids the manual use of the GUI.
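The snippet-and-filter idea described above can be sketched in a few lines. This is not BOM's implementation: it assumes the PDFs have already been converted to plain text, it matches against a user-supplied list of taxon names instead of detecting scientific names automatically, and the names `split_sentences` and `find_snippets` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_snippets(documents, taxon_names, terms):
    """Return (doc_id, sentence) pairs that mention a taxon and a term.

    documents   -- dict mapping a document id to its plain text
    taxon_names -- names to look for (stand-in for automatic name detection)
    terms       -- dictionary of interaction-related word stems
    """
    names = [n.casefold() for n in taxon_names]
    stems = [t.casefold() for t in terms]
    hits = []
    for doc_id, text in documents.items():
        for sentence in split_sentences(text):
            lowered = sentence.casefold()
            has_name = any(n in lowered for n in names)
            has_term = any(t in lowered for t in stems)
            if has_name and has_term:
                hits.append((doc_id, sentence))
    return hits


# Made-up example text.
docs = {
    "paper1": "We surveyed ten sites. Apis mellifera was observed visiting "
              "flowers of Salvia. Rainfall was low.",
}
for doc_id, snippet in find_snippets(docs, ["Apis mellifera"], ["visit", "pollinat"]):
    print(doc_id, "|", snippet)
```

The sentence splitter is deliberately crude (author abbreviations such as "L." would split a sentence early), so a real workflow would likely swap in a proper tokenizer and a name-recognition service.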

Also as @seltmann correctly points out, pairwise interaction data is very heterogeneous in literature, so it is better to start a BOM search framework with a very broad corpus of literature. Sometimes interaction data can be found inside the text of articles of an unrelated ecological topic. In addition, BOM seems to be good at finding key references, containing interaction data within those articles presenting only conclusions from observations.

That being said, using the records from the SCAN database to make targeted literature searches for arthropod-plant interactions seems to me like a great contribution. Despite the heterogeneous way species interactions are described in the literature, I believe that you could still find valuable information in a (semi-)automated way. Furthermore, I think it is possible to link literature-derived information with occurrence data to make inferences about pairwise plant-arthropod interactions (i.e., quantifying some probability of occurrence of a pairwise interaction), using a BOM search framework as a starting point. My main research interest now lies in network assembly processes. As part of it, I'm starting to develop a methodological approach to reconstructing pairwise interactions to build meta-networks. I'll be happy to have a chat anytime on how these ideas may apply to expanding the SCAN database, if you are interested, of course.

Best, Gabriel.

neilcobb commented 4 years ago

Gabriel @fgabriel1891 ,

Thanks. Specimen data is great, but there is so much in the literature that could augment it, and biotic associations are probably at the top of the list. I hope literature mining will occur across all arthropods, but I think bees are a great signature project.

My initial intent was to have someone at USDA apply BOM to all the PDFs created by Harold Ikerd for the USDA-ARS bee lab. In the process, @jhpoelen suggested I first find out what has already been reviewed, and then @seltmann stated that her lab has already mined bee-plant data manually and that it was pretty easy. To Jorrit's point, if we could do an exhaustive search for bee-flower associations (and additional species occurrences), it would be pretty amazing. To Katja's observation, I assume you think that BOM is significantly better than just doing manual searches. So, if the goal extends from just having USDA mine their own PDFs to attempting to do this for all literature that has data on bee associations and occurrences, how do we structure the project, and would it need directed funding or could it be crowd-sourced? If it were feasible to conduct a comprehensive review, would it then be easy to extend the associations to bee predators, parasites, competitors, and mutualists (symbiotic to commensal)?

You could either respond to this and/or set up a Zoom meeting and go over these.

Thanks, Neil

seltmann commented 4 years ago

@neilcobb I have a literature group that goes through and annotates literature. GloBI has all of the references we have used, so finding all of the references for bees in GloBI would give a full list of reviewed literature. There are others too, including https://saveplants.org/national-collection/pollinator-search, which is also indexed by GloBI.

fgabriel1891 commented 4 years ago

Hi @neilcobb

Manually extracting biotic interaction observations from literature is not difficult in itself. But scaling up the reading and sorting of articles when searching manually is challenging. For a small corpus, manual searches work very well; initiatives like BOM become more efficient as the literature corpus grows, particularly because the reading and sorting time per person can be greatly reduced. In general, there is always a trade-off between time spent and specificity of results when choosing between text-mining frameworks and manual search. But I think joint approaches can work well for biodiversity data mobilization tasks.

In a nutshell: 1) Compile machine-readable PDFs of interest; 2) Screen for species (or taxon)-level occurrences per article; 3) Do topic modeling (this can be done at the article level and at the text-snippet level); 4)* Build a model to infer species-level associations and/or manually go through the text snippets classified per taxon; 5) Clean and revise the dataset; 6) Share the information with data compilers. (*) Parts that can be automated with BOM-like mining frameworks.
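As a very rough stand-in for step 4, candidate associations can be proposed by counting how often two taxa are mentioned in the same snippet. This is a sketch under strong assumptions: co-mention is not evidence of an interaction, the taxon list is supplied by hand, and `candidate_pairs` is a hypothetical name, not part of BOM.

```python
from collections import Counter
from itertools import combinations


def candidate_pairs(snippets, taxon_names):
    """Count co-mentions of taxon pairs across text snippets.

    Returns a Counter keyed by alphabetically ordered (taxon_a, taxon_b)
    tuples. High counts are candidates for manual review, nothing more.
    """
    counts = Counter()
    unique_names = set(taxon_names)
    for snippet in snippets:
        lowered = snippet.casefold()
        present = sorted(n for n in unique_names if n.casefold() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Made-up snippets for illustration.
snippets = [
    "Bombus impatiens visited Solidago canadensis.",
    "Bombus impatiens was seen on Solidago canadensis and Trifolium pratense.",
    "Trifolium pratense was flowering.",
]
taxa = ["Bombus impatiens", "Solidago canadensis", "Trifolium pratense"]
for pair, n in candidate_pairs(snippets, taxa).most_common():
    print(n, pair)
```

Note that the plant-plant pair also shows up as a candidate, which is why steps 4 and 5 still need a model or a human reviewer to separate real interactions from mere co-occurrence in the text.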

I would say that it depends on your needs; it can be crowd-sourced as well. For example, the snippet revisions can be crowdsourced. Also, if going for a fully automated search, building the training dataset (i.e., snippets of text with known observations) can be crowdsourced as well.

Theoretically, this sort of framework can be applied to any associations reported in the literature. The challenge is to identify the appropriate keywords commonly used as descriptors of each of the different interaction types. That can be done with the help of experts. However, biotic associations are reported very heterogeneously in the literature, and terms overlap with each other. This makes it harder to apply a completely automated mining framework smoothly. But it is certainly possible.

I hope I could briefly answer your questions. I'll be happy to talk further over Zoom if needed.
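The keyword problem described above, including the overlap between terms, can be seen in a small dictionary-based classifier. The interaction labels and word stems below are illustrative guesses, not an expert-curated vocabulary, and `classify_snippet` is a hypothetical name.

```python
# Illustrative stems only; a real dictionary would be built with experts.
INTERACTION_TERMS = {
    "pollinates": ["pollinat", "visit"],
    "eats": ["feed", "forag", "consum"],
    "parasitizes": ["parasit", "host"],
}


def classify_snippet(snippet, term_map=INTERACTION_TERMS):
    """Return every interaction type whose stems occur in the snippet.

    More than one type can come back, which is exactly the overlap that
    makes fully automated classification hard.
    """
    lowered = snippet.casefold()
    return sorted(
        label
        for label, stems in term_map.items()
        if any(stem in lowered for stem in stems)
    )


print(classify_snippet("Workers were foraging and visiting flowers."))
print(classify_snippet("The wasp parasitizes its host."))
```

A snippet that lands in several categories at once is a natural candidate for the crowd-sourced or expert review step mentioned earlier.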

Best, Gabriel.

neilcobb commented 4 years ago

Gabriel,

Thanks. My action items will be:

  1. Review BOM more carefully

  2. Organize a group to tackle this problem

  3. Set up a Zoom with you in about two weeks to coincide with having a critical mass of questions about the best way to proceed.

Regardless, I assume I want to:

  1. Create a pdf library of machine-readable pdfs

  2. Develop search terms

Cheers, Neil

fgabriel1891 commented 4 years ago

Hi @neilcobb

Sounds good! Please let me know about any questions you have and/or issues you find. Maybe it will work better if we communicate by email from now on: gabriel.munoz@concordia.ca. That way, this issue can be closed.

(*) Just a quick remark: if you plan to examine BOM in the near future, please use the local version rather than the server-based one (i.e., download the git repo and run it on your computer).

Many thanks, Gabriel.

jhpoelen commented 4 years ago

Great to hear that there's an effort to re-use existing tools and established workflows.

I suggest you also consult with the Biodiversity Heritage Library (in collaboration with Global Names) and Plazi. These two organizations have many years of experience in operating workflows that help extract terms from managed, machine-readable texts and PDFs. For some recent activities, please see https://github.com/globalbioticinteractions/zenodo-metadata/issues/1.