plazi / BLR-website


use case: Guardian article about new species published by NHM #63

Open myrmoteras opened 3 years ago

myrmoteras commented 3 years ago

This is a very interesting use case for BLR: https://www.theguardian.com/environment/2020/dec/30/moths-to-monkeys-503-new-species-identified-by-uk-scientists

The report is very simple: the NHM published descriptions of 503 new species.
The use case is simple too: Plazi provides the links to, and the treatments of, all 503 new species.

e.g. https://biolitrepo.org/?facets=true&journalYear=2020&page=0&q=NHM&resource=treatments&stats=true&type=all

The issue of course is a bit more complex, because

myrmoteras commented 3 years ago

http://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.uuid+bib.year+tax.status+matCit.collectionCode&groupingFields=bib.year+tax.status+matCit.collectionCode&FP-bib.year=2020&FP-matCit.collectionCode=NHM&format=HTML

how can we find out new sp. with collector affiliation? here is an approach: http://tb.plazi.org/GgServer/dioStats/stats?outputFields=doc.doi+bib.year+bib.source+auth.aff+treat.id+treat.status&groupingFields=bib.year+bib.source+auth.aff+treat.id+treat.status&FP-bib.year=2020&FP-auth.aff=%25Cromwell%25&FP-treat.status=%22sp.%20nov.%22&format=HTML
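A minimal sketch of how a query like the one above could be assembled in code rather than by hand. The base URL, field names and `FP-` filter prefix are taken from the link above; the helper name `build_stats_url` is hypothetical, and the sketch only builds the URL, it does not call the server.

```python
from urllib.parse import urlencode

# Endpoint copied from the link above.
BASE_URL = "http://tb.plazi.org/GgServer/dioStats/stats"


def build_stats_url(fields, filters, fmt="HTML"):
    """Build a stats query URL (hypothetical helper, not part of any Plazi library).

    fields:  list of field names, used for both outputFields and groupingFields
    filters: dict mapping a field name to its filter value (sent as FP-<field>)
    """
    params = {
        "outputFields": " ".join(fields),
        "groupingFields": " ".join(fields),
    }
    for name, value in filters.items():
        params[f"FP-{name}"] = value
    params["format"] = fmt
    # urlencode writes spaces as '+' and escapes '%' and '"' for us.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_stats_url(
    ["doc.doi", "bib.year", "bib.source", "auth.aff", "treat.id", "treat.status"],
    {"bib.year": "2020", "auth.aff": "%Cromwell%", "treat.status": '"sp. nov."'},
)
print(url)
```

The space in `"sp. nov."` comes out as `+` instead of `%20`; both are accepted encodings of a space in a query string.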

that's a complex one because the authors of the paper might not be the authors of the new species in the paper

so I would make an API call to get the affiliations of the authors, and the authority names, of all treatments that have the status sp.n.

then I would remove the authors that are not in the authority name

and go from there
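A rough sketch of the "remove the authors that are not in the authority name" step. The record layout is an assumption: each author is a dict with a `name` in "Surname, Given" form and an `affiliation` string, and the authority is a plain string such as "Smith & Oliveira, 2020". The real API output may look different, and matching on surnames alone cannot tell apart two authors who share a surname.

```python
import re


def authors_in_authority(authors, authority):
    """Keep only the authors whose surname appears in the authority string.

    Assumes (hypothetically) that author["name"] is formatted "Surname, Given".
    """
    authority_lower = authority.lower()
    kept = []
    for author in authors:
        surname = author["name"].split(",")[0].strip().lower()
        if not surname:
            continue
        # Whole-word match, so "Li" does not match inside "Oliveira".
        if re.search(r"\b" + re.escape(surname) + r"\b", authority_lower):
            kept.append(author)
    return kept


paper_authors = [
    {"name": "Smith, Jane", "affiliation": "Natural History Museum, Cromwell Road, London"},
    {"name": "Li, Wei", "affiliation": "Some University"},
]
species_authors = authors_in_authority(paper_authors, "Smith & Oliveira, 2020")
print([a["name"] for a in species_authors])
```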

problem is, we don't mark parts of the affiliation, but the affiliation as a whole string

so we don't have, say, an attribute named 'institution' and another one named 'address'

this means that we can't yet filter by institution using the API; we need external logic to accomplish that
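One possible shape for that external logic, as a sketch only: since the affiliation is a single string, look for known substrings and map them to an institution label. The pattern table and the label "NHM London" are made up for illustration; "cromwell road" mirrors the `%Cromwell%` filter used in the query above.

```python
# Hypothetical lookup table: institution label -> substrings that identify it.
INSTITUTION_PATTERNS = {
    "NHM London": ("cromwell road",),
}


def match_institution(affiliation, patterns=INSTITUTION_PATTERNS):
    """Return the first institution whose pattern occurs in the affiliation, else None."""
    text = affiliation.lower()
    for institution, needles in patterns.items():
        if any(needle in text for needle in needles):
            return institution
    return None


print(match_institution(
    "Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD, UK"
))
```

A substring table like this is easy to extend per museum, but it will miss affiliations that are written without any of the listed substrings.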

(98/100) papers scheduled again

so, I see four steps in this service: extraction, gathering, data manipulation, and data visualization

for the extraction part, we don't cover 100% of the literature, so we might miss some n. spp. from a particular museum. we have to keep that in mind.

for the gathering part, this is done - the api is available and we can retrieve what I described.

the data manipulation is neither complicated nor costly. there is no learning curve on my side (it's something that I've, to some extent, done before) and it could be accomplished with any programming language.

the data visualization is the trickiest part.

We live in the world of dashboards now, and we should start producing them for our 'clients' (publishers, museums)

but this is not something that any of us works with directly.

I've started with Google Data Studio

but there are other, better players in the market, like Qlik Sense and Tableau

these are tools that can pull in live data and then display interactive charts, like the one I did for EJT

the good part is, once we have done it for one, we can adapt it quickly for any other

now, side note

and what if we use 2021 to rebuild the Plazi website and start bringing these stats to life? like, which museum has the most n. spp. in 2021, which museum published the most in closed access journals, and so on

that would not hamper the service (which breaks down the stats)

but would bring to life this competition

based on data

just like we go to that Clarivate website to get a sense of the most important journals for taxonomy

people would have to go to Plazi to understand which institutions are the most relevant and what their publishing behavior is
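The "which museum has the most n. spp." ranking mentioned above could be sketched like this. The input is assumed to be (institution, treatment id) pairs that have already been extracted and matched; the same treatment can show up more than once (one row per author), so ids are de-duplicated per institution. The museum names and ids are placeholders, not real data.

```python
from collections import defaultdict


def rank_institutions(rows):
    """Rank institutions by number of distinct treatments, highest first.

    rows: iterable of (institution, treatment_id) pairs (assumed layout).
    """
    treatments = defaultdict(set)
    for institution, treatment_id in rows:
        treatments[institution].add(treatment_id)
    counts = {inst: len(ids) for inst, ids in treatments.items()}
    # Sort by count descending, then by name so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample_rows = [
    ("Museum A", "t1"),
    ("Museum A", "t1"),  # same treatment, second author
    ("Museum A", "t2"),
    ("Museum B", "t3"),
]
ranking = rank_institutions(sample_rows)
print(ranking)
```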