A potential successor to uBioRSS and uBioRSS Nomina Nova. See also my experience with bioGUID a decade or more ago.
Take RSS feeds from journals and databases, creating them if needed, then index by taxon and geography. Output RSS feeds keyed by taxon and/or geography. Create simple visualisations.
Original goal was to rely on RSS feeds, or generate my own RSS from various sources. Now seems better to use RSS if available, but otherwise generate schema.org-style JSON and use that directly for other, potentially richer sources.
RSS feeds from journals are regularly polled and added. RSS is converted to “internal” format, then augmented by adding DOIs, geography and taxa. The status of each feed is stored in feed status.json. Sadly, many RSS feeds don’t support conditional GET.
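For feeds that do support conditional GET, the polling step might look something like the following. This is a minimal sketch in Python rather than the PHP the harvester scripts are written in; the function names and the shape of the per-feed status dictionary (`etag`, `last_modified`) are assumptions for illustration, not the project's actual status format.

```python
import urllib.error
import urllib.request


def conditional_headers(status):
    """Build request headers from the validators saved on the previous poll."""
    headers = {}
    if status.get("etag"):
        headers["If-None-Match"] = status["etag"]
    if status.get("last_modified"):
        headers["If-Modified-Since"] = status["last_modified"]
    return headers


def fetch_if_changed(url, status):
    """Return (body, new_status); body is None when the server answers 304."""
    request = urllib.request.Request(url, headers=conditional_headers(status))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            new_status = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
            return response.read(), new_status
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep using the cached copy
            return None, status
        raise
```

A feed that ignores these headers simply returns the full body every time, so the same code path works for both kinds of feed; the only cost is the extra download.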
Some sources (e.g., Google Scholar, ZooBank) will be converted directly to “internal” format, then augmented.
Feed item is modelled as a schema.org `DataFeedItem` with the publication as an `item`.
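As a rough illustration of that model, the sketch below builds one feed item as a Python dictionary and serialises it to JSON. `DataFeedItem`, `dateCreated` and `item` are schema.org terms; the choice of `ScholarlyArticle` for the publication, the `PropertyValue` encoding of the DOI, and the function name are assumptions, and the internal format may well differ.

```python
import json


def make_feed_item(title, url, date_created, doi=None):
    """Wrap a publication in a schema.org DataFeedItem (illustrative shape only)."""
    article = {"@type": "ScholarlyArticle", "name": title, "url": url}
    if doi:
        # Assumed encoding of the DOI; other schema.org patterns are possible.
        article["identifier"] = {
            "@type": "PropertyValue",
            "propertyID": "doi",
            "value": doi,
        }
    return {
        "@context": "https://schema.org",
        "@type": "DataFeedItem",
        "dateCreated": date_created,
        "item": article,
    }


if __name__ == "__main__":
    example = make_feed_item(
        "uBioRSS: Tracking taxonomic literature using RSS",
        "https://doi.org/10.1093/bioinformatics/btm109",
        "2021-11-01T00:00:00Z",  # illustrative harvest timestamp
        doi="10.1093/bioinformatics/btm109",
    )
    print(json.dumps(example, indent=2))
```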
Need to set up the harvesting to be automatic. Would be nice to cache things for reanalysis if needed.
- `harvest-feeds.php` reads the feed list and caches each feed as an XML file in the folder `cache/latest` (an alias).
- `process-rss-feeds.php` parses each XML file in `cache/latest` and for each item adds it to the data store, then augments that, updating the item in the data store.
- `harvest-gs-email.php` reads any `.eml` files in folder `cache/latest` and converts them to native JSON.
- `process—internal.php` parses each JSON file in `cache/latest` and for each item adds it to the data store, then augments that, updating the item in the data store.
- `harvest-doaj.php` fetches journal articles using ISSN as key, then generates native JSON from the BibJSON the DOAJ API returns.
- `process—internal.php` parses each JSON file in `cache/latest` and for each item adds it to the data store, then augments that, updating the item in the data store.

Google Scholar can send email alerts for a search term, so an obvious approach is to use these alerts as a source. How do we do this? One approach is to use a service such as CloudMailin, which can take an email sent to a CloudMailin email address and forward that email as a JSON document to a URL (webhook). We can then parse the contents of the email. For debugging purposes we can use a service such as PostBin to receive these emails, for example https://postb.in/1632815014159-2470838529989. When using PostBin, note that you can retrieve the body of the request using a URL like https://postb.in/api/bin/[bin-id]/req/[request-id].
The Google Scholar alert email is in HTML so we need to parse it and extract the information we require. Note that Google Scholar doesn’t include DOIs in the results, so we may have to resolve URLs and go hunting for DOIs. Some links may be PDFs, ideally we can find the corresponding HTML link so that we can parse that.
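A minimal sketch of that parsing step is shown below, in Python rather than the project's PHP. It reads the HTML part of an `.eml` message, collects every link with its text, unwraps redirect-style links, and looks for a DOI in the resulting URL. It assumes that alert links wrap the real target in a `url=` query parameter; that assumption, the DOI regular expression, and all function names are illustrative and should be checked against real alert emails. It also keeps every link in the message, so footer links would need filtering out.

```python
import re
from email import message_from_string, policy
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# Loose DOI pattern; good enough for spotting DOIs embedded in URLs.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"'<>?#&]+")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def unwrap(href):
    """Return the target carried in a ?url=... parameter, else the link itself."""
    target = parse_qs(urlparse(href).query).get("url")
    return target[0] if target else href


def parse_alert(eml_text):
    """Extract name, URL and (if visible in the URL) DOI for each link in an alert."""
    message = message_from_string(eml_text, policy=policy.default)
    body = message.get_body(preferencelist=("html",))
    if body is None:
        return []
    collector = LinkCollector()
    collector.feed(body.get_content())
    results = []
    for href, title in collector.links:
        url = unwrap(href)
        match = DOI_PATTERN.search(url)
        results.append({"name": title, "url": url, "doi": match.group(0) if match else None})
    return results
```

Items that come back with `doi` set to `None` are the ones that need the extra step of resolving the URL and hunting for a DOI in the landing page.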
Lyubo mentions OAI endpoint, investigate further.
PubMed supports the creation of RSS feeds based on user searches, e.g. ("new species") OR ("n. sp.") OR ("sp. nov.") OR ("n. gen.") OR ("gen. nov.") OR ("n. comb.") OR ("comb. nov.")
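Building the search term from a list of phrases avoids quoting mistakes when the list changes. The sketch below (Python, illustrative names) assembles the term and a PubMed search URL for it; creating the RSS feed itself is still done from the PubMed search results page.

```python
from urllib.parse import urlencode

# Phrases taken from the example search above.
PHRASES = ["new species", "n. sp.", "sp. nov.", "n. gen.", "gen. nov.", "n. comb.", "comb. nov."]


def pubmed_term(phrases):
    """Join quoted phrases with OR, each wrapped in parentheses."""
    return " OR ".join(f'("{phrase}")' for phrase in phrases)


def pubmed_search_url(phrases):
    """URL of the PubMed search page for the combined term."""
    return "https://pubmed.ncbi.nlm.nih.gov/?" + urlencode({"term": pubmed_term(phrases)})


if __name__ == "__main__":
    print(pubmed_term(PHRASES))
    print(pubmed_search_url(PHRASES))
```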
Scrape using JSON.
ZooBank has RSS but it doesn’t seem to be updated(?). Can also query using year as a search term. JSON data doesn’t have precise time, nor does it have the DOI. GBIF https://www.gbif.org/dataset/c8227bb4-4143-443f-8cb2-51f9576aff14 https://doi.org/10.15468/wkr0kn seems to lag behind ZooBank.
Zootaxa has RSS feeds, but also has a taxon search feature, e.g., https://www.mapress.com/zt/search/search?query=Coleoptera&authors=&dateFromYear=2021&dateFromMonth=11&dateFromDay=&dateToYear=&dateToMonth=&dateToDay=&subject=&title=&abstract=&indexTerms= which might be used to generate taxon-specific feeds.
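A taxon-specific feed could start from a function that builds this search URL for a given taxon and start date. The sketch below (Python, hypothetical function name) reproduces the parameter names and order of the example URL; whether the search form accepts fewer parameters has not been checked here, so all of them are kept.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.mapress.com/zt/search/search"


def zootaxa_search_url(taxon, from_year, from_month):
    """Build a Zootaxa search URL for a taxon, limited to items from a given month."""
    params = [
        ("query", taxon),
        ("authors", ""),
        ("dateFromYear", from_year),
        ("dateFromMonth", from_month),
        ("dateFromDay", ""),
        ("dateToYear", ""),
        ("dateToMonth", ""),
        ("dateToDay", ""),
        ("subject", ""),
        ("title", ""),
        ("abstract", ""),
        ("indexTerms", ""),
    ]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(zootaxa_search_url("Coleoptera", 2021, 11))
```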
RSS feeds are variable in terms of the tags they include and how they handle external namespaces. Note also that dates in RSS feeds need not be in English, which means we need to translate them before converting to ISO 8601.
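One way to handle non-English dates is to drop the weekday, map the month name to its English abbreviation, and hand the result to a standard RFC 822 date parser. The sketch below does this in Python; the month table is a small illustrative sample (a few Spanish, French, German and Portuguese abbreviations), not a complete list, and it assumes the feed otherwise follows the usual `day month year time zone` layout.

```python
from email.utils import parsedate_to_datetime

# Illustrative sample only: lower-cased month abbreviations without trailing dots.
MONTHS = {
    "ene": "Jan", "janv": "Jan", "févr": "Feb", "fev": "Feb",
    "mär": "Mar", "mars": "Mar", "abr": "Apr", "avr": "Apr",
    "mai": "May", "juin": "Jun", "juil": "Jul", "ago": "Aug",
    "août": "Aug", "set": "Sep", "sept": "Sep", "okt": "Oct",
    "out": "Oct", "dic": "Dec", "déc": "Dec", "dez": "Dec",
}


def to_iso8601(rss_date):
    """Convert an RFC 822-style date with a possibly non-English month to ISO 8601."""
    tokens = rss_date.split()
    # The weekday is optional and language-specific, so discard it.
    if tokens and tokens[0].endswith(","):
        tokens = tokens[1:]
    # After removing the weekday the month is the second token.
    if len(tokens) > 1:
        key = tokens[1].lower().rstrip(".")
        tokens[1] = MONTHS.get(key, tokens[1])
    return parsedate_to_datetime(" ".join(tokens)).isoformat()


if __name__ == "__main__":
    print(to_iso8601("mar, 14 dic 2021 10:00:00 +0000"))
```

Dropping the weekday first matters because abbreviations collide across languages: `mar` is Tuesday in Spanish but March in English.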
Data | Validation tool |
---|---|
JSON-LD | https://json-ld.org/playground/ |
OPML | http://validator.opml.org |
RSS feed | https://validator.w3.org/feed/ |
Structured data using schema.org | https://validator.schema.org |
Feed is a list in descending time order, taxon facet is a treemap, geography facet is a map.
Experimenting with simple full text search based on Inside Wade, source code on GitHub. Uses a CouchDB view to convert text to list of terms then query that view to return a list of documents sorted by how well they match the query.
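The idea can be sketched outside CouchDB as an inverted index: split each document into terms, record which documents contain each term, then score documents by how many query-term occurrences they contain. The Python below is an illustration of that general technique under simple assumptions (lower-cased alphanumeric tokens, raw counts as the score); it is not the scoring used by the code this experiment is based on.

```python
import re
from collections import Counter, defaultdict


def terms(text):
    """Split text into lower-cased alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to {doc_id: occurrences}, much as a view would emit (term, doc_id)."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for term in terms(text):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Return document ids sorted by total occurrences of the query terms."""
    scores = Counter()
    for term in set(terms(query)):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by id so the order is stable.
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


if __name__ == "__main__":
    docs = {
        "a": "New species of Coleoptera from Borneo",
        "b": "Coleoptera checklist",
        "c": "Orchids of Peru",
    }
    print(search(build_index(docs), "new species coleoptera"))
```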
Leary, P. R., Remsen, D. P., Norton, C. N., Patterson, D. J., & Sarkar, I. N. (2007). uBioRSS: Tracking taxonomic literature using RSS. Bioinformatics, 23(11), 1434–1436. doi:10.1093/bioinformatics/btm109
Little, D. P. (2020). Recognition of Latin scientific names using artificial neural networks. Applications in Plant Sciences, 8(7). doi:10.1002/aps3.11378
Mindell, D. P., Fisher, B. L., Roopnarine, P., Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating, Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. doi:10.1371/journal.pone.0019491