ropensci / helminthR

Accesses parasite occurrence records from the London Natural History Museum's Host-Parasite database, which contains over a quarter of a million helminth records.
https://docs.ropensci.org/helminthR
GNU General Public License v3.0

add function to query citations #22

Open arw36 opened 4 years ago

arw36 commented 4 years ago

Currently, references are provided as a URL for each occurrence. This makes it difficult to synthesize the primary literature for each interaction, and it likely leads helminthR users to cite only LMNH and helminthR rather than the data publishers (a problem shared with other data aggregation platforms such as GBIF; see Escribano et al. 2018). This function takes the output of a previous occurrence query and returns the relevant primary literature.

An outstanding issue is that the LMNH website batches references to 30 articles per page. Currently, this will only synthesize the first 30 references per occurrence. I'm hoping this was an issue for the other helminthR functions, and you might have a solution already?

I think there could be several improvements to this, for instance linking primary literature back to specific interactions rather than full search queries. For now, I think this is a good first step.

Escribano N, Galicia D, Ariño AH. The tragedy of the biodiversity data commons: a data impediment creeping nigher? Database. 2018 Apr 9;2018:bay033.

taddallas commented 4 years ago

This looks fantastic! Thanks for your work on this. I don't think it's quite ready to merge into the package now, but I think it'll be a really nice addition. A couple of things that need to be worked out:

Let me know what you think about this, and let me know how I can help. I can look into the 30-citation limit, but this may not be a large problem if each host-helminth interaction is queried separately (as associations tend to be based on 1-5 citations).

Thanks again for your work on this. :)

arw36 commented 4 years ago

Thanks for the feedback. I'll work on those edits this week.

One solution for interactions with more than 30 references could simply be to include a warning that some references are cut off and that the user can go to the URL to retrieve them manually. This would only affect the uncommon cases (e.g. foxes, pigs).
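A minimal sketch of what such a warning could look like. The helper name `warnIfTruncated()`, its arguments, and the message wording are all hypothetical and not part of helminthR; the only detail taken from the thread is the 30-references-per-page limit.

```r
# Hypothetical helper (not part of helminthR): warn when the number of
# references retrieved for one occurrence reaches the page limit, which
# suggests that further references were cut off.
warnIfTruncated <- function(refs, refUrl, pageLimit = 30) {
  if (length(refs) >= pageLimit) {
    warning(
      "Only the first ", pageLimit, " references were retrieved; ",
      "see ", refUrl, " for the full list.",
      call. = FALSE
    )
  }
  # Return the references unchanged so the helper can sit inside a pipeline.
  invisible(refs)
}
```

Because the check uses `>=`, an occurrence with exactly 30 references would also trigger the warning; that seems an acceptable false positive given that the page alone does not reveal whether more exist.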

taddallas commented 4 years ago

I also just noticed a workaround for the 30 citations bit. The structure of the call can match the existing find functions, with some minor modifications.

    # Adapted from findParasite(): same query, but pointed at references.jsp.
    # group, subgroup, genus, species, location and hostState are the
    # arguments of the enclosing find function.
    url <- "http://www.nhm.ac.uk/research-curation/scientific-resources/taxonomy-systematics/host-parasites/database/references.jsp;"
    # dbfnsRowsPerPage is set very high, presumably so results are not paged.
    args <- list(dbfnsRowsPerPage = "500000", x = "13", y = "5", 
        paragroup = group, fmsubgroup = "Contains", subgroup = subgroup, 
        fmparagenus = "Contains", paragenus = genus, fmparaspecies = "Contains", 
        paraspecies = species, fmhostgenus = NULL, hostgenus = NULL, 
        fmhostspecies = NULL, hostspecies = NULL, location = location, 
        hstate = hostState, pstatus = NULL, showparasites = "on", 
        showhosts = "on", showrefs = "on", groupby = "parasite", 
        search = "Search")
    hp <- httr::GET(url, query = args)

I haven't checked, but I think the above code should pull the information for all citations associated with a given query (host/parasite info). The example is taken from the findParasite function, with only the base URL changed.
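One way the argument list might be wrapped so that it mirrors the existing find functions is sketched below. The function name `buildCitationQuery()` is hypothetical, and the sketch assumes that references.jsp accepts the same parameters as the findParasite query, which has not been verified in this thread.

```r
# Hypothetical helper: build the query list for a citation lookup using the
# same parameter names as the snippet above, dropping unset (NULL) entries.
buildCitationQuery <- function(genus = NULL, species = NULL, group = NULL,
                               subgroup = NULL, location = NULL,
                               hostState = NULL) {
  args <- list(
    dbfnsRowsPerPage = "500000", x = "13", y = "5",
    paragroup = group, fmsubgroup = "Contains", subgroup = subgroup,
    fmparagenus = "Contains", paragenus = genus,
    fmparaspecies = "Contains", paraspecies = species,
    location = location, hstate = hostState,
    showparasites = "on", showhosts = "on", showrefs = "on",
    groupby = "parasite", search = "Search"
  )
  # Keep only the arguments that were actually supplied.
  Filter(Negate(is.null), args)
}

q <- buildCitationQuery(genus = "Strongyloides")
names(q)

# Not run (needs network access and the httr package; `url` as defined in
# the snippet above):
# hp <- httr::GET(url, query = q)
```

Keeping the query construction separate from the HTTP call would let the argument handling be checked without contacting the museum's server.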

If possible, can we also get around the new imports (e.g., tidyr and reshape2)? The package requires a bunch of dependencies already, I think due to rvest requiring a bunch of tidyverse-esque stuff, but I'm not certain.

Thanks again for your work on this. Sorry I didn't notice the similar call for references earlier. Hopefully this helps, though I don't know if it's best to have findCitations take the same arguments as the other find functions or the set of interactions (as your code currently does).

arw36 commented 4 years ago

I removed the reshape2 and stringr dependencies. I'm not sure what the base R equivalents of tidyr's long-to-wide conversion are. These tidyr functions were added to filter by the reference comments, which hold some important annotations, such as whether a reference is a non-original source.
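For the long-to-wide step, base R's `stats::reshape()` with `direction = "wide"` is one possible replacement. The sketch below uses made-up placeholder data; the column names (`refId`, `field`, `value`) and the wording of the "non-original source" comment are assumptions for illustration, not the actual structure returned by the pull request's code.

```r
# Placeholder data in long format: one row per reference field.
refsLong <- data.frame(
  refId = c(1, 1, 2, 2),
  field = c("citation", "comment", "citation", "comment"),
  value = c("Reference A", "non-original source", "Reference B", NA),
  stringsAsFactors = FALSE
)

# Long to wide with base R: one row per refId, one column per field.
refsWide <- reshape(refsLong, idvar = "refId", timevar = "field",
                    direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix.
names(refsWide) <- sub("^value\\.", "", names(refsWide))

# Keep references whose comment does not flag them as non-original.
# grepl() returns FALSE for NA, so references without a comment are kept.
original <- refsWide[!grepl("non-original", refsWide$comment), ]
```

`reshape()` is more verbose than the tidyr call but ships with every R installation, so it would add nothing to the package's imports.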

I'll have to play around with querying the URL directly. I'd prefer the references to be linked to a previous query's interactions, as that ties the outputs together more directly.