ropensci / neotoma

Programmatic R interface to the Neotoma Paleoecological Database.
https://docs.ropensci.org/neotoma

NeotomaDB can return non-count data and non-unique TaxonName data that we crosstab and return as component `counts` #32

Open gavinsimpson opened 11 years ago

gavinsimpson commented 11 years ago

In fixing get_download() after the move to reshape2 I noted a couple of issues, one of which I have tried to effect a fix for in #29. In the examples for get_download() we have

#' #  Search for sites with "Thuja" pollen that are older than 8kyr BP and
#' #  that are on the west coast of North America:
#' t8kyr.datasets <- get_datasets(taxonname='Thuja*', loc=c(-150, 20, -100, 60), ageyoung = 8000)
#'
#' #  Returns 3 records (as of 04/04/2013), get dataset for the first record, Gold Lake Bog.
#' GOLDKBG <- get_download(t8kyr.datasets[[1]]$DatasetID)

When forming the counts component, the TaxonName may not be unique. For example, in this data set Lycopodium tablets occurs twice in TaxonName, differentiated by the units field. However, we wish to crosstab on the TaxonName variable. When we do that, dcast() (and cast() before it) falls back to fun.aggregate = length - i.e. it counts how many times each element of TaxonName was present in each sample instead of returning the recorded values. This probably went unnoticed because the call was wrapped in suppressMessages(), which hides the message reporting that default, and also perhaps because not all data sets have non-unique TaxonName values.
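To make the failure mode concrete, here is a minimal sketch using reshape2 on made-up data. The data frame, its values, and the column names (SampleID, VariableUnits, Value) are assumptions for illustration only, not the actual structure that get_download() works with.

```r
library(reshape2)

# Hypothetical long-format sample data (values and units are invented).
# "Lycopodium tablets" appears twice per sample, differing only in units.
samples <- data.frame(
  SampleID      = c(1, 1, 1, 2, 2, 2),
  TaxonName     = c("Pinus", "Lycopodium tablets", "Lycopodium tablets",
                    "Pinus", "Lycopodium tablets", "Lycopodium tablets"),
  VariableUnits = c("NISP", "unit A", "unit B",
                    "NISP", "unit A", "unit B"),
  Value         = c(120, 1, 13500, 98, 1, 13500),
  stringsAsFactors = FALSE
)

# The SampleID/TaxonName pair is not a unique key for these rows
has_dups <- any(duplicated(samples[c("SampleID", "TaxonName")]))

# With no fun.aggregate supplied, dcast() defaults to length and says so
# in a message, which suppressMessages() hides
wide <- suppressMessages(
  dcast(samples, SampleID ~ TaxonName, value.var = "Value")
)
print(wide)
```

In this sketch every cell of `wide` holds a row count rather than a recorded value, including the Pinus column, because the aggregation function is applied to all cells once any duplicate key exists.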

From the example it seems Simon was aware that more than just pollen counts would be in the counts component, but if NeotomaDB doesn't enforce unique values in TaxonName then neotoma needs to handle this. What I did here was pull out only the rows where `TaxaGroup != "Laboratory analyses"` and use those for the counts component. Then I added a new component lab.data which holds those rows that match `TaxaGroup == "Laboratory analyses"`.
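The split described above might look roughly like the following base R sketch. It is not the code from #29: the data, the "Vascular plants" group label, and the SampleID, VariableUnits and Value columns are hypothetical, and tapply() stands in for the dcast() call.

```r
# Hypothetical long-format sample data (values, units and groups invented)
samples <- data.frame(
  SampleID      = rep(c(1, 2), each = 4),
  TaxonName     = rep(c("Pinus", "Thuja",
                        "Lycopodium tablets", "Lycopodium tablets"), 2),
  VariableUnits = rep(c("NISP", "NISP", "unit A", "unit B"), 2),
  TaxaGroup     = rep(c("Vascular plants", "Vascular plants",
                        "Laboratory analyses", "Laboratory analyses"), 2),
  Value         = c(120, 4, 1, 13500, 98, 7, 1, 13500),
  stringsAsFactors = FALSE
)

# See which TaxaGroup values are present before deciding how to split
print(table(samples$TaxaGroup))

# Separate laboratory rows from everything else
is_lab      <- samples$TaxaGroup == "Laboratory analyses"
counts_long <- samples[!is_lab, ]
lab_long    <- samples[is_lab, ]

# After the split, SampleID/TaxonName is a unique key for the count rows
stopifnot(!any(duplicated(counts_long[c("SampleID", "TaxonName")])))

# Crosstab the count rows; sum is safe here because each cell has one value
counts <- tapply(counts_long$Value,
                 counts_long[c("SampleID", "TaxonName")],
                 sum)
print(counts)
```

The stopifnot() guard is one way to fail loudly if some other TaxaGroup also carries repeated TaxonName values, rather than letting an aggregation silently change the data.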

This is clearly inelegant - what other values might there be in TaxaGroup? Should we expect to retrieve them all?

How should such situations be handled in get_download()?