In fixing get_download() after the move to reshape2 I noted a couple of issues, one of which I have tried to fix in #29. In the examples for get_download() we have:
#' # Search for sites with "Thuja" pollen that are older than 8kyr BP and
#' # that are on the west coast of North America:
#' t8kyr.datasets <- get_datasets(taxonname='Thuja*', loc=c(-150, 20, -100, 60), ageyoung = 8000)
#'
#' # Returns 3 records (as of 04/04/2013), get dataset for the first record, Gold Lake Bog.
#' GOLDKBG <- get_download(t8kyr.datasets[[1]]$DatasetID)
When forming the counts component, the values in TaxonName may not be unique. For example, in this data set "Lycopodium tablets" occurs twice in TaxonName, differentiated only by the units field. However, we wish to crosstab on the TaxonName variable. When we do that, dcast() (and cast() before it) falls back to fun.aggregate = length, i.e. it counts how many times each element of TaxonName was present in each sample instead of returning the recorded values. This probably went unnoticed because the call was wrapped in suppressMessages(), and perhaps also because not all data sets have non-unique TaxonName values.
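A minimal sketch of the problem, using made-up data rather than the real Gold Lake Bog download. The column names SampleID, Units and Value are assumptions for illustration; only TaxonName comes from the actual data.

```r
library(reshape2)

# Hypothetical long-format data: "Lycopodium tablets" appears twice per
# sample, distinguished only by the Units column.
long <- data.frame(
  SampleID  = c(1, 1, 1, 2, 2, 2),
  TaxonName = rep(c("Thuja", "Lycopodium tablets", "Lycopodium tablets"), 2),
  Units     = rep(c("NISP", "NISP", "concentration"), 2),
  Value     = c(12, 1, 13500, 7, 1, 13500),
  stringsAsFactors = FALSE
)

# The SampleID/TaxonName pairs are not unique ...
has_dups <- any(duplicated(long[c("SampleID", "TaxonName")]))

# ... so, with no fun.aggregate supplied, dcast() aggregates with length().
# suppressMessages() hides the only hint that this has happened.
wide <- suppressMessages(
  dcast(long, SampleID ~ TaxonName, value.var = "Value")
)
```

Every cell of `wide` is then a row count, not a value: the Thuja column holds 1 for both samples rather than 12 and 7.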
From the example it seems Simon was aware that more than just pollen counts would be in the counts component, but if NeotomaDB doesn't enforce unique values in TaxonName then neotoma needs to handle this. What I did here was pull out only the rows where `TaxaGroup != "Laboratory analyses"` and use those for the counts component. Then I added a new component, lab.data, which holds the rows that matched `TaxaGroup == "Laboratory analyses"`.
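Roughly, the split looks like the sketch below. This is base R on made-up data, not the code in #29; SampleID, Units, Value and the "Vascular plants" group label are assumptions, and pasting TaxonName and Units together is just one possible way to make the lab variables unique.

```r
# Hypothetical long-format records carrying a TaxaGroup column.
records <- data.frame(
  SampleID  = c(1, 1, 1, 1, 2, 2, 2, 2),
  TaxonName = rep(c("Thuja", "Pinus",
                    "Lycopodium tablets", "Lycopodium tablets"), 2),
  TaxaGroup = rep(c("Vascular plants", "Vascular plants",
                    "Laboratory analyses", "Laboratory analyses"), 2),
  Units     = rep(c("NISP", "NISP", "NISP", "concentration"), 2),
  Value     = c(12, 40, 1, 13500, 7, 55, 1, 13500),
  stringsAsFactors = FALSE
)

# Separate laboratory rows from everything else.
is_lab     <- records$TaxaGroup == "Laboratory analyses"
count_long <- records[!is_lab, ]
lab_long   <- records[is_lab, ]

# Fail loudly, rather than silently aggregate, if duplicates remain.
stopifnot(!any(duplicated(count_long[c("SampleID", "TaxonName")])))

# Cross-tabulate the counts; each cell holds a single recorded value.
counts <- xtabs(Value ~ SampleID + TaxonName, data = count_long)

# Make the lab variables unique by combining name and units.
lab_long$Variable <- paste(lab_long$TaxonName, lab_long$Units, sep = " / ")
lab.data <- xtabs(Value ~ SampleID + Variable, data = lab_long)
```

The explicit duplicate check matters more than the split itself: whatever other TaxaGroup values turn up, it stops non-unique TaxonName values from being turned into row counts unnoticed.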
This is clearly inelegant - what other values might there be in TaxaGroup? Should we expect to retrieve them all?
How should such situations be handled in get_download()?