corrects some things that got munged in earlier pull
fixes get_publications() following move to reshape2 plus some other bugs
fixes get_download() following move to reshape2
Some explanation
get_publications()
There were some lines at the end of get_publications() that either didn't do anything or were killing the lists of multiple authors. In this pull request I fixed these, and in fixing something left over from the move to reshape2 I reinstated the documented behaviour. However, this behaviour can result in a lot of mostly unused columns when, for example, a single publication has many authors. In that case, every row has as many author columns as the publication with the largest number of authors (multiplied by 3, as there are three variables per author).
I wonder if it would be better to pull this author information out of the main data frame and return it in long format with a PublicationID variable. It would then be possible to link from this component to the main data frame using the ID. This would involve returning a list with the following components:
publications; the main data frame returned now, minus the AuthorX.X.X components
authors; a data frame in long format with the author data and a PublicationID variable.
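A minimal sketch of what that two-component return value could look like, using base R only. The toy data frame and its column names (Citation, Author1.Name, Author1.Order and so on) are hypothetical stand-ins for the AuthorX.X.X columns, and only two variables per author are shown for brevity; this is not the code in the pull request.

```r
# Toy wide data frame. Column names are hypothetical, chosen only to
# mimic the "one set of columns per author slot" layout described above.
pubs_wide <- data.frame(
  PublicationID = c(1L, 2L),
  Citation      = c("Pub A", "Pub B"),
  Author1.Name  = c("Smith", "Jones"),
  Author1.Order = c(1L, 1L),
  Author2.Name  = c(NA, "Lee"),
  Author2.Order = c(NA, 2L),
  stringsAsFactors = FALSE
)

author_cols <- grep("^Author", names(pubs_wide), value = TRUE)

# publications: the main data frame minus the author columns
publications <- pubs_wide[, setdiff(names(pubs_wide), author_cols), drop = FALSE]

# authors: one row per (publication, author), keyed by PublicationID
slots <- unique(sub("\\..*$", "", author_cols))  # "Author1", "Author2"
authors <- do.call(rbind, lapply(slots, function(s) {
  data.frame(
    PublicationID = pubs_wide$PublicationID,
    Name          = pubs_wide[[paste0(s, ".Name")]],
    Order         = pubs_wide[[paste0(s, ".Order")]],
    stringsAsFactors = FALSE
  )
}))
authors <- authors[!is.na(authors$Name), ]  # drop the empty author slots
authors <- authors[order(authors$PublicationID, authors$Order), ]
rownames(authors) <- NULL

result <- list(publications = publications, authors = authors)

# Linking back to the main data frame via the ID
linked <- merge(result$publications, result$authors, by = "PublicationID")
```

With this layout the number of columns no longer depends on the publication with the most authors; only the number of rows in the authors component grows.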
get_download()
In fixing this for the move to reshape2 I noted a couple of issues, one of which I have tried to fix. In the examples for get_download() we have:
#' # Search for sites with "Thuja" pollen that are older than 8kyr BP and
#' # that are on the west coast of North America:
#' t8kyr.datasets <- get_datasets(taxonname='Thuja*', loc=c(-150, 20, -100, 60), ageyoung = 8000)
#'
#' # Returns 3 records (as of 04/04/2013), get dataset for the first record, Gold Lake Bog.
#' GOLDKBG <- get_download(t8kyr.datasets[[1]]$DatasetID)
When forming the counts component, the TaxonName may not be unique. For example, in this data set Lycopodium tablets occurs twice in TaxonName, differentiated by the units field. However, we wish to crosstab on the TaxonName variable. When we do that, dcast() (and cast() before it) returns the data using fun.aggregate = length, i.e. it counts how many times each element of TaxonName was present in each sample instead of returning the recorded values. This probably went unnoticed because the call was wrapped in suppressMessages(), and perhaps also because not all data sets have non-unique TaxonName values.
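The fallback to length can be reproduced with a small toy example. The data frame below is made up for illustration (the column names SampleID, Units and Value and all numbers are assumptions, not the structure get_download() works with); only the dcast() behaviour is the point.

```r
library(reshape2)

# Toy long-format data: "Lycopodium tablets" appears twice per sample,
# distinguished only by a (hypothetical) Units column.
long <- data.frame(
  SampleID  = c(1, 1, 1, 2, 2, 2),
  TaxonName = rep(c("Pinus", "Lycopodium tablets", "Lycopodium tablets"), 2),
  Units     = rep(c("NISP", "NISP", "concentration"), 2),
  Value     = c(120, 50, 10000, 95, 48, 10000),
  stringsAsFactors = FALSE
)

# No fun.aggregate is given and the SampleID/TaxonName combinations are not
# unique, so dcast() falls back to length and emits a message saying so.
# suppressMessages() hides that message, as in the situation described above.
wide <- suppressMessages(
  dcast(long, SampleID ~ TaxonName, value.var = "Value")
)

# Every cell is now a count of rows, not the recorded value:
# the Pinus column holds 1 and the Lycopodium tablets column holds 2.
print(wide)
```

Note that the fallback affects every column, including taxa that were unique within each sample, so none of the original values survive the crosstab.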
From the example it seems Simon was aware that more than just pollen counts would be in the counts component, but if NeotomaDB doesn't enforce unique values in TaxonName then neotoma needs to handle this. What I did here was pull out only the rows where TaxaGroup != "Laboratory analyses" and use those for the counts component. Then I added a new component, lab.data, which holds those rows that matched TaxaGroup == "Laboratory analyses".
This is clearly inelegant - what other values might there be in TaxaGroup? Should we expect to retrieve them all?
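As a rough sketch of the split, assuming the counts component keeps the non-laboratory rows and lab.data keeps the laboratory rows. The toy data frame, its column names other than TaxonName and TaxaGroup, and the "Vascular plants" group label are illustrative assumptions; the real function works on the downloaded sample data.

```r
# Toy long-format sample data (values and most column names are made up).
samples <- data.frame(
  SampleID  = c(1, 1, 1, 2, 2, 2),
  TaxonName = rep(c("Pinus", "Lycopodium tablets", "Lycopodium tablets"), 2),
  TaxaGroup = rep(c("Vascular plants", "Laboratory analyses",
                    "Laboratory analyses"), 2),
  Value     = c(120, 50, 10000, 95, 48, 10000),
  stringsAsFactors = FALSE
)

is_lab <- samples$TaxaGroup == "Laboratory analyses"

counts_long <- samples[!is_lab, ]  # rows that feed the counts crosstab
lab_long    <- samples[is_lab, ]   # rows that feed the lab.data component

# A quick way to see which other TaxaGroup values a data set contains
group_sizes <- table(samples$TaxaGroup)
```

Tabulating TaxaGroup across a handful of real downloads would be one way to find out which other groups need handling.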
I think it is safe to merge this as it gets these functions working again. I will open issues for each of get_publications() and get_download() to host discussions on how to proceed with improvements down the line.