ropensci / neotoma

Programmatic R interface to the Neotoma Paleoecological Database.
https://docs.ropensci.org/neotoma

More fixes inter alia for move to reshape2 #29

Closed gavinsimpson closed 11 years ago

gavinsimpson commented 11 years ago

This pull request:

  1. corrects some things that got munged in an earlier pull request
  2. fixes get_publications() following the move to reshape2, plus some other bugs
  3. fixes get_download() following the move to reshape2

    Some explanation

    get_publications()

There were some lines at the end of get_publications() that either didn't do anything or were killing the lists of multiple authors. In this pull request I fixed these, and in fixing something left over from the move to reshape2 I reinstated the documented behaviour. However, this behaviour can result in a lot of mainly unused columns when, for example, there is a single publication with many authors. In that case, each row has as many columns in the Author info as the publication with the largest number of authors (multiplied by 3, as there are multiple variables per author).

I wonder if it would be better to pull this author information out of the main data frame and return it in long format with a variable for Publication ID. It would then be possible to link from this component to the main data frame using the ID. This would involve returning a list with the following components:

  1. publications; the main data frame returned now, minus the AuthorX.X.X components
  2. authors; a data frame in long format with the author data and a PublicationID variable.
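
To make the proposal concrete, here is a minimal sketch of the two-component return value in base R. The input structure `pubs.raw` and the field names `Citation`, `ContactID`, `ContactName` and `Order` are assumptions made up for illustration, not the actual API fields or the current get_publications() internals.

```r
# Hypothetical parsed API result: one list element per publication,
# each carrying a nested list of authors. All field names are assumed.
pubs.raw <- list(
  list(PublicationID = 1, Citation = "Smith 2001",
       Authors = list(
         list(ContactID = 10, ContactName = "Smith, A.", Order = 1))),
  list(PublicationID = 2, Citation = "Jones et al. 2005",
       Authors = list(
         list(ContactID = 11, ContactName = "Jones, B.", Order = 1),
         list(ContactID = 12, ContactName = "Brown, C.", Order = 2),
         list(ContactID = 10, ContactName = "Smith, A.", Order = 3)))
)

# Component 1: one row per publication, no author columns
publications <- do.call(rbind, lapply(pubs.raw, function(p) {
  data.frame(PublicationID = p$PublicationID,
             Citation = p$Citation,
             stringsAsFactors = FALSE)
}))

# Component 2: one row per (publication, author) pair, in long format
authors <- do.call(rbind, lapply(pubs.raw, function(p) {
  do.call(rbind, lapply(p$Authors, function(a) {
    data.frame(PublicationID = p$PublicationID,
               ContactID = a$ContactID,
               ContactName = a$ContactName,
               Order = a$Order,
               stringsAsFactors = FALSE)
  }))
}))

result <- list(publications = publications, authors = authors)

# The two components can be linked again through PublicationID
linked <- merge(result$publications, result$authors, by = "PublicationID")
```

With this layout the authors component always has the same number of columns, however many authors the largest publication has.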

    get_download()

In fixing this for the move to reshape2 I noted a couple of issues, one of which I have tried to effect a fix for. In the examples for get_download() we have

#' #  Search for sites with "Thuja" pollen that are older than 8kyr BP and
#' #  that are on the west coast of North America:
#' t8kyr.datasets <- get_datasets(taxonname='Thuja*', loc=c(-150, 20, -100, 60), ageyoung = 8000)
#'
#' #  Returns 3 records (as of 04/04/2013), get dataset for the first record, Gold Lake Bog.
#' GOLDKBG <- get_download(t8kyr.datasets[[1]]$DatasetID)

When forming the counts component, the TaxonName may not be unique. For example, in this data set Lycopodium tablets occurs twice in TaxonName, differentiated only by the units field. However, we wish to crosstab on the TaxonName variable. When we do that, dcast() (and cast() before it) falls back to fun.aggregate = length, i.e. it counts how many times each element of TaxonName is present in each sample instead of returning the recorded values. This probably went unnoticed because the call was wrapped in suppressMessages(), and perhaps also because not all data sets have non-unique TaxonName values.
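
The fallback is easy to reproduce on a small made-up data frame. This is only an illustrative sketch: the column names `SampleID`, `Units` and `Value` and all of the values are assumptions, not the real structure used inside get_download().

```r
library(reshape2)

# Made-up long-format data: "Lycopodium tablets" appears twice per sample,
# distinguished only by the Units column.
long <- data.frame(
  SampleID  = c(1, 1, 1, 2, 2, 2),
  TaxonName = c("Thuja", "Lycopodium tablets", "Lycopodium tablets",
                "Thuja", "Lycopodium tablets", "Lycopodium tablets"),
  Units     = c("NISP", "NISP", "concentration",
                "NISP", "NISP", "concentration"),
  Value     = c(12, 1, 13500, 7, 1, 13500),
  stringsAsFactors = FALSE
)

# SampleID/TaxonName pairs are not unique, so dcast() cannot place a
# single value in each cell
has.duplicates <- anyDuplicated(long[, c("SampleID", "TaxonName")]) > 0

# No fun.aggregate is given: dcast() emits a message and uses length,
# so every cell holds a row count rather than the recorded Value
wide <- dcast(long, SampleID ~ TaxonName, value.var = "Value")
```

Wrapping the dcast() call in suppressMessages() hides the only hint that the values were replaced by row counts, so a check like `has.duplicates` before casting is one way to catch this.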

From the example it seems Simon was aware that more than just pollen counts would be in the counts component, but if NeotomaDB doesn't enforce unique values in TaxonName then neotoma needs to handle this. What I did here was pull out only the rows where TaxaGroup != "Laboratory analyses" and use those for the counts component. Then I added a new component, lab.data, which holds the rows that match TaxaGroup == "Laboratory analyses".
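
A rough base R sketch of that split, assuming the intent is that counts keeps everything except the "Laboratory analyses" rows and lab.data keeps only those rows. The data frame `sample.data`, its column names other than TaxonName and TaxaGroup, and all values (including the "Vascular plants" group label) are made up for illustration.

```r
# Made-up long-format sample data for illustration only
sample.data <- data.frame(
  SampleID  = c(1, 1, 1, 2, 2, 2),
  TaxonName = c("Thuja", "Lycopodium tablets", "Lycopodium tablets",
                "Thuja", "Lycopodium tablets", "Lycopodium tablets"),
  TaxaGroup = c("Vascular plants", "Laboratory analyses",
                "Laboratory analyses", "Vascular plants",
                "Laboratory analyses", "Laboratory analyses"),
  Value     = c(12, 1, 13500, 7, 1, 13500),
  stringsAsFactors = FALSE
)

# Logical index of the laboratory rows
is.lab <- sample.data$TaxaGroup == "Laboratory analyses"

counts.rows <- sample.data[!is.lab, ]  # feeds the counts component
lab.rows    <- sample.data[is.lab, ]   # feeds the lab.data component

# After the split, TaxonName is unique within each sample for the counts rows
counts.unique <- anyDuplicated(counts.rows[, c("SampleID", "TaxonName")]) == 0

# Tabulating TaxaGroup shows which other groups a data set contains
group.table <- table(sample.data$TaxaGroup)
```

Note that in this example the lab rows still contain duplicated TaxonName values, so the lab.data component would need its own handling (for example, keeping the units alongside the name) before any crosstab.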

This is clearly inelegant - what other values might there be in TaxaGroup? Should we expect to retrieve them all?

I think it is safe to merge this as it gets these functions working again. I will open issues for each of get_publications() and get_download() to host discussions on how to proceed with improvements down the line.