ropensci / neotoma

Programmatic R interface to the Neotoma Paleoecological Database.
https://docs.ropensci.org/neotoma

get_download throws warning message #44

Closed SimonGoring closed 11 years ago

SimonGoring commented 11 years ago

For some sites (Bondi Section, Billy's Lake and others) the call get_download returns a message: Aggregation function missing: defaulting to length

I'm not entirely sure why. It doesn't seem to affect the output, but it would be nice to resolve.

sckott commented 11 years ago

My guess is it happens when you call dcast on lines 154 and 168

gavinsimpson commented 11 years ago

That suggests there is more than one element per "group" when you are casting. Drop in a browser() somewhere before the first dcast() call and explore the objects you are converting. You could add fun.aggregate = length (I think that is the arg name), which IIRC would count the elements so you can see where the duplicates are that dcast() wants to aggregate over.
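
A minimal sketch of that diagnostic, assuming a long-format table with made-up column names (`depth`, `taxon`, `value`) rather than the actual objects inside get_download. It casts with `fun.aggregate = length` so each cell holds the number of rows that fell into it:

```r
library(reshape2)

# Hypothetical long-format counts; "Unknown" appears twice at depth 10.
counts <- data.frame(
  depth = c(10, 10, 10, 20),
  taxon = c("Pinus", "Unknown", "Unknown", "Pinus"),
  value = c(12, 3, 1, 15),
  stringsAsFactors = FALSE
)

# Count the rows that land in each depth x taxon cell.
n_per_cell <- dcast(counts, depth ~ taxon,
                    value.var = "value", fun.aggregate = length)

# Any cell greater than 1 marks a duplicate that dcast() would aggregate.
dup_cells <- which(as.matrix(n_per_cell[, -1]) > 1, arr.ind = TRUE)
```

Here `dup_cells` gives the row and column positions (within the taxon columns) of every cell that received more than one value.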

SimonGoring commented 11 years ago

So it actually is screwing things up then, it just looks like it's not. I'd be getting a value anyway, but it would be the wrong one. Okay, thanks for the tip!

gavinsimpson commented 11 years ago

Yeah; something is going wrong earlier if you are expecting one element per cell when you reshape with dcast(). IIRC I faced something similar when I was stripping plyr code out of the existing functions. Something that caught me out is that the variable code need not be unique: there are multiple Lycopodium entries in some tables, for example, differentiated only by the units variable. It was a pain to work with. I ended up concatenating the variable name with the units into a single string to get round that issue, which was made worse by my not being familiar with the guts of Neotoma (the DB).
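
The concatenation workaround could look roughly like this. The table, column names and unit labels below are invented for illustration and are not taken from the Neotoma API:

```r
# Hypothetical sample table: two Lycopodium entries that differ only in units.
samples <- data.frame(
  depth    = c(5, 5, 5),
  variable = c("Lycopodium", "Lycopodium", "Betula"),
  units    = c("unit A", "unit B", "unit A"),
  value    = c(40, 12000, 7),
  stringsAsFactors = FALSE
)

# Build a composite key so each depth x key pair is unique.
samples$key <- paste(samples$variable, samples$units, sep = " | ")

# TRUE when no depth x key combination repeats.
keys_unique <- !any(duplicated(samples[, c("depth", "key")]))
```

Casting on `key` instead of `variable` then leaves one value per cell, so no aggregation function is needed.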

SimonGoring commented 11 years ago

Right, that's probably it, but it'll take some time to figure it out. I'm going to leave it for now, but it's obviously a pretty important fix.

SimonGoring commented 11 years ago

For some sites it looks like there can be two kinds of 'Unknown': unknowns that are in the Holocene/late-Glacial sediment and seem to be from that time period, and 'anachronic' unknowns (usually in the earliest sediments) that appear to be re-worked palynomorphs from earlier sediments (e.g., pre-Pleistocene palynomorphs). In this case 'Variable Context' should sort out the difference, so I'll append it to the taxon name for now.

SimonGoring commented 11 years ago

Variable Context works for some, but at site "Akulinin Exposure" (datasetID = 19) there is a set of Indeterminable grains with modifiers, and the API isn't returning the modifiers. For now I'm going to set the aggregation to 'sum', since missing modifiers seem to be the reason for most of the warnings (I've checked a bunch, but it's time consuming). I'm also checking with the database maintainers to see if we can get the "Modif" field passed through the API.

I'm going to close this issue once I commit, and then re-open a new issue.

gavinsimpson commented 11 years ago

Simon, I think that is quite dangerous unless the detail is explained to the user about what was and was not merged/aggregated. If the DB considers these to be separable, unique units, then neotoma could preserve that simply by appending an incrementing integer to the Variable: foo1, foo2, foo3 etc.

The main issue with your proposal is that should the DB change what is returned, neotoma would just silently aggregate things, which is bad.

SimonGoring commented 11 years ago

Okay, I totally get that. I'm going to modify my previous comment.

The problem with your solution is that it's impossible to tell which 'foo' a variable should be. For example, it is possible to have one depth with two entries whose taxon name is Indeterminable, one that should have the modifier crumpled and another the modifier folded (but the modifiers aren't passed by the API); these would get assigned foo1 and foo2 at that depth.

Since 'modifier' isn't actually passed through the API there's no way of telling which is which, and if only 'folded' is present at another depth, there's no way of knowing whether it should be foo1 or foo2 or some other foo. I've looked at a bunch of the datasets, and the duplicates seem to be due to modifiers, so aggregating them shouldn't strongly affect the true counts; it's more likely a 'book-keeping' issue. That said, it would be far more satisfying to distinguish them by passing the information from the API to neotoma than to use this somewhat ad hoc method.
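
To make the ambiguity concrete, here is a small sketch with invented data. The `modifier` column is included only so the mismatch is visible; in practice it is exactly the field the API does not return:

```r
# Hypothetical counts. The 'modifier' column is shown only to make the
# problem visible; per this thread the API does not return it.
counts <- data.frame(
  depth    = c(10, 10, 20),
  taxon    = c("Indeterminable", "Indeterminable", "Indeterminable"),
  modifier = c("crumpled", "folded", "folded"),
  value    = c(4, 2, 5),
  stringsAsFactors = FALSE
)

# Number repeated taxon names within each depth: name1, name2, ...
idx <- ave(seq_along(counts$taxon), counts$depth, counts$taxon,
           FUN = seq_along)
counts$label <- paste0(counts$taxon, idx)
```

The 'folded' row at depth 20 is labelled Indeterminable1, the same label the 'crumpled' row received at depth 10, so a column built from these labels would mix the two modifiers.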

So, for now I'm going to suggest that:

  1. I continue with my proposal to sum, until the API passes the Modif field
  2. I issue a warning when these duplicates appear that reports both the number of duplicates and the TaxonName of the duplicated taxa.

gavinsimpson commented 11 years ago

I see what you mean; foo1 for one depth interval could actually correspond to foo2 at another etc. Your new suggestion seems workable. Include some information in the warning explaining that these TaxonName values might look the same but the DB distinguishes them via a modifier that is not yet available through the API, otherwise users might wonder what is being done.

SimonGoring commented 11 years ago

Implemented the fix as described in the comment above. Warning messages appear for both lab and count data. I moved charcoal into lab data because it is often affected by this issue.
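
For readers following along, the agreed approach (sum the duplicates and warn with their number and TaxonName) might be sketched as below. `sum_duplicates()` and the column names are hypothetical and are not the code that was committed to neotoma:

```r
# Hypothetical helper, not the package's code: sum duplicated depth x taxon
# rows and warn with the number of duplicates and the affected taxon names.
sum_duplicates <- function(counts) {
  dup <- duplicated(counts[, c("depth", "taxon")])
  if (any(dup)) {
    taxa <- unique(counts$taxon[dup])
    warning(sprintf(
      paste("%d duplicated row(s) were summed for: %s.",
            "The database may distinguish these by a modifier",
            "that is not available through the API."),
      sum(dup), paste(taxa, collapse = ", ")
    ), call. = FALSE)
  }
  # One summed value per depth x taxon combination.
  aggregate(value ~ depth + taxon, data = counts, FUN = sum)
}

# Invented example: two Indeterminable rows at depth 10.
example_counts <- data.frame(
  depth = c(10, 10, 20),
  taxon = c("Indeterminable", "Indeterminable", "Pinus"),
  value = c(4, 2, 9),
  stringsAsFactors = FALSE
)
summed <- suppressWarnings(sum_duplicates(example_counts))
```

Calling `sum_duplicates(example_counts)` without `suppressWarnings()` emits the warning, so the aggregation is never silent.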