ropensci / spocc

Species occurrence data toolkit for R
https://docs.ropensci.org/spocc

Return additional fields from occ2df? #103

Closed cboettig closed 9 years ago

cboettig commented 9 years ago

It would be great if occ2df could return more fields, such as date of the observation. I suppose this can always be done manually at the moment from the occ$data field.

sckott commented 9 years ago

Yeah, could be done manually, though we could easily add a parameter to let users select which fields to use in the rbind'ed data.frame. Or we could define a new S3 method for rbind to act on the output of an occ() call? That way occ2df stays the same.
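A minimal sketch of the field-selection idea, using only base R. `bind_fields()` and its `fields` argument are hypothetical names for illustration and are not part of spocc; the toy data.frames and the `eventDate` column name are made up.

```r
# Hypothetical helper (not spocc's API): bind per-source data.frames,
# keeping only the fields the caller asks for.
bind_fields <- function(sources, fields = c("name", "longitude", "latitude")) {
  pieces <- lapply(names(sources), function(src) {
    df <- sources[[src]]
    # A requested field that this source lacks becomes a column of NA
    for (f in setdiff(fields, names(df))) df[[f]] <- rep(NA, nrow(df))
    out <- df[, fields, drop = FALSE]
    out$prov <- src  # record which source each row came from
    out
  })
  do.call(rbind, pieces)
}

# Made-up toy data standing in for two sources
gbif <- data.frame(name = "Accipiter striatus", longitude = -97.2,
                   latitude = 32.9, eventDate = "2014-01-25",
                   stringsAsFactors = FALSE)
inat <- data.frame(name = "Accipiter striatus", longitude = -74.1,
                   latitude = 41.6, stringsAsFactors = FALSE)

res <- bind_fields(list(gbif = gbif, inat = inat),
                   fields = c("name", "longitude", "latitude", "eventDate"))
```

Under these assumptions the caller decides which columns end up in the combined table, and sources that lack a requested column contribute NA for it.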

cboettig commented 9 years ago

Ah, I see there's a lot of heterogeneity in the fields returned by the different databases. It seems like a date for the occurrence would be common across most of the sources, though. Standardizing the date format across the different databases and returning a column of those values seems like it would be pretty useful.

Still need to wrap my head around the real details of data cleaning in the occurrence data; e.g. I don't have any clue what the issues labels refer to in the GBIF data.

sckott commented 9 years ago

Right, the heterogeneity is the limiting bit that I think makes it appealing to give back the common subset of columns. Date is a good one. I could go through and see what others are in common and include those too.

For the issues column in GBIF data, this might help: http://cran.r-project.org/web/packages/rgbif/vignettes/issues_vignette.html. But I don't think any of that can be used here, since other data sources do not have the same set of flags.

karthik commented 9 years ago

Just chatted with @sckott about this. You'd think the date of collection would be standard, esp for specimens but is unfortunately not. It's somewhere in AntWeb but not exposed at the moment. We could just return everything from every db and fill the missing one (like plyr's rbind.fill) with missing values. Would that be a good direction to go? It would be a bit challenging since some fields (like various morphometric measurements for ant specimens) might make no sense for other types of collections.

Another option is to create a dictionary of common things we're interested in and pull those out into common fields (where available)
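The dictionary option could look roughly like the sketch below. Everything in it is an assumption for illustration: `field_map`, `to_common()`, the source labels `src_a`/`src_b`, and the per-source column names are hypothetical and do not describe what any database actually returns.

```r
# Hypothetical dictionary: common field name -> column name in each source.
# The source column names are illustrative assumptions.
field_map <- list(
  src_a = c(name = "scientificName", date = "eventDate"),
  src_b = c(name = "taxon_name",     date = "observed_on")
)

# Pull the mapped columns out of one source; a column the source
# does not have is filled with NA.
to_common <- function(df, map) {
  cols <- lapply(map, function(col) {
    if (col %in% names(df)) df[[col]] else rep(NA, nrow(df))
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up toy data
a <- data.frame(scientificName = "Accipiter striatus",
                eventDate = "2014-01-25", stringsAsFactors = FALSE)
b <- data.frame(taxon_name = "Accipiter striatus",
                stringsAsFactors = FALSE)  # no date column at all

ca <- to_common(a, field_map$src_a); ca$prov <- "src_a"
cb <- to_common(b, field_map$src_b); cb$prov <- "src_b"
common <- rbind(ca, cb)
```

Compared with binding every column from every source, this keeps the combined table narrow and gives each common field a single name, at the cost of maintaining the dictionary by hand.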

cboettig commented 9 years ago

Yup, I think missing values is the way to go here. After all, it's not python; R has explicit support for missing data.

sckott commented 9 years ago

I'm not sure rbinding all data.frames together blindly is useful for users, since many fields in source A have no analogs in sources B-F, and even if they do they likely aren't named the same thing.

The mapping is what we've been doing so far for occ2df(): we only had name, lat, long, and source before, just added date now, and we can add more. The columns, I think, could just be those that have at least some (at least 1?) analogs in other data sources.

sckott commented 9 years ago

cause TBH it's not that hard for users to rbind stuff together themselves so seems occ2df should cover the common elements and let users do the other stuff

cboettig commented 9 years ago

Right, I think we'd want to focus on specific columns with clear application; and also those which involve more cleaning than a simple rbind. I think date is a good candidate here; I think GBIF returns this data broken out over several different columns(?) (year, month, day), while others return it in a single column. Merging the date columns appropriately and converting into a Date class is thus more useful.
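As a rough sketch of the merging step, assuming a source really does return the date split over `year`, `month`, and `day` columns (the thread leaves that as an open question), base R's `ISOdate()` can combine them. The helper name `merge_ymd()` and the toy values are made up.

```r
# Hypothetical helper: combine year/month/day columns into one Date column.
merge_ymd <- function(df) {
  # ISOdate() returns NA when any part is missing instead of raising an error
  df$date <- as.Date(ISOdate(df$year, df$month, df$day))
  df
}

# Made-up toy records; the third one has no year
split_dates <- data.frame(year  = c(2014, 2011, NA),
                          month = c(1, 10, 5),
                          day   = c(25, 4, 12))
merged <- merge_ymd(split_dates)
```

Rows with an incomplete date come back as NA, so they can sit next to dates from other sources in a single Date column.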

I wonder if there's also a way to capture a reference back to the database source for each row. e.g. if you discover a particular row is an outlier in some way and want to see what is going on, it would be useful to call up the full entry information. Currently the only provenance is the source database.

karthik commented 9 years ago

I'm not sure rbinding all data.frames together blindly is useful for users, since many fields in source A have no analogs in sources B-F, and even if they do they likely aren't named the same thing.

Agree :100: with Scott here. That's why I was suggesting curating this somehow so we only pick fields that make sense in a biodiversity informatics context.

Merging the date columns appropriately and converting into a Date class is thus more useful.

This is great and I like this as a general philosophy for our aggregator packages. The whole idea is that we quietly remove minor annoyances that a novice R user can spend hours and days struggling with, like this kind of data cleaning across various sources and getting the dates into an R Date class.

sckott commented 9 years ago

AFAIK dates for each source are just in one column, so that's easy, but are in many different formats - so converting those to one date format will be useful to users indeed :)
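One way to fold several date formats into a single Date column is to try a list of candidate formats in turn. This is only a sketch: `parse_dates()` is a hypothetical name, and the two formats are assumptions for illustration, not formats confirmed for any particular source.

```r
# Hypothetical helper: try each candidate format until one parses.
parse_dates <- function(x, formats = c("%Y-%m-%d", "%m/%d/%Y")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out) & !is.na(x)  # only retry values not parsed yet
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

# Made-up inputs: ISO timestamp, US-style date, junk, and a missing value
raw <- c("2014-01-25 23:00:00", "03/15/2015", "not a date", NA)
parsed <- parse_dates(raw)
```

Values that match none of the formats stay NA rather than stopping the conversion, which fits the missing-value approach discussed earlier in the thread. Note that this drops the time of day; keeping it would need a date-time class instead of Date.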

I wonder if there's also a way to capture a reference back to the database source for each row. e.g. if you discover a particular row is an outlier in some way and want to see what is going on, it would be useful to call up the full entry information. Currently the only provenance is the source database.

Good idea, the first thing that comes to mind is there should be a unique occurrence ID for each row of data from each source. I imagine that's the way to go, and not something like row names/numbers as tables can get sorted, etc. Will have a look see

cboettig commented 9 years ago

Good idea, the first thing that comes to mind is there should be a unique occurrence ID for each row of data from each source. I imagine that's the way to go, and not something like row names/numbers as tables can get sorted, etc. Will have a look see

Yup, using a unique record id provided by the database would be ideal; I suspect most of them have them. Otherwise I agree that we'd be better off generating a unique id (can use the uuid package or just an indexing value) for each row that would let us index and query the metadata associated with that row quickly.
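A minimal sketch of the fallback, generating an identifier per row with a plain indexing value. `add_row_id()`, the `row_id` column, and the `src_a` label are hypothetical; `UUIDgenerate()` from the uuid package could presumably be swapped in where globally unique values are needed.

```r
# Hypothetical helper: attach a per-row id made of the source name
# plus a zero-padded index.
add_row_id <- function(df, prov) {
  df$row_id <- sprintf("%s-%06d", prov, seq_len(nrow(df)))
  df
}

# Made-up toy records
recs <- data.frame(name = c("Accipiter striatus", "Accipiter striatus"),
                   latitude = c(32.86, 42.25),
                   stringsAsFactors = FALSE)
recs <- add_row_id(recs, "src_a")

# The id survives sorting, unlike row numbers
shuffled <- recs[order(recs$latitude, decreasing = TRUE), ]
hit <- shuffled[shuffled$row_id == "src_a-000001", ]
```

An index like this is only stable within one query result, so an id supplied by the database itself is still preferable wherever one exists.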

karthik commented 9 years ago

Ones that I'm aware of don't return unique ids. We're better off relying on uuid.

sckott commented 9 years ago

i think this is done, returning now

e.g.,

out <- occ(query='Accipiter striatus', from=c('gbif','bison','ecoengine','ebird','inat','vertnet'), 
   gbifopts=list(hasCoordinate=TRUE), limit=2)
occ2df(out)
                       name       longitude      latitude      prov                date                                key
1        Accipiter striatus        -97.1993      32.86027      gbif 2014-01-25 23:00:00                          891038901
2        Accipiter striatus       -76.33708      42.25353      gbif 2014-01-06 23:00:00                         1037859368
3        Accipiter striatus        -82.8881       34.7258     bison 2011-10-04 00:00:00                          576953110
4        Accipiter striatus       -76.76009      34.72581     bison                <NA>                          226701218
5        Accipiter striatus      -74.078064     41.596959      inat 2015-03-15 00:00:00                            1335684
6        Accipiter striatus -103.9645328715 20.7044744831      inat 2015-03-06 00:00:00                            1298570
7        Accipiter striatus     -75.5065441    40.3698802     ebird 2015-03-24 15:06:00                            L129824
8        Accipiter striatus     -68.6627269    44.8845268     ebird 2015-03-24 15:00:00                           L2589205
9  Accipiter striatus velox     -92.1067886    46.7832375 ecoengine 1894-09-17 00:00:00                    LACM:Birds:5888
10 Accipiter striatus velox     -92.1067886    46.7832375 ecoengine 1894-09-15 00:00:00                    LACM:Birds:5892
11       Accipiter striatus      -74.686217     41.898178   vertnet 2009-07-11 00:00:00 urn:catalog:AMNH:Birds:SKIN-836988
12       Accipiter striatus     -117.179457     46.728691   vertnet 1935-05-07 00:00:00       urn:catalog:CRCM:Birds:07-51