Closed cboettig closed 9 years ago
Yeah, could be done manually, though we can easily add a parameter to allow users to select what fields to use in the rbind'ed data.frame. Or we could define a new S3 method for rbind to act on the output of an occ() call? That way occ2df stays the same.
Ah, I see there's a lot of heterogeneity in the fields returned by the different databases. It seems like a date for the occurrence would be common across most of the sources though. Standardizing the date format across the different databases and returning a column of those values seems like it would be pretty useful.
I still need to wrap my head around the real details of data cleaning in the occurrence data; e.g. I don't have any clue what the issues labels refer to in the GBIF data.
Right, the heterogeneity is the limiting bit that I think makes it appealing to give back the common subset of columns. Date is a good one. I could go through and see what others are in common and include those too.
For the issues column in GBIF data, this might help: http://cran.r-project.org/web/packages/rgbif/vignettes/issues_vignette.html. But I don't think any of that can be used here, since other data sources do not have the same set of flags.
Just chatted with @sckott about this. You'd think the date of collection would be standard, esp for specimens, but it unfortunately is not. It's somewhere in AntWeb but not exposed at the moment. We could just return everything from every db and fill the missing ones (like plyr's rbind.fill) with missing values. Would that be a good direction to go?
It would be a bit challenging since some fields (like various morphometric measurements for ant specimens) might make no sense for other types of collections.
Another option is to create a dictionary of common things we're interested in and pull those out into common fields (where available)
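For readers following along, the "return everything and fill with missing values" option can be sketched in base R. This is only an illustration of the idea, not what spocc does: `rbind_fill_na` is a hypothetical helper, and the two input data.frames are made-up records.

```r
# Hypothetical helper (not part of spocc): bind data.frames that have
# different columns, padding absent columns with NA, similar in spirit
# to plyr's rbind.fill but using base R only.
rbind_fill_na <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(NA, nrow(df))   # add the absent column as NA
    }
    df[all_cols]                       # same column order everywhere
  })
  do.call(rbind, padded)
}

# Made-up records, for illustration only
gbif_like   <- data.frame(name = "Accipiter striatus", year = 2014L,
                          stringsAsFactors = FALSE)
antweb_like <- data.frame(name = "Formica rufa", head_width_mm = 1.9,
                          stringsAsFactors = FALSE)

combined <- rbind_fill_na(list(gbif_like, antweb_like))
```

The result has one column per field seen in any source, which shows the concern raised above: source-specific fields such as `head_width_mm` end up mostly NA for every other source.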
Yup, I think missing values is the way to go here. After all, it's not python; R has explicit support for missing data.
— Reply to this email directly or view it on GitHub https://github.com/ropensci/spocc/issues/103#issuecomment-72757359.
I'm not sure rbinding all data.frames together blindly is useful for users, since many fields in source A have no analogs in sources B-F, and even if they do they likely aren't named the same thing.
The mapping is what we've been doing so far for occ2df(), where we only had name, lat, long, and source before; we just added date, and we can add more. 'Cause TBH it's not that hard for users to rbind stuff together themselves, so it seems occ2df should cover the common elements and let users do the other stuff.
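The mapping approach described here could look roughly like the sketch below. Everything in it is an assumption for illustration: the sources `src_a` and `src_b`, their column names, and the `standardize` helper are hypothetical and are not the real schemas or spocc internals.

```r
# Hypothetical dictionary: common field name -> column name in each source.
# Source names and column names are illustrative assumptions only.
field_map <- list(
  src_a = c(name = "scientificName", longitude = "decimalLongitude",
            latitude = "decimalLatitude", date = "eventDate"),
  src_b = c(name = "taxon", longitude = "lon", latitude = "lat",
            date = "observed_on")
)

# Hypothetical helper: keep only the mapped columns and rename them.
standardize <- function(df, source) {
  map <- field_map[[source]]
  out <- df[, unname(map), drop = FALSE]  # drop source-specific extras
  names(out) <- names(map)                # rename to the common names
  out$prov <- source                      # record where the row came from
  out
}

# Made-up records with source-specific extra columns
a <- data.frame(scientificName = "Accipiter striatus",
                decimalLongitude = -97.2, decimalLatitude = 32.9,
                eventDate = "2014-01-25", basisOfRecord = "OBSERVATION",
                stringsAsFactors = FALSE)
b <- data.frame(taxon = "Accipiter striatus", lon = -74.1, lat = 41.6,
                observed_on = "2015-03-15", quality_grade = "research",
                stringsAsFactors = FALSE)

common <- rbind(standardize(a, "src_a"), standardize(b, "src_b"))
```

Adding a new common field then means adding one entry per source to the dictionary, while anything unmapped stays available in the raw per-source data.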
Right, I think we'd want to focus on specific columns with clear application; and also those which involve more cleaning than a simple rbind. I think date is a good candidate here; I think GBIF returns this data broken out over several different columns(?) (year, month, day), while others return it in a single column. Merging the date columns appropriately and converting into a Date class is thus more useful.
I wonder if there's also a way to capture a reference back to the database source for each row. e.g. if you discover a particular row is an outlier in some way and want to see what is going on, it would be useful to call up the full entry information. Currently the only provenance is the source database.
> I'm not sure rbinding all data.frames together blindly is useful for users, since many fields in source A have no analogs in sources B-F, and even if they do they likely aren't named the same thing.
Agree :100: with Scott here. That's why I was suggesting curating this somehow so we only pick fields that make sense in a biodiversity informatics context.
> Merging the date columns appropriately and converting into a Date class is thus more useful.
This is great and I like this as a general philosophy for our aggregator packages. The whole idea is that we quietly remove minor annoyances that a novice R user can spend hours and days struggling with. That includes this kind of data cleaning across sources, getting the dates into an R Date class, etc.
AFAIK dates for each source are just in one column, so that's easy, but are in many different formats - so converting those to one date format will be useful to users indeed :)
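As a rough sketch of what converting many formats into one Date class could involve: the helper below tries a list of candidate formats in order and keeps the first one that parses. `parse_any_date` and the list of formats are assumptions for illustration; the formats the sources actually return would need to be checked.

```r
# Hypothetical helper: try several date formats in turn and keep the
# first successful parse for each element. The candidate formats are
# assumptions, not the formats the data sources are known to use.
parse_any_date <- function(x,
                           formats = c("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)                          # only retry unparsed values
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

# Made-up inputs in mixed formats; the last one cannot be parsed
raw <- c("2014-01-25 23:00:00", "2015/03/06", "03/15/2015", "not a date")
dates <- parse_any_date(raw)
```

Values that match none of the formats come back as NA, so they stay visible as missing rather than being silently guessed. Note that `as.Date` drops the time of day; keeping times would need `as.POSIXct` and a decision about time zones.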
> I wonder if there's also a way to capture a reference back to the database source for each row. e.g. if you discover a particular row is an outlier in some way and want to see what is going on, it would be useful to call up the full entry information. Currently the only provenance is the source database.
Good idea, the first thing that comes to mind is there should be a unique occurrence ID for each row of data from each source. I imagine that's the way to go, and not something like row names/numbers as tables can get sorted, etc. Will have a look see
Yup, using a unique record id provided by the database would be ideal; I suspect most of them have them. Otherwise I agree that we'd be better off generating a unique id (can use the uuid package or just an indexing value) for each row that would let us index and query the metadata associated with that row quickly.
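A minimal sketch of that fallback, using a plain indexing value rather than the uuid package so it runs in base R. The helper `add_key`, the `id` column name, and the sample rows are all hypothetical.

```r
# Hypothetical helper: use the source's own record id where one exists,
# otherwise generate a "<prov>:<row index>" key so every row can be
# traced back. Assumes the data.frame has a `prov` column.
add_key <- function(df, id_col = "id") {
  if (id_col %in% names(df)) {
    ids <- as.character(df[[id_col]])
  } else {
    ids <- rep(NA_character_, nrow(df))
  }
  gaps <- is.na(ids) | ids == ""               # rows lacking a usable id
  ids[gaps] <- paste0(df$prov[gaps], ":", seq_len(nrow(df))[gaps])
  df$key <- ids
  df
}

# Made-up rows: one with a source id, two without
records <- data.frame(prov = c("gbif", "ebird", "ebird"),
                      id = c("891038901", NA, ""),
                      stringsAsFactors = FALSE)
keyed <- add_key(records)
```

A generated index is only stable within one result set, so a source-provided identifier is preferable wherever it exists; a UUID would avoid collisions across result sets but still could not be used to look the record up in the source database.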
Ones that I'm aware of don't return unique ids. We're better off relying on uuid.
i think this is done, returning now
e.g.,
    library(spocc)
    # Query six sources for the same species, two records each
    out <- occ(query = 'Accipiter striatus',
               from = c('gbif', 'bison', 'ecoengine', 'ebird', 'inat', 'vertnet'),
               gbifopts = list(hasCoordinate = TRUE), limit = 2)
    occ2df(out)

       name                     longitude       latitude      prov      date                key
    1  Accipiter striatus       -97.1993        32.86027      gbif      2014-01-25 23:00:00 891038901
    2  Accipiter striatus       -76.33708       42.25353      gbif      2014-01-06 23:00:00 1037859368
    3  Accipiter striatus       -82.8881        34.7258       bison     2011-10-04 00:00:00 576953110
    4  Accipiter striatus       -76.76009       34.72581      bison     <NA>                226701218
    5  Accipiter striatus       -74.078064      41.596959     inat      2015-03-15 00:00:00 1335684
    6  Accipiter striatus       -103.9645328715 20.7044744831 inat      2015-03-06 00:00:00 1298570
    7  Accipiter striatus       -75.5065441     40.3698802    ebird     2015-03-24 15:06:00 L129824
    8  Accipiter striatus       -68.6627269     44.8845268    ebird     2015-03-24 15:00:00 L2589205
    9  Accipiter striatus velox -92.1067886     46.7832375    ecoengine 1894-09-17 00:00:00 LACM:Birds:5888
    10 Accipiter striatus velox -92.1067886     46.7832375    ecoengine 1894-09-15 00:00:00 LACM:Birds:5892
    11 Accipiter striatus       -74.686217      41.898178     vertnet   2009-07-11 00:00:00 urn:catalog:AMNH:Birds:SKIN-836988
    12 Accipiter striatus       -117.179457     46.728691     vertnet   1935-05-07 00:00:00 urn:catalog:CRCM:Birds:07-51
It would be great if occ2df could return more fields, such as date of the observation. I suppose this can always be done manually at the moment from the occ$data field.