ropensci / rfishbase

R interface to the fishbase.org database
https://docs.ropensci.org/rfishbase

Improve discoverability of tables #52

Open cboettig opened 9 years ago

cboettig commented 9 years ago

As @layamene so effectively put it in her review (#46), most tables are effectively "hidden." Tables should be more discoverable.

FishBase has too many tables. We need:

cboettig commented 9 years ago

Okay, two different approaches here, not sure what is best. Would welcome feedback from @sckott @rBatt @jebyrnes @jafflerbach and others here:

Option 1: "R way"

This is mostly what we've done so far. Each table has a corresponding R function (species, ecology, etc.), which has R-style documentation (so far too sparse) describing the table and its columns. Users learn which tables exist from higher-level documentation like the vignette, or by mining through the package manual or package namespace, etc.

Option 2: "API way"

Use the heartbeat() function (maybe needs a more intuitive name?) to get a list of all the available tables, and then write an explain() or describe() function which could get a human-readable description of the table and its columns from the API.
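To make Option 2 concrete, here is a minimal sketch of what a table listing and a describe-style helper could look like. It does not call the real API: `table_docs` is a hypothetical local stand-in for whatever metadata the API would return, the descriptions are placeholders, and `list_tables()` and `describe_table()` are illustrative names only.

```r
# Hypothetical stand-in for table metadata that an API might return.
# The descriptions are placeholders, not real FishBase documentation.
table_docs <- data.frame(
  table = c("species", "species", "ecology"),
  column = c("SpecCode", "Vulnerability", "FoodTroph"),
  description = c(
    "placeholder description of SpecCode",
    "placeholder description of Vulnerability",
    "placeholder description of FoodTroph"
  ),
  stringsAsFactors = FALSE
)

# List every table that has metadata (hypothetical helper).
list_tables <- function(docs = table_docs) {
  unique(docs$table)
}

# Return the columns and descriptions for one table (hypothetical helper).
describe_table <- function(name, docs = table_docs) {
  out <- docs[docs$table == name, c("column", "description")]
  if (nrow(out) == 0) stop("unknown table: ", name)
  rownames(out) <- NULL
  out
}

list_tables()
describe_table("species")
```

With something like this, a user could discover tables from the console without reading the manual first.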

Option 3: Something better

Perhaps more ideally, we should do a whole bunch more table joins on the server side to reduce the number of endpoints/tables one has to query. This would leave the user more to filter out, but would make discovery easier. I'm very interested in whether most users prefer the "one big table" or the "many, many smaller tables" approach. I think the latter is more intuitive when you get started, but turns out to be harder to use (now which table did I put that in again?) unless you're really good at table joins etc. I'd love to hear more thoughts.
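As a rough illustration of the trade-off, the sketch below contrasts the two workflows on toy data. The tables and values are made up for the example; only the column names SpecCode, Vulnerability and FoodTroph come from this thread.

```r
# Toy stand-ins for two small tables (values are invented).
species_tbl <- data.frame(SpecCode = c(1, 2, 3), Vulnerability = c(10, 20, 30))
ecology_tbl <- data.frame(SpecCode = c(1, 2), FoodTroph = c(3.5, 4.0))

# "Many smaller tables": the user has to know both tables and the join key.
joined <- merge(species_tbl, ecology_tbl, by = "SpecCode", all.x = TRUE)

# "One big table": if the join were already done server-side,
# the user would only need to filter rows and pick columns.
wanted <- subset(joined, Vulnerability > 15, select = c(SpecCode, FoodTroph))
wanted
```

The left join keeps species that have no ecology record, so the big table contains NA values the user then has to handle.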

Option 4: something else

Combination of the above or something else entirely. Thoughts anyone?

sckott commented 9 years ago

Sorry to chime in so late here, Carl. Seems like we're trying to solve this a bit on the API side by making new routes: a /listfields route for finding fields across all tables, and a /docs route for getting details on all tables and more detailed metadata on each table.

I'm curious to hear what rfishbase users would like to see from their perspective (i.e., the functions and their parameters exposed in rfishbase). Is it good to be able to dig into each table? Or would you prefer a simpler interface that simply gets all possible data, so you can then sieve out the data you want in R? @rBatt @jebyrnes @jafflerbach @luigi-asprino @mlandis

rBatt commented 9 years ago

I haven't had much time to play around here lately, so my apologies if my suggestion mostly reflects an ignorance.

I like Carl's suggestion 3. I tend to prefer "one big table" setups, but in that case it's important to know the logical associations among the columns (e.g., some columns would be "oxygen" columns etc). I guess you could put the data in a "long" format (a column named "variable", a column named "vrbl_category", and one for "value"; handy for analysis, bad for subsetting and memory).
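For reference, a minimal base R sketch of the "long" layout described above. The wide column names (oxygen_min, oxygen_max, temp_min) and their values are hypothetical, and the category is simply taken from the part of the name before the underscore.

```r
# Hypothetical wide table: one row per species, one column per measurement.
wide <- data.frame(
  SpecCode = c(1, 2),
  oxygen_min = c(2.0, 3.0),
  oxygen_max = c(6.0, 7.0),
  temp_min = c(4, 10)
)

value_cols <- setdiff(names(wide), "SpecCode")

# Stack the measurement columns into variable/value pairs.
long <- data.frame(
  SpecCode = rep(wide$SpecCode, times = length(value_cols)),
  variable = rep(value_cols, each = nrow(wide)),
  value = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Derive the category from the prefix of the variable name.
long$vrbl_category <- sub("_.*$", "", long$variable)
long
```

The long table has one row per species and variable, which is why it grows quickly and repeats the key column many times.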

My perspective stems from how I have organized most of my data sets. These would be large surveys with columns for species, location [could be several columns, e.g., region-stratum-lat-lon], date [several columns], weight, length, abundance, etc. I end up with a lot of repeat values (e.g., if multiple individuals caught in same place at same time, only a few columns change across those rows). I would look to rfishbase for additional information to merge() into these data sets to bolster information about the species as a whole; e.g., just like each species has a Genus and species, I would add a 'typical' value for other characteristics like its typical length, depth, oxygen, temp, latitude, ... etc. That would obviously require aggregation for some tables, but I wouldn't want to miss out on the full set of values either.
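A small sketch of that workflow, using invented survey and trait data: aggregate a trait table to one 'typical' value per species, then merge() it into the survey rows. All names and numbers here are placeholders, and the mean is just one possible choice of summary.

```r
# Invented survey data: several rows per species.
survey <- data.frame(
  species = c("red fish", "red fish", "blue fish"),
  weight = c(1.2, 0.9, 2.5),
  stringsAsFactors = FALSE
)

# Invented trait records: possibly several values per species.
traits <- data.frame(
  species = c("red fish", "red fish", "blue fish"),
  length = c(30, 34, 50),
  stringsAsFactors = FALSE
)

# Collapse to one typical value per species.
typical <- aggregate(length ~ species, data = traits, FUN = mean)
names(typical)[names(typical) == "length"] <- "typical_length"

# Attach the typical value to every survey row for that species.
enriched <- merge(survey, typical, by = "species", all.x = TRUE)
enriched
```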

I'm not sure what the "simple interface" vs "all possible data" options would look like. In general, I tend to like when objects have the simple stuff on the surface, with the details still present underneath. It says "I'm guessing you want this, but if not, attributes() or str() to see the rest; it's all there" (for examples of this that are very familiar to me, think of model output from functions like lm() or jags(); jags in particular).
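One way to sketch "simple on the surface, details underneath" in base R is to return a summary table and keep the full records as an attribute. The data and the attribute name all_values are made up for illustration; note that custom attributes may be dropped by some operations such as merging.

```r
# Invented full records: several values per species.
full <- data.frame(
  species = c("red fish", "red fish", "blue fish"),
  length = c(30, 34, 50),
  stringsAsFactors = FALSE
)

# Surface: one summary row per species.
summary_tbl <- aggregate(length ~ species, data = full, FUN = mean)

# Underneath: keep every original value reachable via attributes().
attr(summary_tbl, "all_values") <- full

summary_tbl
str(attr(summary_tbl, "all_values"))
```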

I have a feeling I might not be on the same page as you with this topic of organization, but hopefully we can achieve that after a little back-and-forth, if that's helpful to you. So let me know if some of this doesn't make sense, or point to a more specific example of formatting options to help get me on the same page. Or just wait for me to use the package more .... (again, sorry about that).

Thanks for the ongoing great work that you're doing.

jamiecmontgomery commented 9 years ago

For my work I would prefer one big table as well that I can then use to select species and attributes of interest that I can then join to my data (much like @rBatt said).

mlandis commented 9 years ago

I normally prefer to read the entire table once then slice it as needed afterwards, which can sometimes be done with just two commands. Reading small tables then joining needs about one read statement and one join statement per table. So from that, I like Option 2 just because it lends itself to cleaner and simpler code. I don't know what design considerations you're balancing, so maybe there's some performance hit for passing around large tables.

Is Option 3 getting at the idea of handling tables like

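# make_table() is a proposed function, not an existing one
join_us <- list(species, ecology)  # table functions to combine
my_fields <- list(c("SpecCode", "Vulnerability"), c("SpecCode", "FoodTroph"))  # columns wanted from each table
my_fish <- c("red fish", "blue fish")  # placeholder species names
d <- make_table(table_list = join_us, species_names = my_fish, fields = my_fields)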

where d would join the result from calling species and ecology in a standard way?
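For what it's worth, here is a minimal sketch of how such a make_table() could work, assuming every table shares a SpecCode key. The toy_species() and toy_ecology() functions are hypothetical stand-ins that filter invented local data instead of querying FishBase, and the `key` argument is an addition not present in the snippet above.

```r
# Invented local data standing in for the real tables.
species_data <- data.frame(
  Species = c("red fish", "blue fish", "one fish"),
  SpecCode = c(1, 2, 3),
  Vulnerability = c(10, 20, 30),
  stringsAsFactors = FALSE
)
ecology_data <- data.frame(
  Species = c("red fish", "blue fish", "one fish"),
  SpecCode = c(1, 2, 3),
  FoodTroph = c(3.5, 4.0, 2.8),
  stringsAsFactors = FALSE
)

# Hypothetical table functions: return the rows for the requested species.
toy_species <- function(species_names) {
  species_data[species_data$Species %in% species_names, ]
}
toy_ecology <- function(species_names) {
  ecology_data[ecology_data$Species %in% species_names, ]
}

# Call each table function, keep the requested fields plus the key,
# then join all the pieces on the key.
make_table <- function(table_list, species_names, fields, key = "SpecCode") {
  stopifnot(length(table_list) == length(fields))
  pieces <- Map(
    function(table_fun, cols) {
      table_fun(species_names)[, union(key, cols), drop = FALSE]
    },
    table_list, fields
  )
  Reduce(function(x, y) merge(x, y, by = key), pieces)
}

d <- make_table(
  table_list = list(toy_species, toy_ecology),
  species_names = c("red fish", "blue fish"),
  fields = list(c("SpecCode", "Vulnerability"), c("SpecCode", "FoodTroph"))
)
d
```

Using Reduce() with merge() means the same function handles two tables or ten, at the cost of an inner join that silently drops species missing from any one table.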

On a side note, I wasn't aware of species_fields until I wrote this snippet. Very handy! It looks like there isn't an equivalent ecology_fields, so maybe expanding these xxx_fields shortcuts for other table types would help new rfishbase users (like me) orient themselves.