ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML

By default give back taxa as a column in character data.frame #89

Closed sckott closed 10 years ago

sckott commented 10 years ago

In this example I wonder if it makes sense to return by default the taxa labels as a column in the data.frame, perhaps the first column. Not a huge deal to move row names to a column, but a bit easier perhaps for downstream use? Maybe not, not sure

comp_analysis <- system.file("examples", "comp_analysis.xml", package="RNeXML")
nex <- nexml_read(comp_analysis)
get_characters(nex)
         log snout-vent length reef-dwelling
taxon_8             -3.2777799             0
taxon_9              2.0959433             1
taxon_10             3.1373971             0
taxon_1              4.7532824             1
taxon_2             -2.7624146             0
taxon_3              2.1049413             0
taxon_4             -4.9504770             0
taxon_5              1.2714718             1
taxon_6              6.2593966             1
taxon_7              0.9099634             1
str(get_characters(nex))
'data.frame':   10 obs. of  2 variables:
 $ log snout-vent length: num  -3.28 2.1 3.14 4.75 -2.76 ...
 $ reef-dwelling        : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 2 2 2

cboettig commented 10 years ago

Yeah, that's a great question.

On one hand, row-names are handy for having very semantic subsetting, e.g.

chars["taxon_5", "reef-dwelling"]

Is rather nicer than

chars[chars[["taxa"]] == "taxon_5", "reef-dwelling"]

On the other hand, like you say, most interesting uses (e.g. ggplot / dplyr manipulations) can't operate on row-names directly and need those names as a column. I also think some R functions can actually cause row labels to be dropped.
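For reference, moving row names into a column takes a line or two of base R. The following is a minimal sketch on a toy data.frame standing in for the `get_characters()` output; the column names `svl`, `reef`, and `taxa` and the values are illustrative assumptions, not anything RNeXML produces.

```r
# Toy stand-in for the get_characters() output (illustrative values only)
chars <- data.frame(
  svl  = c(-3.28, 2.10, 3.14),
  reef = factor(c(0, 1, 0)),
  row.names = c("taxon_8", "taxon_9", "taxon_10")
)

# Promote the row names to an explicit first column and reset the row names
chars_col <- data.frame(
  taxa = rownames(chars),
  chars,
  row.names = NULL,
  stringsAsFactors = FALSE
)

# The column can now be used like any other variable
chars_col[chars_col$taxa == "taxon_9", "reef"]
```

After this step the taxa labels are an ordinary variable, which is the form that ggplot2 or dplyr style workflows expect.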

However, in this case the motivation comes from what I think is most common in the R phylogenetics community. The most popular R packages for handling trait data (like geiger) have the terrible convention of using matrices instead of data.frames as the data object, which forces them to indicate taxa as row names rather than as a column (since the matrices need to be numeric for all the continuous trait values). By storing taxa labels as row names the way I do here, coercing this "character matrix" data.frame into a matrix object for use in those functions automatically produces what those functions expect.
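The coercion point can be illustrated with a small sketch. It assumes a toy data.frame with only continuous traits; the names `traits`, `svl`, and `mass` are made up for the example.

```r
# Toy data.frame of continuous traits with taxa stored as row names
traits <- data.frame(
  svl  = c(-3.28, 2.10, 3.14),
  mass = c(1.2, 0.8, 2.5),
  row.names = c("taxon_8", "taxon_9", "taxon_10")
)

# as.matrix() carries the row names over, and the result stays numeric
m <- as.matrix(traits)
is.numeric(m)
rownames(m)

# If the taxa were a character column instead, the whole matrix
# would be coerced to character
m_chr <- as.matrix(data.frame(taxa = rownames(traits), traits))
is.numeric(m_chr)
```

Note that a factor column such as `reef-dwelling` would have the same effect as the character column here, so numeric-only functions would still need the continuous columns selected first.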

In general though, I agree that including 'data' in row-names (e.g. using row-names at all) is probably a bad idea. Wonder if @hadley has a take on that. (I also think many R users would be so much happier if they never learned to use matrix).

hadley commented 10 years ago

IMO, you should never use row names on a data frame (although they do make sense for matrices). Both plyr and dplyr drop row names.

sckott commented 10 years ago

@hadley brings up a good point that dplyr and plyr drop them

I think some of the downstream packages folks will require the species names to be row names, but that's easy enough to do

cboettig commented 10 years ago

@hadley Excellent, I agree. Do you have this carved on a stone tablet somewhere we can wave around?

@sckott Right, in this case we're mostly going with what works best for the downstream functions from the standard packages (which will take a data.frame with row.names but not without it). Perhaps we could add an option like taxa_as_column for the sensible folk that would prefer that...

sckott commented 10 years ago

Most downstream phylogenetic pkgs require datasets with taxa as row names?

hadley commented 10 years ago

@cboettig the tidy data paper is probably the best ref for my thinking, but it doesn't include anything explicitly on row names.

rvosa commented 10 years ago

I think it makes sense to retain row names, especially because in NeXML row names (i.e. row's 'label' attribute) are distinct from taxon names (i.e. otu's 'label'). You can imagine that row names could be things such as FASTA definition lines or some other descriptive name that applies specifically to character data, which users may want to distinguish from proper, 'normalized' taxonomic names.

hadley commented 10 years ago

@rvosa I would still argue that those shouldn't be row names, but explicit variables. One problem with row names is that they don't have their own name/label, so you'd need to look up their meaning in the docs.

cboettig commented 10 years ago

@rvosa Note that the data is there either way; this is just a peculiar aspect of R, whereby the data can be stored in "row names" rather than in, say, the first column. At an abstract level this is irrelevant, as a user can go between the formats without losing information, but as a practical matter it impacts the syntax.

I've gone with this otherwise distasteful option because 90% of users will be using this feature for the purpose of getting data into and out of the geiger package, which for historical reasons uses row-names rather than a unique column to contain the data. (Other phylogenetics packages also tend to adopt this convention). Thus a novice user can do:

chars <- get_characters(nex)
trees <- get_trees(nex)
geiger::fitContinuous(trees, chars)    # use function from geiger

(See the end of our README for an example of this.) This just gives users a more compact workflow than first having to add rownames and then drop a column:

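rownames(chars) <- chars[[1]]   # taxa labels from the first column become row names
chars <- chars[-1]              # drop that first column, keeping a data.frame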

Users wanting to promote these rownames to a bona fide column can of course do it manually, with a line or so of code, but I'm adding an option to toggle this:

chars <- get_characters(nex, rownames_as_col=TRUE)

Not sure if that's a good name for the argument (a bit verbose).

sckott commented 10 years ago

@cboettig rownames_as_col is somewhat long, but seems fine to me

I agree that if most users are used to having taxa as row.names on their data frames for phylogenetic work then we should go with that.

rvosa commented 10 years ago

Does R expect row.names to be unique? Because we can't guarantee that for the 'label' attribute anyway (uniqueness is required of 'id', but not of 'label').

sckott commented 10 years ago

@rvosa Yes, row names have to be unique.
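
A quick sketch of what this means for non-unique labels, using made-up labels: assigning duplicated row names to a data.frame fails, and base R's `make.unique()` is one possible workaround (an assumption for illustration, not something RNeXML is stated to do).

```r
# Made-up labels with a duplicate, as could happen with NeXML 'label' attributes
labels <- c("Homo sapiens", "Homo sapiens", "Pan troglodytes")
df <- data.frame(x = 1:3)

# Assigning duplicated row names to a data.frame raises an error
res <- tryCatch({
  rownames(df) <- labels
  "ok"
}, error = function(e) "error")
res

# One workaround: disambiguate the labels before assigning them
rownames(df) <- make.unique(labels)
rownames(df)
```

Keeping the labels in an ordinary column avoids the problem entirely, since columns may contain repeated values.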