tidyverse / tibble

A modern re-imagining of the data frame
https://tibble.tidyverse.org/

no rownames? #272

Closed ghost closed 7 years ago

ghost commented 7 years ago

I'm not sure if you have received a lot of feedback for tibble, but I can't fathom why it doesn't support rownames. After searching around, I can't tell if there is a good reason for this, aside from what seems to be a dogmatic and prescriptive view of data analysis. In the tibble vignette, it says:

"It never uses row.names(). The whole point of tidy data is to store variables in a consistent way. So it never stores a variable as special attribute."

And in the documentation, it says:

"Generally, it is best to avoid row names, because they are basically a character column with different semantics to every other column."

But real world datasets do have special variables that you don't want to mix with your data - the names of variables and observations! And you generally want these to have different "semantics" from the rest of your data. Following your logic, why not also drop the variable names from a tibble object and store these as rows? From my perspective, your view is aesthetically pleasing (in an OCD kind of way), but not super realistic for real world data analysis.

Granted, removing the rownames is easy enough:

data %>% select(-rownames)

But isn't this a pretty verbose way of just doing this?

data

The tibble version is implicit and inelegant, while the normal R version is (by contrast) explicit and elegant. In fact, having rownames is a huge convenience because: 1) You don't need to remove them before working with your data 2) You don't need to keep track of them when you subset your matrix/data.frame

For example, to the best of my knowledge, the tibble code to subset a matrix while preserving the rownames is something like this:

data %>% filter(data %>% select(-rownames) %>% rowMeans > 0)

Can you honestly tell me what is going on in this code? It is cumbersome, ugly, and virtually unreadable. In case you couldn't guess, the normal R code is:

data[rowMeans(data) > 0,]

Or even:

i = rowMeans(data) > 0
data[i,]

And I would argue this is in fact elegant and easy to understand.
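As a minimal sketch of the base R behavior described above, assuming a small made-up numeric matrix (the values and the names `s1` to `s3`, `a`, `b` are hypothetical, not from the thread), subsetting with a logical row index carries the row names along:

```r
# Hypothetical 3 x 2 matrix; filled column by column.
data <- matrix(
  c(1, -2, 3,   # column a
    4, -5, 0),  # column b
  nrow = 3,
  dimnames = list(c("s1", "s2", "s3"), c("a", "b"))
)

# Logical index: TRUE for rows whose mean is positive.
keep <- rowMeans(data) > 0

# drop = FALSE keeps the result a matrix even if only one row survives.
subset_data <- data[keep, , drop = FALSE]

rownames(subset_data)
```

Here rows `s1` and `s3` have positive means, so those two row names are what remain.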

Hadley has introduced some awesome things to the R universe (ggplot2, %>%, tidyr) and I really appreciate his contributions to data analysis and visualization. But I think tibble, and to a lesser extent, some features of dplyr, miss the mark. In the end, I don't want to type 5 lines of code to do something that is normally a perfectly readable one-liner in R.

krlmlr commented 7 years ago

If the data are in fact a matrix of values, a "long form" might be easier to handle in the tidy data framework:

data_long <-
  data %>%
  gather(key, value, -rownames)

data_long %>%
  group_by(rownames) %>%
  filter(mean(value) > 0) %>%
  ungroup

Row names and key columns achieve the same goal, but the latter give a much simpler data structure. There's no notion of a "row name" in database management systems or, say, CSV or Excel files. Avoiding row names in favor of key columns leads to simpler code and simpler data with better compatibility to other formats.
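The key-column approach can be sketched end to end as follows. This assumes a small made-up matrix (values and names are hypothetical) and uses `tibble::rownames_to_column()` to turn the row names into the `rownames` column that the pipe above expects; it is one possible setup, not necessarily the one the commenters had in mind.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical matrix with row names.
m <- matrix(c(1, -2, 3, 4, -5, 0), nrow = 3,
            dimnames = list(c("s1", "s2", "s3"), c("a", "b")))

# Move the row names into an ordinary key column.
data <- m %>%
  as.data.frame() %>%
  rownames_to_column("rownames")

# Long form: one row per (rownames, key) pair.
data_long <- data %>%
  gather(key, value, -rownames)

# Keep only the groups (original rows) whose mean is positive.
kept <- data_long %>%
  group_by(rownames) %>%
  filter(mean(value) > 0) %>%
  ungroup()

unique(kept$rownames)
```

In current tidyr, `pivot_longer()` is the recommended replacement for `gather()`; `gather()` is used here only to match the thread.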

ghost commented 7 years ago

But this is exactly my point. Isn't it strange that in tidyverse, the best way to work with matrices is to use "long" format, which is completely unsupported by the statistical community? It's not like matrices are seldom used - they are arguably the most used data structure in all of R. Shouldn't tidyverse provide a natural and elegant way to work with them?

And why doesn't tibble support rownames? If they are difficult to code and you are saving yourselves the work, then I understand. But if you actually believe it's more convenient to not have rownames, then I strongly disagree and I can provide you with several counterexamples where tibble makes for ugly and unintuitive code. Or maybe tibble was only written for simple tasks and you have ceded more complex operations to data.table. In any case, I believe that if you solicit feedback from the statistical community, many many many people would agree, although at this point it's just N = 1 :)

P.S. Even though I'm complaining, I hope you realize that I'm only providing feedback in the hope of making tibble better. You have contributed a ton to the scientific community and I am super grateful for the work you have put into every project!

krlmlr commented 7 years ago

Have a look at tidytext -- even if other data formats might be prevalent in the text mining community, the transformation to and from a "tidy" data format allows using a small number of consistent, well-defined, general purpose tools to solve most problems. Perhaps @JuliaSilge can comment on similar discussions there? Matrices are no different. Have you seen tbl_cube in dplyr?

Row names are unnecessary in the tidy data framework, and they do complicate the implementation of dplyr & co. The rowMeans() example you gave is a specific function with a narrow scope; the dplyr solution may be a little longer, but it's easy to understand if you know what the basic verbs do. On the other hand, I'd have to look up the documentation for rowMeans().

ghost commented 7 years ago

I looked at tidytext and tbl_cube, but I'm bewildered that these are considered "good" alternatives to matrix format (which again, is perhaps the most widely used data format in existence). There isn't a single statistical package that will accept these formats. Also, since most matrix operations are on rows or columns, doesn't it make sense to store your data as rows and columns?

For the rowMeans() example, I am surprised you think the tidyverse solution is better. While I agree that this is just a "specific function with a narrow scope," the example generalizes to any time you want to subset a matrix based on its values while keeping track of the rownames -- not exactly a rare use case in data science. However, I'm beginning to realize that I'm just not the target user for tidyverse, which would explain my overall befuddlement with the package.

In any case, thanks for taking the time to answer my questions. And I hope you will find the time to look up the rowMeans() documentation when you get a chance - it's a useful function! ;)

krlmlr commented 7 years ago

In your example:

These are some artificial extensions (because you haven't shared your application) which are relatively easy to achieve within the tidy framework, by simply adding a few more steps to the pipe I've shown before. Of course the same can be achieved with a matrix format, but I'd argue that this code would be more complicated and error-prone.

Ultimately, everything depends on the application. In my experience, the tidyverse comes closest to a "one size fits all" solution. In some cases this requires reorganizing your data.

ghost commented 7 years ago

This is base R so I'm assuming you know the solutions to your questions, but here it goes. If the function does not support na.rm, you can do this:

# drop rows containing NA first, so the index matches the rows it is applied to
clean = na.omit(data)
i = rowMeans(clean) > 0
clean[i,]

Or with magrittr:

# same idea: index the NA-free copy, not the original
clean = data %>% na.omit()
i = clean %>% rowMeans() > 0
clean[i,]

If you want to do something else on the rows, it is also fairly straightforward...

i = apply(data, 1, function(x){
    some_function(x) > 0
})
data[i,]

Or with magrittr:

i = apply(data, 1, function(x){
    x %>% some_function() > 0
})
data[i,]

To arrange the values by ascending order:

i = order(rowMeans(data))
data[i,]

Or with magrittr:

i = data %>% rowMeans() %>% order()
data[i,]

Many ways to calculate statistics on the rows. Not 100% sure what you are getting at:

# new column
data$SD = apply(data, 1, sd)

# new matrix
stats = apply(data, 1, function(x){
    c(mean(x), sd(x), quantile(x))
})

# bind matrix
data = cbind(data, t(stats))

# new list
quantiles = lapply(data, quantile)

All of these options are straightforward and elegant. With matrix format, it is very easy to do something like this (how would you do this with other formats?)

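# t() returns a matrix, so convert back to a data frame to iterate over the original rows
quantiles = lapply(as.data.frame(t(data)), quantile)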

The nice thing is these functions are all base R, so anyone can understand them. I didn't have to keep removing and re-attaching my rownames, or keep track of what happened to them. My guess is that with tibble these tasks are much more complicated and more difficult to understand - but I would be interested to see if you can prove me wrong.
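To make the row-wise statistics above concrete, here is a small base R sketch on a made-up matrix (values and names are hypothetical). It shows that `apply()` over rows returns one column per input row, so the result is transposed before binding, and that the row names survive without any extra bookkeeping:

```r
# Hypothetical matrix with row names.
data <- matrix(c(1, -2, 3, 4, -5, 0), nrow = 3,
               dimnames = list(c("s1", "s2", "s3"), c("a", "b")))

# apply(..., 1, f) returns a 2 x 3 matrix here (one column per input row),
# so transpose it to get one row of statistics per original row.
stats <- t(apply(data, 1, function(x) c(mean = mean(x), sd = sd(x))))

# Bind the statistics next to the original values; row names are kept.
combined <- cbind(data, stats)

rownames(combined)
colnames(combined)
```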

I'm only trying to provide you with some fairly common examples where not having rownames is a headache. Please consider this constructive criticism.

juliasilge commented 7 years ago

I am getting the idea from this thread that using tidy data principles doesn't appeal to you, @adnbps, and that is fine for you! For many R users, especially for users who come new to R, the opposite is true, to an extreme degree; the consistent philosophy of tidyverse packages makes for a successful analysis process that promotes good practices. I know in my own regular day-job work where I often am faced with diverse real world datasets, having a toolbox like tidyverse packages allows me to quickly get to analysis solutions. When dealing with text, sometimes this involves casting to a matrix for certain machine learning applications; I don't think most of us who work with tidy data principles think matrices are bad or want to get rid of them. We just aren't in the business of building tools to deal with matrices.

You use the words "dogmatic and prescriptive" above, but another way to think about that might be "consistent and opinionated". You also say that you think tidyr and ggplot2 are great, but the reason they are such useful tools is because they embrace a consistent API for tidy data, not in spite of it.

If you don't believe all this about others' experiences (especially learning experiences), or maybe think the particulars of these tidy data principles aren't the best way, I expect everyone involved would regret our disagreement but think that is fine. People are different and disagree. Not having row.names on data frames is actually part of the consistent approach here, though; it's one of the features that makes it all work.

krlmlr commented 7 years ago

Thanks @juliasilge, I couldn't have said it better! Just wanted to add the "exercise solutions" as originally intended in tidyverse code. I wanted to demonstrate that all these solutions follow a very similar pattern with only minimal changes to the original code, using basic dplyr verbs only.

  1. Filtering individual NA values:

    data_long %>%
      group_by(rownames) %>%
      filter(!is.na(value)) %>%
      filter(mean(value) > 0) %>%
      ungroup

  2. Filtering outliers (3% from each end):

    data_long %>%
      group_by(rownames) %>%
      filter(mean(value, trim = 0.03) > 0) %>%
      ungroup

  3. Statistics:

    data_long %>%
      group_by(rownames) %>%
      summarize(sd(value), quantile(value, 0.25), quantile(value, 0.75)) %>%
      ungroup

  4. Sorting:

    data_long %>%
      group_by(rownames) %>%
      arrange(value)

ghost commented 7 years ago

@krlmlr Those solutions are fine, but of course, missing the conversions between wide and long format that would be necessary when switching between matrix and tidy operations.
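For reference, the wide-to-long-and-back conversion mentioned here could look roughly like the following. This is only a sketch on a made-up matrix (values and names are hypothetical), using `rownames_to_column()` and `column_to_rownames()` from tibble together with `gather()` and `spread()` from tidyr; other routes exist.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical matrix with row names.
m <- matrix(c(1, -2, 3, 4, -5, 0), nrow = 3,
            dimnames = list(c("s1", "s2", "s3"), c("a", "b")))

# Wide matrix -> long data frame with a key column for the row names.
data_long <- m %>%
  as.data.frame() %>%
  rownames_to_column("rownames") %>%
  gather(key, value, -rownames)

# Long data frame -> wide matrix, restoring the row names at the end.
m_back <- data_long %>%
  spread(key, value) %>%
  as.data.frame() %>%
  column_to_rownames("rownames") %>%
  as.matrix()

m_back
```

The two conversion steps add a few lines at each boundary between tidy and matrix code, which is the overhead being discussed.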

@juliasilge I'm not arguing against the entire tidyverse - just the decision to not support rownames :) Your answer did raise some points that caught my attention:

"We just aren't in the business of building tools to deal with matrices."

This is perfectly fine, although with a small change it seems you could support matrices, and therefore reach a much larger fraction of the scientific community.

"Not having row.names on data frames is actually part of the consistent approach here, though; it's one of the features that makes it all work."

I am a little confused. Why is it necessary to remove rownames to "make it all work?" Couldn't tibble support rownames and still be consistent? As I pointed out earlier, tibble already supports column names! It seems that tibble works despite not having rownames - not because of it.

And again, are there real-world examples or subtle implementation details that would show why it's better to not have rownames? I have provided several counterexamples to the opposite effect. Why not add support for rownames, then let the filter() function act the same as select()? In fact, this seems a little more consistent to me.

juliasilge commented 7 years ago

If you want just a literal real-world example, there is an implementation in R of calling the US Census API that uses row names to keep track of what kind of values are being stored in each row. I take a firm stance against bashing other developers' work, but this is really awful to work with; every time I need to aggregate, choose certain kinds of rows, or do any basic data manipulation, it is difficult to get at that information. That information belongs in a column (not row names) because it is part of the "observation", at least from the perspective of tidy data principles. In this particular instance, I am really super happy to see the work being developed by Kyle Walker on tidycensus.

That's just one example, but in my experience with real-world datasets, it generalizes. The information in row names is telling you something about the observation so it belongs in a column. Also, rownames in R can't contain duplicates, but for tidy data sets and tidy tools, you usually want to move toward the long, skinny format where you will have duplicates.

library(tidyr)
library(dplyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

messy %>%
  gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50

ghost commented 7 years ago

@juliasilge In your example, the dataset has actual data in the rownames. Of course this is confusing! But in 99% of cases, the row names are the sample names and the column names are the observation names. Everything else is just a special case. While special cases can often be handled by dropping the rownames (as Tibble does), this shouldn't be the default behavior! What about the 99% of users that just want to do normal matrix operations while keeping track of their rows? You may argue that Tibble is not written for these people - and that's fine, but just know it is a large fraction of total users. Please remember that I am not trying to disparage Tibble at all. Instead, I am trying to give the developers an alternative perspective that perhaps they haven't considered before. It seems like they have considered these points and it is simply a difference in opinion, which is fine.

P.S. You can easily add support for rownames while keeping the rest of Tibble 100% the same. This means Tibble would work exactly the same for you, but it would also work for users who use matrices (again, this is not a "niche" group of people!). This is how base R works, it's how data.table works, it's how Pandas works, it's how Mathematica works, it's how TensorFlow works, it's how Julia works... Did all of these fully developed tools get it wrong?

hadley commented 7 years ago

There are no plans to add support for rownames to tibble at this time. This is because:

Regardless, this feature is not up for discussion, so I don't think it is productive to continue discussion.

krlmlr commented 7 years ago

I just discovered widyr which offers, among other things, an easy way to temporarily transform a tidy dataset into matrix form (with row names!) for use with a function that expects a matrix. Haven't tried it yet, though.

ghost commented 7 years ago

Hadley, thanks for reading! Your "strong opinion" contradicts empirical evidence from many diverse fields - biology, economics, machine learning, time series analysis, linear algebra, physics. As long as you are aware of this, mission accomplished.