textstat_readability() output does not work correctly with dplyr::mutate() #50

Closed antonmalko closed 2 years ago

antonmalko commented 2 years ago


While working with someone's code, I stumbled across the following behaviour. Suppose I run textstat_readability() on a number of texts, and then try to add a column to the resulting object using dplyr::mutate(). The output will only contain C+1 rows, where C is the number of columns in the textstat_readability() output. Please see the minimal working example below.

As far as I can see, ultimately this happens because of how selection with [] works. dplyr::mutate() calls a dplyr_col_select() function, which uses the following expression to select which columns should be kept: .data[loc] (.data is the input object, and loc is a vector of integers). This is where the problem arises: if .data is of class data.frame, loc gets interpreted as column indices, and all works correctly. However, if .data is the output from the textstat_readability() function, loc gets interpreted as the row indices, and only the few initial rows are returned.
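The difference in `[` semantics described above can be reproduced in base R alone. The following is a minimal sketch, not quanteda's actual code: the class name `toy_rows` and its `[` method are hypothetical, and the method simply treats a single index as row positions, which is the behaviour reported here.

```r
# Toy data: 6 rows, 3 columns (a stand-in for readability output)
df <- data.frame(document = paste0("doc", 1:6),
                 FOG      = c(32.9, 22.1, 32.9, 23.0, 25.7, 29.6),
                 FOG.PSK  = c(18.7, 11.2, 18.8, 12.9, 14.5, 16.9))

# Plain data.frame: a single index selects columns
dim(df[1:2])
#> [1] 6 2

# Hypothetical subclass whose `[` method treats a single index as rows
toy <- structure(df, class = c("toy_rows", "data.frame"))
`[.toy_rows` <- function(x, i, ...) {
  as.data.frame(x)[i, , drop = FALSE]
}

# Asking for "columns" 1:4 (as mutate() would after adding a 4th column)
# now returns the first 4 rows instead
dim(toy[1:4])
#> [1] 4 3
```

With three original columns plus one new column, `loc` would be `1:4`, which matches the four rows seen in the output below.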

(I am not sure whether the [] behaviour with textstat_readability() output is defined by quanteda, so maybe this issue would be better addressed at some other level...)

Thank you!

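library(quanteda)           # provides data_corpus_inaugural
library(quanteda.textstat)  # provides textstat_readability()
library(dplyr)              # provides %>% and mutate()

dat <- data_corpus_inaugural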

read <- textstat_readability(dat, measure = c("FOG", "FOG.PSK"))

# Only leave the initial 6 documents to reduce the size of correct output
read <- head(read)

# Wrong: returns only 4 rows, while it should return 6
read %>%
  dplyr::mutate(new_column = 1)
#>          document      FOG  FOG.PSK new_column
#> 1 1789-Washington 32.90932 18.71394          1
#> 2 1793-Washington 22.09259 11.19467          1
#> 3      1797-Adams 32.91611 18.80304          1
#> 4  1801-Jefferson 23.02834 12.86614          1

# Correct: returns all 6 rows
read %>%
  as.data.frame() %>%
  dplyr::mutate(new_column = 1)
#>          document      FOG  FOG.PSK new_column
#> 1 1789-Washington 32.90932 18.71394          1
#> 2 1793-Washington 22.09259 11.19467          1
#> 3      1797-Adams 32.91611 18.80304          1
#> 4  1801-Jefferson 23.02834 12.86614          1
#> 5  1805-Jefferson 25.69839 14.53641          1
#> 6    1809-Madison 29.62381 16.85274          1

kbenoit commented 2 years ago

Thanks for filing this. It looks like dplyr::mutate() truncates the over-classed data.frame to a number of rows equal to the number of columns, regardless of how many rows it actually contains. As you've discovered, one workaround is to coerce it to a plain data.frame using as.data.frame(). I need to dig deeper to see what in dplyr is causing this.

dat <- data_corpus_inaugural

tstat <- textstat_readability(data_corpus_inaugural) %>%
    head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#>          document    Flesch new_column
#> 1 1789-Washington  2.034033          1
#> 2 1793-Washington 32.205417          1
#> 3      1797-Adams  0.900731          1

tstat <- textstat_readability(data_corpus_inaugural, 
                              measure = c("FOG", "FOG.PSK")) %>%
    head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#>          document      FOG  FOG.PSK new_column
#> 1 1789-Washington 32.90932 18.71394          1
#> 2 1793-Washington 22.09259 11.19467          1
#> 3      1797-Adams 32.91611 18.80304          1
#> 4  1801-Jefferson 23.02834 12.86614          1

tstat <- textstat_readability(data_corpus_inaugural, 
                              measure = c("Flesch", "FOG", "FOG.PSK")) %>%
    head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#>          document    Flesch      FOG  FOG.PSK new_column
#> 1 1789-Washington  2.034033 32.90932 18.71394          1
#> 2 1793-Washington 32.205417 22.09259 11.19467          1
#> 3      1797-Adams  0.900731 32.91611 18.80304          1
#> 4  1801-Jefferson 31.874466 23.02834 12.86614          1
#> 5  1805-Jefferson 23.385207 25.69839 14.53641          1

# but solved without the additional class labels
as.data.frame(tstat) %>%
    dplyr::mutate(new_column = 1)
#>           document    Flesch      FOG   FOG.PSK new_column
#> 1  1789-Washington  2.034033 32.90932 18.713936          1
#> 2  1793-Washington 32.205417 22.09259 11.194674          1
#> 3       1797-Adams  0.900731 32.91611 18.803043          1
#> 4   1801-Jefferson 31.874466 23.02834 12.866143          1
#> 5   1805-Jefferson 23.385207 25.69839 14.536413          1
#> 6     1809-Madison 11.582181 29.62381 16.852745          1
#> 7     1813-Madison 31.125809 21.78036 11.620805          1
#> 8      1817-Monroe 40.980057 18.09412  9.215629          1
#> 9      1821-Monroe 36.983083 20.47096 10.869685          1
#> 10      1825-Adams 23.604563 23.73198 12.563136          1

class(tstat)
#> [1] "readability" "textstat"    "data.frame"
class(as.data.frame(tstat))
#> [1] "data.frame"

antonmalko commented 2 years ago

I dug around this a little bit more, here are some thoughts...

As you mention, textstat_readability() output has classes "readability" "textstat" "data.frame". It looks like there is no [.readability function, so [.textstat is called. As far as I can see, ultimately it just converts the input to a dataframe and subsets it in the normal way. Before that it handles the missing arguments, and I think this is where the difference in [ behaviour between data.frames and textstat objects arises.
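For readers less familiar with S3, the fall-through from "readability" to "textstat" can be illustrated with a toy generic. This is only a sketch: `describe()` and its methods are hypothetical names made up for the example, and the object is an ordinary data.frame with the same class vector attached by hand.

```r
# Hypothetical generic and methods, for illustration only
describe <- function(x, ...) UseMethod("describe")
describe.textstat   <- function(x, ...) "textstat method"
describe.data.frame <- function(x, ...) "data.frame method"

obj <- structure(data.frame(a = 1:3),
                 class = c("readability", "textstat", "data.frame"))

# There is no describe.readability, so dispatch moves along the class
# vector and picks the first method it finds: describe.textstat
describe(obj)
#> [1] "textstat method"
```

The same lookup order applies to `[`: with no `[.readability` defined, a `[.textstat` method is used before `[.data.frame` is ever reached.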

If the goal is to make subsetting identical for data.frames and textstat objects, would it be an option to define [.textstat like this instead?

`[.textstat` <- 
  function (x, i, j, ...) 

I have never played much with S3 and generics, so I don't know whether this approach has pitfalls or unintended consequences somewhere down the line, but here is a little demo. (If you redefine the function like this in the environment, dplyr::mutate() won't use it for some reason and its output will not change. But I tried changing the source for quanteda.textstat and recompiling, and then mutate() was producing the desired output.)

dat <- data_corpus_inaugural

tstat <- textstat_readability(data_corpus_inaugural) %>%
  head(n = 6)

# This will return the first row
#>          document   Flesch
#> 1 1789-Washington 2.034033

# This will return the first column
#>          document
#> 1 1789-Washington
#> 2 1793-Washington
#> 3      1797-Adams
#> 4  1801-Jefferson
#> 5  1805-Jefferson
#> 6    1809-Madison

# Redefine the subsetting function
`[.textstat` <- 
  function (x, i, j, ...) 

# Now this will return the first column, as it does for a data.frame
#>          document
#> 1 1789-Washington
#> 2 1793-Washington
#> 3      1797-Adams
#> 4  1801-Jefferson
#> 5  1805-Jefferson
#> 6    1809-Madison

kbenoit commented 2 years ago

I fixed this by removing the [.textstat function altogether - see 0.95.1. It might have been needed in some previous version of R, but it's not needed now. This fixes the issue.
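A small base R sketch of why removing the method is enough, assuming an object that merely carries the same class vector (it is built by hand here, not by textstat_readability()): with no `[` method defined for the extra classes, dispatch reaches `[.data.frame`, which treats a single index as columns and keeps the class attribute.

```r
# Hand-built stand-in with the same class vector; no `[.textstat` defined
obj <- structure(
  data.frame(document = c("a", "b", "c"),
             Flesch   = c(2.0, 32.2, 0.9)),
  class = c("readability", "textstat", "data.frame")
)

# A single index now selects columns, as for any data.frame
first_col <- obj[1]
dim(first_col)
#> [1] 3 1

# The extra classes are kept by `[.data.frame`
inherits(first_col, "textstat")
#> [1] TRUE
```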