quanteda / quanteda.textstats

textstat_readability() output does not work correctly with dplyr::mutate() #50

Closed: antonmalko closed this issue 2 years ago

antonmalko commented 2 years ago

Hi,

While working with someone's code, I stumbled across the following behaviour. Suppose I run textstat_readability() on a number of texts, and then try to add a column to the resulting object using dplyr::mutate(). The output will only contain C+1 rows, where C is the number of columns in the textstat_readability() output. Please see the minimal working example below.

As far as I can see, ultimately this happens because of how selection with [] works. dplyr::mutate() calls a dplyr_col_select() function, which uses the following expression to select which columns should be kept: .data[loc] (.data is the input object, and loc is a vector of integers). This is where the problem arises: if .data is of class data.frame, loc gets interpreted as column indices, and all works correctly. However, if .data is the output from the textstat_readability() function, loc gets interpreted as the row indices, and only the few initial rows are returned.
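To make the row-versus-column difference concrete, here is a minimal base-R sketch that needs neither quanteda nor dplyr. The class name `rowfirst`, the helper `select_cols()`, and the toy values are all assumptions made up for illustration. `select_cols()` only imitates the `.data[loc]` expression described above and is not dplyr's actual code, and `[.rowfirst` is a guess at the general shape of a method that treats a lone index as rows, not the package's real implementation.

```r
# Toy data standing in for readability output (values are made up)
df <- data.frame(document = paste0("doc", 1:6), FOG = 1:6, FOG.PSK = 6:1)

# Hypothetical stand-in for the column-selection step: .data[loc]
select_cols <- function(.data, loc) .data[loc]

# Hypothetical class whose `[` method treats a lone index as row positions
`[.rowfirst` <- function(x, i, j, ...) {
  x <- as.data.frame(x)  # drops the extra class, leaving a plain data.frame
  if (missing(j)) x[i, , drop = FALSE] else x[i, j, drop = FALSE]
}
over <- structure(df, class = c("rowfirst", "data.frame"))

# Plain data.frame: loc selects columns, so all 6 rows survive
plain <- select_cols(df, 1:3)

# Over-classed object: the same loc is read as row positions
truncated <- select_cols(over, 1:3)

dim(plain)      # 6 rows, 3 columns
dim(truncated)  # 3 rows, 3 columns
```

With the plain data.frame the selection keeps every row, while the over-classed object comes back with only as many rows as there were column positions in `loc`, which matches the truncation pattern in the example below.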

(I am not sure whether the [] behaviour with textstat_readability() output is defined by quanteda, so maybe this issue would be better addressed at some other level...)

Thank you!


library(quanteda)
#> Warning: package 'quanteda' was built under R version 4.0.5
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dat <- data_corpus_inaugural

read <- textstat_readability(dat, measure = c("FOG", "FOG.PSK"))

# Only leave the initial 6 documents to reduce the size of correct output
read <- head(read)

# Wrong: returns only 4 rows, while it should return 6
read %>%
  dplyr::mutate(new_column = 1)
#>          document      FOG  FOG.PSK new_column
#> 1 1789-Washington 32.90932 18.71394          1
#> 2 1793-Washington 22.09259 11.19467          1
#> 3      1797-Adams 32.91611 18.80304          1
#> 4  1801-Jefferson 23.02834 12.86614          1

# Correct: returns all 6 rows
read %>%
  as.data.frame() %>%
  dplyr::mutate(new_column = 1)
#>          document      FOG  FOG.PSK new_column
#> 1 1789-Washington 32.90932 18.71394          1
#> 2 1793-Washington 22.09259 11.19467          1
#> 3      1797-Adams 32.91611 18.80304          1
#> 4  1801-Jefferson 23.02834 12.86614          1
#> 5  1805-Jefferson 25.69839 14.53641          1
#> 6    1809-Madison 29.62381 16.85274          1

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS  10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.8             quanteda.textstats_0.95 quanteda_3.2.1         
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3       compiler_4.0.2     pillar_1.7.0       highr_0.9         
#>  [5] tools_4.0.2        stopwords_2.3      digest_0.6.29      tibble_3.1.6      
#>  [9] evaluate_0.15      lifecycle_1.0.1    lattice_0.20-41    pkgconfig_2.0.3   
#> [13] rlang_1.0.2        reprex_2.0.1       Matrix_1.2-18      fastmatch_1.1-3   
#> [17] DBI_1.1.2          cli_3.2.0          rstudioapi_0.13    yaml_2.3.5        
#> [21] xfun_0.30          fastmap_1.1.0      withr_2.5.0        stringr_1.4.0     
#> [25] knitr_1.37         generics_0.1.2     fs_1.5.2           vctrs_0.3.8       
#> [29] tidyselect_1.1.2   grid_4.0.2         glue_1.6.2         nsyllable_1.0.1   
#> [33] R6_2.5.1           fansi_1.0.2        rmarkdown_2.13     purrr_0.3.4       
#> [37] magrittr_2.0.2     htmltools_0.5.2    ellipsis_0.3.2     assertthat_0.2.1  
#> [41] utf8_1.2.2         stringi_1.7.6      RcppParallel_5.1.5 crayon_1.5.0

Created on 2022-03-19 by the reprex package (v2.0.1)

kbenoit commented 2 years ago

Thanks for filing this. It looks like dplyr::mutate() truncates the over-classed data.frame to a number of rows equal to the number of columns, regardless of how many rows it actually contains. As you've discovered, one workaround is to coerce it to a plain data.frame using as.data.frame(). I need to dig deeper to see what in dplyr is causing this.

library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

dat <- data_corpus_inaugural

tstat <- textstat_readability(data_corpus_inaugural) %>%
    head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#>          document    Flesch new_column
#> 1 1789-Washington  2.034033          1
#> 2 1793-Washington 32.205417          1
#> 3      1797-Adams  0.900731          1

tstat <- textstat_readability(data_corpus_inaugural, 
                              measure = c("FOG", "FOG.PSK")) %>%
    head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#>          document      FOG  FOG.PSK new_column
#> 1 1789-Washington 32.90932 18.71394          1
#> 2 1793-Washington 22.09259 11.19467          1
#> 3      1797-Adams 32.91611 18.80304          1
#> 4  1801-Jefferson 23.02834 12.86614          1

tstat <- textstat_readability(data_corpus_inaugural, 
                              measure = c("Flesch", "FOG", "FOG.PSK")) %>%
    head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#>          document    Flesch      FOG  FOG.PSK new_column
#> 1 1789-Washington  2.034033 32.90932 18.71394          1
#> 2 1793-Washington 32.205417 22.09259 11.19467          1
#> 3      1797-Adams  0.900731 32.91611 18.80304          1
#> 4  1801-Jefferson 31.874466 23.02834 12.86614          1
#> 5  1805-Jefferson 23.385207 25.69839 14.53641          1

# but solved without the additional class labels
as.data.frame(tstat) %>%
    dplyr::mutate(new_column = 1)
#>           document    Flesch      FOG   FOG.PSK new_column
#> 1  1789-Washington  2.034033 32.90932 18.713936          1
#> 2  1793-Washington 32.205417 22.09259 11.194674          1
#> 3       1797-Adams  0.900731 32.91611 18.803043          1
#> 4   1801-Jefferson 31.874466 23.02834 12.866143          1
#> 5   1805-Jefferson 23.385207 25.69839 14.536413          1
#> 6     1809-Madison 11.582181 29.62381 16.852745          1
#> 7     1813-Madison 31.125809 21.78036 11.620805          1
#> 8      1817-Monroe 40.980057 18.09412  9.215629          1
#> 9      1821-Monroe 36.983083 20.47096 10.869685          1
#> 10      1825-Adams 23.604563 23.73198 12.563136          1

class(tstat)
#> [1] "readability" "textstat"    "data.frame"
class(as.data.frame(tstat))
#> [1] "data.frame"

Created on 2022-03-20 by the reprex package (v2.0.1)

antonmalko commented 2 years ago

I dug around this a little bit more, here are some thoughts...

As you mention, textstat_readability() output has classes "readability" "textstat" "data.frame". It looks like there is no [.readability function, so [.textstat is called. As far as I can see, ultimately it just converts the input to a dataframe and subsets it in the normal way. Before that it handles the missing arguments, and I think this is where the difference in [ behaviour between data.frames and textstat objects arises.
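A rough way to check which class in the chain supplies the `[` method is to walk the class vector and ask for a method at each step. The sketch below uses only base R and utils. The class names `demo_readability` and `demo_textstat` and the helper `first_bracket_class()` are hypothetical, chosen to mirror the class layout above without touching the real package.

```r
# Toy object with a three-level class vector (hypothetical class names)
x <- structure(data.frame(a = 1:3, b = 4:6),
               class = c("demo_readability", "demo_textstat", "data.frame"))

# Hypothetical helper: return the first class that has a `[` method
first_bracket_class <- function(obj) {
  for (cl in class(obj)) {
    m <- utils::getS3method("[", cl, optional = TRUE)
    if (!is.null(m)) return(cl)
  }
  NA_character_
}

# No method exists for the two custom classes yet
before <- first_bracket_class(x)

# Define a method for the middle class only, then look again
`[.demo_textstat` <- function(x, i, j, ...) NextMethod()
after <- first_bracket_class(x)

before  # "data.frame"
after   # "demo_textstat"
```

Since nothing is defined for the first class, the lookup falls through to the second one as soon as it has a method, which is the same fall-through described above for readability and textstat.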

If the goal is to make subsetting identical for data.frames and textstat objects, would it be an option to define [.textstat like this instead?

`[.textstat` <- 
  function (x, i, j, ...) 
  {
    NextMethod()
  }

I have never played much with S3 and generics, so I don't know whether this approach has pitfalls or unintended consequences somewhere down the line, but here is a little demo. (If you redefine the function like this in the global environment, dplyr::mutate() won't use it for some reason and its output will not change. But I tried changing the source for quanteda.textstats and recompiling, and then mutate() produced the desired output.)

library(quanteda)
#> Warning: package 'quanteda' was built under R version 4.0.5
#> Package version: 3.2.1
#> Unicode version: 10.0
#> ICU version: 61.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)

dat <- data_corpus_inaugural

tstat <- textstat_readability(data_corpus_inaugural) %>%
  head(n = 6)

# This will return the first row
tstat[1]
#>          document   Flesch
#> 1 1789-Washington 2.034033

# This will return the first column
as.data.frame(tstat)[1]
#>          document
#> 1 1789-Washington
#> 2 1793-Washington
#> 3      1797-Adams
#> 4  1801-Jefferson
#> 5  1805-Jefferson
#> 6    1809-Madison

# Redefine the subsetting function
`[.textstat` <- 
  function (x, i, j, ...) 
  {
    NextMethod()
  }

# Now this will return the first column, as it does for a data.frame
tstat[1]
#>          document
#> 1 1789-Washington
#> 2 1793-Washington
#> 3      1797-Adams
#> 4  1801-Jefferson
#> 5  1805-Jefferson
#> 6    1809-Madison

Created on 2022-03-22 by the reprex package (v2.0.1)

kbenoit commented 2 years ago

I fixed this by removing the [.textstat function altogether - see 0.95.1. It might have been needed in some previous version of R, but it's not needed now. This fixes the issue.
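For anyone who wants to see why removing the method is enough, here is a small base-R sketch. The class names `toy_readability` and `toy_textstat` and the values are made up for illustration and are not the package's own. With no `[` method defined for either extra class, subsetting falls through to the data.frame method.

```r
# Toy object with extra classes but no custom `[` method (values are made up)
df <- data.frame(document = paste0("doc", 1:6),
                 Flesch = c(2, 32, 1, 32, 23, 12))
tstat_like <- structure(df,
                        class = c("toy_readability", "toy_textstat", "data.frame"))

# A lone index selects columns, as with any data.frame
cols <- tstat_like[1:2]

# Rows are selected only when the comma is present
rows <- tstat_like[1:2, ]

dim(cols)  # 6 rows, 2 columns
dim(rows)  # 2 rows, 2 columns
```

To confirm that an installation already includes the fix, `packageVersion("quanteda.textstats") >= "0.95.1"` should return TRUE.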