Closed antonmalko closed 2 years ago
Thanks for filing this. It looks like dplyr::mutate() truncates the over-classed data.frame to a number of rows equal to the number of columns, regardless of how many rows it actually contains. As you've discovered, one workaround is to coerce it to a plain data.frame using as.data.frame(). I need to dig deeper to see what in dplyr is causing this.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
dat <- data_corpus_inaugural
tstat <- textstat_readability(data_corpus_inaugural) %>%
  head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#> document Flesch new_column
#> 1 1789-Washington 2.034033 1
#> 2 1793-Washington 32.205417 1
#> 3 1797-Adams 0.900731 1
tstat <- textstat_readability(data_corpus_inaugural,
                              measure = c("FOG", "FOG.PSK")) %>%
  head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#> document FOG FOG.PSK new_column
#> 1 1789-Washington 32.90932 18.71394 1
#> 2 1793-Washington 22.09259 11.19467 1
#> 3 1797-Adams 32.91611 18.80304 1
#> 4 1801-Jefferson 23.02834 12.86614 1
tstat <- textstat_readability(data_corpus_inaugural,
                              measure = c("Flesch", "FOG", "FOG.PSK")) %>%
  head(n = 10)
dplyr::mutate(tstat, new_column = 1)
#> document Flesch FOG FOG.PSK new_column
#> 1 1789-Washington 2.034033 32.90932 18.71394 1
#> 2 1793-Washington 32.205417 22.09259 11.19467 1
#> 3 1797-Adams 0.900731 32.91611 18.80304 1
#> 4 1801-Jefferson 31.874466 23.02834 12.86614 1
#> 5 1805-Jefferson 23.385207 25.69839 14.53641 1
# but solved without the additional class labels
as.data.frame(tstat) %>%
  dplyr::mutate(new_column = 1)
#> document Flesch FOG FOG.PSK new_column
#> 1 1789-Washington 2.034033 32.90932 18.713936 1
#> 2 1793-Washington 32.205417 22.09259 11.194674 1
#> 3 1797-Adams 0.900731 32.91611 18.803043 1
#> 4 1801-Jefferson 31.874466 23.02834 12.866143 1
#> 5 1805-Jefferson 23.385207 25.69839 14.536413 1
#> 6 1809-Madison 11.582181 29.62381 16.852745 1
#> 7 1813-Madison 31.125809 21.78036 11.620805 1
#> 8 1817-Monroe 40.980057 18.09412 9.215629 1
#> 9 1821-Monroe 36.983083 20.47096 10.869685 1
#> 10 1825-Adams 23.604563 23.73198 12.563136 1
class(tstat)
#> [1] "readability" "textstat" "data.frame"
class(as.data.frame(tstat))
#> [1] "data.frame"
Created on 2022-03-20 by the reprex package (v2.0.1)
I dug around this a little bit more, here are some thoughts...
As you mention, textstat_readability() output has classes "readability" "textstat" "data.frame". It looks like there is no [.readability function, so [.textstat is called. As far as I can see, ultimately it just converts the input to a data.frame and subsets it in the normal way. Before that it handles the missing arguments, and I think this is where the difference in the behaviour of [ between data.frames and textstat objects arises.
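To make the suspected mechanism concrete, here is a minimal sketch in base R. The class name "toystat" and its method are hypothetical and are not the actual [.textstat source; the sketch only assumes that the method forwards a lone index as a row index when j is missing.

```r
# Hypothetical class "toystat": NOT the real `[.textstat` implementation.
# Assumption: when `j` is missing, the single index is forwarded as rows.
`[.toystat` <- function(x, i, j, ...) {
  class(x) <- "data.frame"      # drop the extra class, keep the data
  if (missing(j)) {
    x[i, , drop = FALSE]        # lone index treated as a row index
  } else {
    x[i, j, drop = FALSE]
  }
}

# Toy object with 6 rows and 2 columns (made-up values)
toy <- structure(
  data.frame(document = paste0("doc", 1:6), score = c(2, 32, 1, 32, 23, 12)),
  class = c("toystat", "data.frame")
)

toy_dim   <- dim(toy[1:2])                  # 2 2: two rows, all columns
plain_dim <- dim(as.data.frame(toy)[1:2])   # 6 2: all rows, two columns
```

With a method like this, any caller that writes x[loc] expecting columns, as a plain data.frame would give, silently receives rows instead.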
If the goal is to make subsetting identical for data.frames and textstat objects, would it be an option to define [.textstat like this instead?
`[.textstat` <- function(x, i, j, ...) {
  NextMethod()
}
I have never played much with S3 and generics, so I don't know whether this approach has pitfalls or unintended consequences somewhere down the line, but here is a little demo. (If you redefine the function like this in the environment, dplyr::mutate() won't use it for some reason and its output will not change. But I tried changing the source for quanteda.textstats and recompiling, and then mutate() produced the desired output.)
library(quanteda)
#> Warning: package 'quanteda' was built under R version 4.0.5
#> Package version: 3.2.1
#> Unicode version: 10.0
#> ICU version: 61.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
dat <- data_corpus_inaugural
tstat <- textstat_readability(data_corpus_inaugural) %>%
  head(n = 6)
# This will return the first row
tstat[1]
#> document Flesch
#> 1 1789-Washington 2.034033
# This will return the first column
as.data.frame(tstat)[1]
#> document
#> 1 1789-Washington
#> 2 1793-Washington
#> 3 1797-Adams
#> 4 1801-Jefferson
#> 5 1805-Jefferson
#> 6 1809-Madison
# Redefine the subsetting function
`[.textstat` <- function(x, i, j, ...) {
  NextMethod()
}
# Now this will return the first column, as it does for a data.frame
tstat[1]
#> document
#> 1 1789-Washington
#> 2 1793-Washington
#> 3 1797-Adams
#> 4 1801-Jefferson
#> 5 1805-Jefferson
#> 6 1809-Madison
Created on 2022-03-22 by the reprex package (v2.0.1)
I fixed this by removing the [.textstat function altogether - see 0.95.1. It might have been needed in some previous version of R, but it's not needed now. This fixes the issue.
Hi,
While working with someone's code, I stumbled across the following behaviour. Suppose I run textstat_readability() on a number of texts, and then try to add a column to the resulting object using dplyr::mutate(). The output will only contain C+1 rows, where C is the number of columns in the textstat_readability() output. Please see the minimal working example below.
As far as I can see, ultimately this happens because of how selection with [] works. dplyr::mutate() calls a dplyr_col_select() function, which uses the following expression to select which columns should be kept: .data[loc] (.data is the input object, and loc is a vector of integers). This is where the problem arises: if .data is of class data.frame, loc gets interpreted as column indices, and all works correctly. However, if .data is the output from the textstat_readability() function, loc gets interpreted as row indices, and only the first few rows are returned.
(I am not sure whether the [] behaviour with textstat_readability() output is defined by quanteda, so maybe this issue would be better addressed at some other level...)
Thank you!
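For reference, the two readings of an integer index mentioned above can be reproduced on a plain data.frame in base R. This is only a sketch with made-up values; df and loc are illustrative names, and nothing here calls dplyr or quanteda.

```r
# Plain data.frame with 6 rows and 2 columns (made-up values)
df <- data.frame(
  document = paste0("doc", 1:6),
  score    = c(2.0, 32.2, 0.9, 31.9, 23.4, 11.6)
)
loc <- 1:2

by_col <- df[loc]     # single index: loc is read as column positions
by_row <- df[loc, ]   # index followed by a comma: loc is read as row positions

dim(by_col)  # 6 2
dim(by_row)  # 2 2
```

Code that relies on the first form only works if the object's [ method keeps the data.frame convention that a single index selects columns.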
Created on 2022-03-19 by the reprex package (v2.0.1)