quanteda / quanteda.textstats

Textual statistics for quanteda
GNU General Public License v3.0

MATTR calculation defaults to wrong window length if provided window length exceeds document length #60

Open mweylandt opened 1 year ago

mweylandt commented 1 year ago

Hello,

Apologies if my issue is based on a misunderstanding.

When I use textstat_lexdiv to calculate MATTR, and a document is shorter than the MATTR_window specified as an argument to the function, the function throws an error.

This is because the function (compute_mattr) checks for this case and resets MATTR_window to the token length of the longest document in the corpus. Passing that value to tokens_ngrams further down produces empty entries for every shorter document, which trips up the calculation of the TTR and causes the error.
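To make the empty entries concrete, here is a minimal base R sketch (not quanteda code). The helper n_windows is hypothetical, and the document lengths are illustrative: 33 matches the value in the warning message, while 24 is an assumed length for the shorter document.

```r
# Hypothetical helper: how many full windows of width `window` fit into
# documents with `n_tokens` tokens. A document shorter than the window
# yields zero windows, i.e. an empty entry.
n_windows <- function(n_tokens, window) {
  pmax(n_tokens - window + 1L, 0L)
}

# Illustrative lengths (assumed, see the note above)
doc_lengths <- c(text1 = 24L, text2 = 33L)

# Window reset to the longest document: the shorter document gets 0 windows
n_windows(doc_lengths, max(doc_lengths))

# Window reset to the shortest document: every document gets at least 1 window
n_windows(doc_lengths, min(doc_lengths))
```

With max(), the first document ends up with zero windows, which is the situation that leads to the "dfm must have at least one non-zero value" error.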

I believe the window should be set to the length of the shortest document in the corpus -- as MATTR is calculated by averaging the TTRs of a moving window across the document, the window cannot be longer than any document it is applied to. An alternative would be to rewrite the function so that it returns NA for documents that are too short to calculate this value.
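For reference, the moving-window average can be sketched in a few lines of base R. This is only an illustration of the definition, not the quanteda.textstats implementation; the function names ttr and mattr are made up for the example, tokens are compared as-is (no lowercasing or punctuation removal), and the NA branch implements the alternative mentioned above.

```r
# Type-token ratio of a character vector of tokens
ttr <- function(tokens) length(unique(tokens)) / length(tokens)

# MATTR sketch: mean TTR over every full window of `window` consecutive tokens.
# Returns NA when the document is shorter than the window.
mattr <- function(tokens, window) {
  n <- length(tokens)
  if (n < window) return(NA_real_)
  starts <- seq_len(n - window + 1L)
  mean(vapply(starts,
              function(i) ttr(tokens[i:(i + window - 1L)]),
              numeric(1)))
}

toks_example <- c("a", "b", "a", "c", "a", "b")

# Windows: (a b a), (b a c), (a c a), (c a b) with TTRs 2/3, 1, 2/3, 1
mattr(toks_example, window = 3)   # 5/6

# Document shorter than the window
mattr(toks_example, window = 10)  # NA
```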

Reproducible Example

library(quanteda)
library(quanteda.textstats)
library(magrittr)  # provides %>%

txt <- c("Anyway, like I was sayin', shrimp is the fruit of the sea. You can
          barbecue it, boil it, broil it, bake it, saute it.",
         "There's shrimp-kabobs,
          shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's
          pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup,
          shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp
          sandwich.")
tokens(txt) %>%
  textstat_lexdiv(measure = c("TTR", "CTTR", "K", "MATTR"))

# Error in textstat_lexdiv.dfm(dfm(tokens(y)), "TTR") : 
#  dfm must have at least one non-zero value
# In addition: Warning message:
# MATTR_window exceeds some documents' token lengths, resetting to 33 

More Details

I'm including the original function below, with the suggested fix in a comment:

function (x, MATTR_window = 100L) 
{
  if (MATTR_window < 1) 
    stop("MATTR_window must be positive")
  if (any(ntoken(x) < MATTR_window)) {
    MATTR_window <- max(ntoken(x)) # this should be min(ntoken(x))
    warning("MATTR_window exceeds some documents' token lengths, resetting to ", 
            MATTR_window, call. = FALSE)
  }
  x <- tokens_ngrams(x, n = MATTR_window, concatenator = " ")  
  temp <- lapply(as.list(x), function(y) textstat_lexdiv(dfm(tokens(y)), 
                                                         "TTR")[["TTR"]])
  result <- unlist(lapply(temp, mean))
  return(result)
}

Again, if I've misunderstood any conceptual issue (which may well be the case, as the same process is applied to MSTTR), apologies -- I'm new to these text diversity measures. If not, I'm happy to do a pull request if that saves you some time!

kbenoit commented 1 year ago

Thanks for pointing this out, I'll fix it asap.

mweylandt commented 1 year ago

Hello, thanks for responding!

I've thought about this some more and the fix I suggested may also be inadequate. One could come across a case (like I have recently) where there is a very wide range of document lengths. In practical terms, the calculation could default to using a window size of 1 or 2 for calculations, which would render MATTR and MSTTR meaningless as well. I wonder if it would make sense to write it such that any documents with fewer tokens than the window width simply don't get a MATTR/MSTTR rather than the one based on a window of the minimum document length.
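A small base R sketch of why a tiny window makes the measure meaningless (illustrative only; mattr here is a hypothetical stand-in, not the package function). With a window of 1, each window holds one token and one type, so every document scores exactly 1 no matter how repetitive it is.

```r
# Hypothetical stand-in for a moving-average TTR on a character vector
mattr <- function(tokens, window) {
  n <- length(tokens)
  if (n < window) return(NA_real_)
  starts <- seq_len(n - window + 1L)
  mean(vapply(starts, function(i) {
    w <- tokens[i:(i + window - 1L)]
    length(unique(w)) / length(w)
  }, numeric(1)))
}

repetitive <- rep("shrimp", 20)  # one type repeated 20 times
varied     <- letters[1:20]      # 20 distinct types

# Window of 1: both documents look maximally diverse
mattr(repetitive, window = 1)    # 1
mattr(varied, window = 1)        # 1

# Window of 10: the difference between the documents shows up
mattr(repetitive, window = 10)   # 0.1
mattr(varied, window = 10)       # 1
```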

Would appreciate your thoughts, and be happy to assist if it is possible to do so! Thanks for a great set of tools.

kbenoit commented 1 year ago

That's a good idea - set a minimum document length below which a document has an NA returned for a moving average measure.
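One way this rule could look, sketched in base R on plain token vectors. This is an assumption about the intended behaviour, not the package's code: the function name mattr_or_na and the min_length argument are hypothetical. The window stays fixed for all documents, and only documents below the minimum length get NA.

```r
# Hypothetical sketch: `docs` is a named list of character vectors (tokens).
# Documents with fewer than `min_length` tokens return NA; all others use
# the same fixed window.
mattr_or_na <- function(docs, window = 100L, min_length = window) {
  stopifnot(window >= 1L, min_length >= window)
  vapply(docs, function(tokens) {
    n <- length(tokens)
    if (n < min_length) return(NA_real_)
    starts <- seq_len(n - window + 1L)
    mean(vapply(starts, function(i) {
      w <- tokens[i:(i + window - 1L)]
      length(unique(w)) / window
    }, numeric(1)))
  }, numeric(1))
}

docs <- list(
  text1 = c("fish", "sticks"),
  text2 = c("a", "b", "a", "c", "a", "b")
)

# text1 is shorter than the window and gets NA; text2 gets 5/6
mattr_or_na(docs, window = 3L)
```

Requiring min_length to be at least the window width guarantees that every document that is scored has at least one full window.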

mweylandt commented 11 months ago

I thought I would share how I ended up doing it for my project, in case it's helpful.

I simply check whether the dfm is empty in the function that calculates MATTR, and then return NA for it.

compute_mattr <- function(x, MATTR_window = 100L, min_window = 5L)
{
  if (MATTR_window < 1) 
    stop("MATTR_window must be positive")
  if (any(ntoken(x) < MATTR_window)) {
    MATTR_window <- min_window
    warning("MATTR_window exceeds some documents' token lengths, resetting to minimum window size: ", 
            min_window, call. = FALSE)
  }
  if (any(ntoken(x) < min_window)) {
    warning("min_window exceeds some documents' token lengths, these documents will return NA", 
            call. = FALSE)
  }

  x <- tokens_ngrams(x, n = MATTR_window, concatenator = " ")  

  # check whether the dfm is empty and return NA, else go on as previously
  check_dfm <- function(y) {
    txdfm <- dfm(tokens(y))
    if (!sum(txdfm)) return(NA)
    quanteda.textstats::textstat_lexdiv(txdfm, "TTR")[["TTR"]]
  }

  temp <- lapply(as.list(x), check_dfm)
  result <- unlist(lapply(temp, mean))

  return(result)
}

txt <- c("fish sticks",
         "Anyway, like I was sayin', shrimp is the fruit of the sea. You can
          barbecue it, boil it, broil it, bake it, saute it.",
         "There's shrimp-kabobs,
          shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's
          pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup,
          shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp
          sandwich.")

toks <- tokens(txt)

compute_mattr(toks, MATTR_window = 35, min_window = 5)
#    text1     text2     text3
#       NA 0.9057471 0.8574074
# Warning messages:
# 1: MATTR_window exceeds some documents' token lengths, resetting to minimum window size: 5
# 2: min_window exceeds some documents' token lengths, these documents will return NA

I worried that these checks would slow the function down on large corpora, but in my (limited) tests it seems fine.

The other alternative is to allow textstat_lexdiv to pass an empty dfm along to compute_lexdiv_dfm_stats. Currently it checks for this and throws an error, and compute_lexdiv_dfm_stats can't handle an empty dfm in any case (in my tests so far).

Just thought I'd put this here in case it's helpful.