a problem for short texts

letitbk commented 3 years ago

Hi,

Thanks for making and maintaining this important package. I encountered the following error when applying it to some short texts:

Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-

I found that the issue is caused by the punctuation-removal step in the get_covars_new.corpus function:

    # remove punctuation
    result <- result[pos != "PUNCT" & pos != "SPACE"]

In some cases, this step removes every token of a document that has no meaningful text, so the document disappears from the result and the final output no longer matches the input texts. I think it would be better to assign NA to these cases instead of dropping them; at the very least, I hope you can make the function work for such short texts.
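The mismatch can be illustrated without spaCy. The following is a minimal base R sketch, not the package's actual code path: the token table and its POS tags are made-up stand-ins for spacy_parse() output, and per_doc and input_ids are hypothetical names.

```r
# Toy token table standing in for spacy_parse() output (values are made up).
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc2", "doc3"),
  token  = c("This", "example", "TRUMP2016", "!"),
  pos    = c("DET", "NOUN", "PUNCT", "PUNCT"),
  stringsAsFactors = FALSE
)
input_ids <- c("doc1", "doc2", "doc3")

# Same filter as in get_covars_new.corpus(): doc2 and doc3 lose all their rows.
kept <- tokens[tokens$pos != "PUNCT" & tokens$pos != "SPACE", ]

# Aggregate to one row per surviving document.
per_doc <- aggregate(token ~ doc_id, data = kept, FUN = length)
names(per_doc)[2] <- "n_token"

# Only one row survives for three inputs, so labelling the rows with the
# input document names fails; the error message is captured for inspection.
err <- tryCatch(
  {
    row.names(per_doc) <- input_ids
    NULL
  },
  error = function(e) conditionMessage(e)
)
print(err)
```

With one surviving row and three input names, the assignment should fail with the same kind of row.names length error as reported above.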

Thanks again for making this important package.


get_covars_new.corpus <- function(x, baseline_year = 2000, verbose = FALSE) {
    google_min <- pos <- `:=` <- nchars <- token <- sentence_id <- years <- NULL
    doc_id <- .N <- NULL

    if (verbose) message("   ...tagging parts of speech")
    suppressMessages(
        spacyr::spacy_initialize()
    )
    result <- data.table(spacyr::spacy_parse(texts(x), tag = FALSE, lemma = FALSE, entity = FALSE, dependency = FALSE))
    # remove punctuation
    result <- result[pos != "PUNCT" & pos != "SPACE"]

    # if years is a vector, repeat for each token
    if (length(baseline_year) > 1)
        baseline_year <- rep(baseline_year,
                             result[, list(years = length(sentence_id)), by = doc_id][, years])

    if (verbose) message("   ...computing word lengths in characters")
    result[, nchars := stringi::stri_length(token)]

    if (verbose) message("   ...computing baselines from Google frequencies")
    bl_google <- 
        suppressWarnings(make_baselines_google(result$token, baseline_word = "the",
                                               baseline_year = baseline_year)[, 2])
    result[, google_min := bl_google]

    if (verbose) message("   ...aggregating to sentence level")
    result[,
           list(doc_id = doc_id[1],
                n_noun = sum(pos == "NOUN", na.rm = TRUE),
                n_chars = sum(nchars, na.rm = TRUE),
                google_min = min(google_min, na.rm = TRUE),
                n_token = .N),
           by = c("sentence_id", "doc_id")]
}

letitbk commented 3 years ago

Hi @kbenoit @kmunger @ArthurSpirling, I'm tagging you here in case you haven't seen this issue. Thanks!!

kbenoit commented 3 years ago

Thanks for pointing this out @letitbk, I will look into it.

Could you supply the "short" texts as an example, or a link to a data object, so I can test this exactly? Thanks!

We're working on a quanteda v3 release, so I had been meaning to test the sophistication package against it to make sure it still works; now I have a second reason.

letitbk commented 3 years ago

Thanks for looking at this. Here are some reproducible examples (some of them are real comments from Facebook pages).

library(quanteda)
library(sophistication)
data(data_BTm_bms)
corp = c('This is an example',
  'TRUMP2016','Trump\U0001F1FA\U0001F1F8\U0001F4AF',
  'Awwww\U0001F605\U0001F605\U0001F605\U0001F605',
  'JackieRIP','FrankRIP','VoteTrump16')
names(corp) = paste0('doc',1:7)
pred_sophistication <- predict_readability(data_BTm_bms, 
                                newdata = corp, 
                                bootstrap_n = 0)

For now, I drop the affected documents from the corpus in advance so that the error does not occur. Here's my tentative solution.

library(data.table)

# Keep only documents that still have tokens after punctuation/space removal
drop_corp = function(corp){
    corp_drop <- data.table(spacyr::spacy_parse(texts(corp), tag = FALSE, lemma = FALSE, entity = FALSE, dependency = FALSE))

    # remove punctuation
    corp_drop <- corp_drop[pos != "PUNCT" & pos != "SPACE"]

    # ids of documents with at least one remaining token
    selected_doc = corp_drop[,unique(doc_id)]

    return(corp[names(corp) %in% selected_doc])
}

corp = drop_corp(corp)

kbenoit commented 3 years ago

Hi, I've investigated this now, and the problem comes from exactly what you identified: "TRUMP2016", for instance, is tagged as "PUNCT". Since we remove these tokens, and the text consists solely of that token, nothing is predicted for that document. The step that re-integrates the predictions with the original document list then breaks, because it does not expect any missing predictions. The more robust and correct solution would be to return NA for these documents. That makes sense, because readability is undefined for "!" or "FrankRIP". (That said, spaCy should really not be tagging these as PUNCT.)

I've just updated sophistication for the new quanteda v3 and will work on this fix next.
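The idea of returning NA for documents that lose all their tokens can be sketched in base R as follows. This is only an illustration under stated assumptions, not the fix that went into the package: pad_missing_docs() is a hypothetical helper, and per_doc is assumed to be a data frame with a doc_id column and one row per surviving document.

```r
# Hypothetical helper (not part of sophistication): re-align per-document
# results with the original inputs, filling dropped documents with NA.
pad_missing_docs <- function(per_doc, input_ids) {
  # Position of each input document in the results; NA if it was dropped.
  idx <- match(input_ids, per_doc$doc_id)
  # Indexing rows by NA yields all-NA rows for the dropped documents.
  out <- per_doc[idx, , drop = FALSE]
  # Restore the document ids and discard the auto-generated row names.
  out$doc_id <- input_ids
  row.names(out) <- NULL
  out
}

# Made-up results in which doc2 and doc3 were removed by the filter.
per_doc <- data.frame(
  doc_id  = c("doc1", "doc4"),
  n_token = c(4L, 2L),
  stringsAsFactors = FALSE
)
padded <- pad_missing_docs(per_doc, paste0("doc", 1:4))
print(padded)
```

Because the padded result has exactly one row per input document, in input order, a later step that attaches document names no longer sees a length mismatch.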

kbenoit / sophistication

a problem for short texts #20