Open letitbk opened 3 years ago
Hi @kbenoit @kmunger @ArthurSpirling, I'm tagging you here in case you haven't seen this issue yet. Thanks!!
Thanks for pointing this out @letitbk, I will look into it.
Could you supply the "short" text as an example, or a link to a data object, so I can test this exactly? Thanks!
We're working on a quanteda v3 release, so I have been meaning to test the sophistication package with it to make sure it's working, and now I have a second reason.
Thanks for looking at this. Here are some reproducible examples (some of them are real comments from Facebook pages).
```r
library(quanteda)
library(sophistication)
data(data_BTm_bms)

# Short texts, some of them consisting of a single token
corp = c('This is an example',
         'TRUMP2016', 'Trump<f0><U+009F><U+0087><U+00BA><f0><U+009F><U+0087><U+00B8><f0><U+009F><U+0092><U+00AF>',
         'Awwww<f0><U+009F><U+0098><U+0085><f0><U+009F><U+0098><U+0085><f0><U+009F><U+0098><U+0085><f0><U+009F><U+0098><U+0085>',
         'JackieRIP', 'FrankRIP', 'VoteTrump16')
names(corp) = paste0('doc', 1:7)

pred_sophistication <- predict_readability(data_BTm_bms,
                                           newdata = corp,
                                           bootstrap_n = 0)
```
For now, I drop the problematic documents from the corpus in advance so that I don't get the error message. Here's my tentative solution.
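```r
library(data.table)

# Keep only the documents that still have tokens once punctuation and spaces are removed
drop_corp = function(corp){
  corp_drop <- data.table(spacyr::spacy_parse(texts(corp), tag = FALSE, lemma = FALSE,
                                              entity = FALSE, dependency = FALSE))
  # remove punctuation and whitespace tokens
  corp_drop <- corp_drop[pos != "PUNCT" & pos != "SPACE"]
  # ids of the documents that have at least one remaining token
  selected_doc = corp_drop[, unique(doc_id)]
  return(corp[names(corp) %in% selected_doc])
}

corp = drop_corp(corp)
```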
Hi, I've investigated this now, and the problem comes from exactly what you identified: "TRUMP2016", for instance, is tagged as "PUNCT". Since we remove these elements and the "text" consists solely of that element, no prediction is returned for that document. The step that re-integrates the predictions with the original document list then breaks, because it does not expect any blank predictions. The more robust and correct solution would be to return these as NA. That makes sense, because readability is undefined for "!" or "FrankRIP". (Although spaCy should really not be identifying these as PUNCT.)
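As a rough illustration of the NA approach described above, here is a minimal base R sketch. It is not the package's actual fix: `pad_predictions()` is a hypothetical helper, the predictions are assumed to be a named numeric vector (the real return type of `predict_readability()` may differ), and the value used is made up.

```r
# Hypothetical helper, not part of sophistication: re-align predictions with
# the full list of input documents, giving NA to any document that was dropped.
pad_predictions <- function(pred, doc_names) {
  # pred:      named numeric vector with predictions for the surviving documents
  # doc_names: character vector with the names of all input documents, in order
  out <- pred[doc_names]     # indexing by a name that is absent yields NA
  names(out) <- doc_names    # restore the names of the dropped documents
  out
}

all_docs <- paste0("doc", 1:7)

# Made-up value, assuming for illustration that only doc1 kept any tokens
pred <- c(doc1 = 0.5)

padded <- pad_predictions(pred, all_docs)
print(padded)
```

With this kind of padding the output always has one entry per input document, so the lengths match when the results are merged back.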
I've just updated sophistication for the new quanteda v3, will work on this fix next.
Hi,
Thanks for making and maintaining your important package. I've encountered the following error when I applied your package to some short texts.
I found that the issue is caused by the punctuation removal in the `get_covars_new.corpus` function. Basically, in some cases it removes documents that have no meaningful text, so there is a mismatch between the final output and the input texts. I think it would be better to assign NA to these cases instead of just removing them, or at the very least I hope you can make it work for short texts.
Thanks again for making this important package.