statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

Error in cbind_all(x) : Argument 2 must be length 514, not 361 #40

Closed: amyhuntington closed this issue 5 years ago

amyhuntington commented 5 years ago

Hello!

I've been following your state of the union vignette closely and have really enjoyed what you've created!

I found this discussion: https://github.com/statsmaths/cleanNLP/issues/30 and have applied your suggestions; however, I too am having trouble with the PCA part of the analysis and cannot find a good solution.

Here's the code:

pca <- cnlp_get_token(spacy_annotation) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  cnlp_get_tfidf(min_df = 0.05, max_df = 0.95, type = "tfidf",
                 tf_weight = "dnorm") %>%
  cnlp_pca(cnlp_get_document(spacy_annotation))

Here's the error:

Error in cbind_all(x) : Argument 2 must be length 514, not 361

I've searched for a cbind_all solution but am coming up short. 514 is the original number of rows in spacy_annotation$document; 361 x 15 are the dimensions of the tfidf matrix.

What should I do?

Thanks!!

amyhuntington commented 5 years ago

I should add that I am not using the obama/sotu data; I am using my own data. I suspect the issue comes from the fact that not every id/document contains an NN or NNS token, resulting in fewer documents than the original. This seems like it's going to be a fairly common use case, though.
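Here is a quick check I sketched to confirm which documents drop out (assuming the token table's id column matches the document table's id, as in the vignette):

token_doc_ids <- cnlp_get_token(spacy_annotation) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  pull(id) %>%
  unique()

# documents with no NN/NNS tokens, i.e. the rows missing from the tfidf matrix
setdiff(cnlp_get_document(spacy_annotation)$id, token_doc_ids)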

Thanks again.

statsmaths commented 5 years ago

You are completely correct that the issue comes from the fact that not all of your documents survive the filtering, so the TF-IDF matrix does not have enough rows. You're also correct that this is a common problem; however, there is not a particularly clean way of dealing with it given how data are handled in cleanNLP at the moment.

For your specific case, here is a minimal working example in which one document is removed by the filter command, along with how to deal with it:

library(cleanNLP)
library(dplyr)

docs <- c("Hello here is simple example.", "Same here!",
          "See here too.")

cnlp_init_spacy()
spacy_annotation <- cnlp_annotate(docs)

# filter to verbs/nouns; documents containing none (doc2 here) drop out,
# so the tfidf matrix ends up with fewer rows than the document table
tfidf <- cnlp_get_token(spacy_annotation) %>%
  filter(upos %in% c("VERB","NOUN")) %>%
  cnlp_utils_tfidf(min_df = -1, max_df = 2, type = "tfidf",
                   tf_weight = "dnorm")

# align the metadata: keep only documents still present in the tfidf matrix
meta <- filter(cnlp_get_document(spacy_annotation), id %in% rownames(tfidf))
pca <- cnlp_utils_pca(tfidf, meta)
pca
# A tibble: 2 x 7
  id    time       version language   uri                             PC1   PC2
  <chr> <chr>      <chr>   <chr>      <chr>                         <dbl> <dbl>
1 doc1  2018-11-0… 2.0.11  en_core_w… /var/folders/9b/0fj3dzqd4l70… -1.22     0
2 doc3  2018-11-0… 2.0.11  en_core_w… /var/folders/9b/0fj3dzqd4l70…  1.22     0
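Adapted back to your original pipeline, the fix would look roughly like this (a sketch reusing your object names and pos filter; it uses the cnlp_utils_* names from the GitHub version, so substitute cnlp_get_tfidf/cnlp_pca if you stay on CRAN):

tfidf <- cnlp_get_token(spacy_annotation) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  cnlp_utils_tfidf(min_df = 0.05, max_df = 0.95, type = "tfidf",
                   tf_weight = "dnorm")

meta <- filter(cnlp_get_document(spacy_annotation),
               id %in% rownames(tfidf))
pca <- cnlp_utils_pca(tfidf, meta)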

Note that I am using the version from GitHub (2.0.4), not the one on CRAN. A few of the functions have been renamed and updated, so you may need to update for this example to work.
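If you need to update, the development version installs from GitHub in the usual way (assuming the remotes package is available):

remotes::install_github("statsmaths/cleanNLP")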

amyhuntington commented 5 years ago

This is FANTASTIC and pretty simple to work through. I really appreciate your help and attention on this.

I have paired your cleanNLP approach with sentimentr. The results are incredibly illuminating and I'm at the stage where I'd like to train spaCy for NER specific to the population's common entities.

If I train a spaCy model in Python, do you anticipate any issues calling it (by name in the initialization step, I assume) once it's ready to be used in cleanNLP?

I know this is off topic; happy to start another thread. Have you trained any spaCy models and called them with cleanNLP?

statsmaths commented 5 years ago

No, I actually haven't tried that with cleanNLP. It should work fine, though. Please open a new issue if you try it and run into trouble. I'm curious how it goes!
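Off the top of my head, I would expect the initialization to look something like the sketch below. This is untested: it assumes cnlp_init_spacy's model_name argument accepts any spaCy model installed in the active Python environment, and "my_ner_model" is a hypothetical placeholder for your trained model's name.

library(cleanNLP)

# "my_ner_model" is a hypothetical placeholder for a custom spaCy
# model installed in the same Python environment
cnlp_init_spacy(model_name = "my_ner_model")
annotation <- cnlp_annotate(docs)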

fahadshery commented 5 years ago

@amyhuntington Could you share your workflow, please? Perhaps share it on GitHub?