statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

cnlp_pca() issue: "object 'tfidf' not found" #30

Closed: cainesap closed this issue 6 years ago

cainesap commented 6 years ago

Hello,

Firstly, this is a great resource, thank you!

Secondly, I'm having trouble replicating the PCA analysis from the vignette: https://cran.r-project.org/web/packages/cleanNLP/vignettes/case_study.html

The example is as follows:

pca <- cnlp_get_token(sotu) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  cnlp_get_tfidf(min_df = 0.05, max_df = 0.95, type = "tfidf", tf_weight = "dnorm") %$%
  cnlp__pca(tfidf, cnlp_get_document(sotu))

If I try to run this (after correcting a minor typo, the double underscore in cnlp__pca()), I get the following error message:

Error in stats::prcomp(x, center = center, scale. = scale) : 
  object 'tfidf' not found

I'm not sure how to fix this: please could you help? Andrew

statsmaths commented 6 years ago

Great question (and good catch on the typo). I just pushed the fix in 4950dbc3c414f7a7b5f0ac6afc171fb64eea9ead. Basically, cnlp_get_tfidf now returns just a sparse matrix with row and column names, so you pipe its result straight into cnlp_pca with %>% rather than exposing a tfidf element with %$%:

pca <- cnlp_get_token(sotu) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  cnlp_get_tfidf(min_df = 0.05, max_df = 0.95, type = "tfidf", tf_weight = "dnorm") %>%
  cnlp_pca(cnlp_get_document(sotu))

And it should work as expected. Please let me know if you still run into any trouble.
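
The switch from %$% to %>% in the snippet above can be illustrated with a small sketch that uses only magrittr. The objects old_result and new_result are hypothetical stand-ins: the first assumes the earlier return value was a list holding the matrix under the name tfidf (which is what the vignette code implies), and the second is a bare matrix like the current return value.

```r
library(magrittr)

# Hypothetical "old style" result: a list that carries the matrix under a name.
old_result <- list(tfidf = matrix(c(1, 0, 2, 3), nrow = 2))

# %$% exposes the names of the left-hand side, so `tfidf` can be used directly.
dims_old <- old_result %$% dim(tfidf)

# Hypothetical "new style" result: the matrix itself, with no `tfidf` element.
new_result <- matrix(c(1, 0, 2, 3), nrow = 2)

# %>% passes the whole left-hand side as the first argument of the call.
dims_new <- new_result %>% dim()

# Both routes reach the same matrix.
stopifnot(identical(dims_old, dims_new))
```

With a bare matrix there is no element called tfidf to expose, which is consistent with the "object 'tfidf' not found" message reported above.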

cainesap commented 6 years ago

Thank you for the quick response! That error goes away, but now I get:

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

That made me wonder whether one of my nouns is a number, but I checked and none is. However, I've taken the tfidf object and passed it to plain prcomp(), so this has still been super useful, thanks! Don't worry though, I expect it's a problem with my dataset.

One other question: could you share your plot code for the PCA? It's very good looking!
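
For readers who hit the same colMeans message, here is a minimal sketch of the workaround described above: passing the tf-idf matrix to plain prcomp(). The cause of the error is not established in this thread. The sketch only shows one defensive route, which is to coerce the sparse matrix to a dense one and confirm it is numeric before running the PCA. The matrix tfidf_toy and its values are made up for illustration.

```r
library(Matrix)  # sparse matrix classes; a recommended package shipped with R

# Hypothetical stand-in for a tf-idf matrix: 6 documents by 4 terms.
tfidf_toy <- Matrix(
  c(0.2, 0.0, 0.5, 0.0, 0.1, 0.0,
    0.0, 0.3, 0.0, 0.4, 0.0, 0.6,
    0.1, 0.1, 0.0, 0.0, 0.7, 0.0,
    0.0, 0.0, 0.9, 0.2, 0.0, 0.3),
  nrow = 6, ncol = 4, sparse = TRUE,
  dimnames = list(paste0("doc", 1:6), paste0("term", 1:4))
)

# Coerce to a base dense matrix and check its type before calling prcomp().
dense <- as.matrix(tfidf_toy)
stopifnot(is.numeric(dense))

# Drop zero-variance columns, which cannot be rescaled to unit variance.
keep <- apply(dense, 2, var) > 0
pca_toy <- prcomp(dense[, keep, drop = FALSE], center = TRUE, scale. = TRUE)

# Keep the first two components alongside the document identifiers.
scores <- data.frame(doc = rownames(dense), pca_toy$x[, 1:2])
```

A data frame like scores, joined with document metadata, has the shape that PC1/PC2 plotting code expects.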

statsmaths commented 6 years ago

Sure, here is the code that you should be able to adapt to your data:

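library(ggplot2)   # ggplot, geom_point, labs, theme
library(ggrepel)   # geom_text_repel
library(viridis)   # scale_color_viridis
library(dplyr)     # filter

# Assumes pca is a data frame with columns PC1, PC2, year and president.
ggplot(pca, aes(PC1, PC2)) +
  # One point per document, colored by year binned into 10 intervals
  geom_point(aes(color = cut(year, 10, dig.lab = 4)), alpha = 0.35, size = 4) +
  # Label only the first document of each president to avoid clutter
  geom_text_repel(data = filter(pca, !duplicated(president)),
                  aes(label = president), color = grey(0.4), cex = 3) +
  labs(color = "Year") +
  scale_color_viridis(discrete = TRUE, end = 0.9, option = "C") +
  theme(axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text.x = element_blank(),
        axis.text.y = element_blank())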