Open koheiw opened 6 years ago
Networks typically reflect co-occurrence rather than similarity - are there examples of networks that show similarity? Either way, it would be easy to write a method for textplot_network() that accepts the return objects from textstat_simil() as inputs.
A more common plot would be a heatmap of similarity called textplot_simil() that took a similarity input. This is a pretty common way to plot a matrix of correlations/similarities and easily available in ggplot2.
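A minimal sketch of what such a heatmap might look like with ggplot2 directly (assuming current quanteda 3.x, where textstat_simil() lives in quanteda.textstats; the long-format reshaping and colour choices are purely illustrative, not a proposed textplot_simil() API):
library("quanteda")
library("quanteda.textstats")
library("ggplot2")

# cosine similarities between the post-1980 inaugural speeches
simmat <- data_corpus_inaugural %>%
  corpus_subset(Year > 1980) %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  textstat_simil(method = "cosine") %>%
  as.matrix()

# reshape to long format and draw a tile-based heatmap
simdf <- data.frame(
  doc1 = rep(rownames(simmat), times = ncol(simmat)),
  doc2 = rep(colnames(simmat), each = nrow(simmat)),
  simil = as.vector(simmat)
)
ggplot(simdf, aes(x = doc1, y = doc2, fill = simil)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))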
On 28 Oct 2018, at 05:20, Kohei Watanabe (notifications@github.com) wrote:
If we treat a similarity matrix as a type of adjacency matrix, we can plot a semantic network using textplot_network() in a few steps. Why don't we make this an official function?
require(quanteda)
mt <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords())
mt <- dfm_trim(mt, min_termfreq = 100)
sim <- textstat_proxy(mt, margin = "features")
textplot_network(quanteda:::as.fcm(as(sim, "dgTMatrix")), min_freq = 0.95)
[rplot] https://user-images.githubusercontent.com/6572963/47614121-eafee100-dadd-11e8-89de-d9945baa3a5c.png
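To make the point about a textplot_network() method concrete, here is a rough sketch of how such a wrapper could look (textplot_simil_network() is purely hypothetical, not an existing quanteda function; it assumes quanteda 3.x with quanteda.textstats and quanteda.textplots, that the exported as.fcm() accepts a plain square matrix, and that the thresholds shown are illustrative):
library("quanteda")
library("quanteda.textstats")
library("quanteda.textplots")

# hypothetical helper: plot a similarity matrix as a network,
# keeping only associations above a chosen threshold
textplot_simil_network <- function(sim, threshold = 0.9, ...) {
  simmat <- as.matrix(sim)          # dense square similarity matrix
  simmat[simmat < threshold] <- 0   # drop weak associations
  diag(simmat) <- 0                 # drop self-similarities
  textplot_network(as.fcm(simmat), ...)
}

mt <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 100)
sim <- textstat_proxy(mt, margin = "features")
textplot_simil_network(sim, threshold = 0.95, min_freq = 0.5)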
Hi, I wanted to chime in on this. I agree with Ken that, when thinking about similarity, a heatmap-like visualisation is more intuitive.
However, I think a collocation_network function could be a useful complement to textstat_collocations. The concept of collocation seems to me to lend itself naturally to a spatial representation. From a function design point of view, though, there are some non-trivial challenges - e.g. scalability, replicability of the visualisations, and interactivity. Page 37 of this article has a nice overview: Towards Interactive Multidimensional Visualisations for Corpus Linguistics
Is this something we would like to explore?
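For what it's worth, here is one possible shape for such a collocation network, pieced together from existing functions (collocation_network() itself does not exist; the lambda ordering, the top-30 cut-off, and weighting edges by count are all illustrative choices, and as.fcm() is assumed to accept a plain square matrix):
library("quanteda")
library("quanteda.textstats")
library("quanteda.textplots")

toks <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en"))

# score two-word collocations and keep the most strongly associated ones
colls <- textstat_collocations(toks, size = 2, min_count = 20)
top <- head(colls[order(-colls$lambda), ], 30)

# build an edge between the two words of each collocation, weighted by its count
words <- unique(unlist(strsplit(top$collocation, " ")))
adj <- matrix(0, length(words), length(words), dimnames = list(words, words))
for (i in seq_len(nrow(top))) {
  pair <- strsplit(top$collocation[i], " ")[[1]]
  adj[pair[1], pair[2]] <- adj[pair[1], pair[2]] + top$count[i]
}
textplot_network(as.fcm(adj), min_freq = 0)
Scalability and interactivity would still need thought, but the static version is not far from what textplot_network() already does.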
@kbenoit is there any hacky way to do what you are referring to?
A more common plot would be a heatmap of similarity called textplot_simil() that took a similarity input. This is a pretty common way to plot a matrix of correlations/similarities and easily available in ggplot2.
I wouldn't call it hacky, but the code below works. The simil measures will yield positive values so we would ideally figure out a way to remove the values < 1.0.
library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
simmat <- corpus_subset(data_corpus_inaugural, Year > 1980) %>%
dfm(remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("en")) %>%
textstat_simil() %>%
as.matrix()
simmat[1:5, 1:5]
## 1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
## 1981-Reagan 1.0000000 0.6503200 0.4750618 0.5159960 0.5181002
## 1985-Reagan 0.6503200 1.0000000 0.5043065 0.5558569 0.6074780
## 1989-Bush 0.4750618 0.5043065 1.0000000 0.5037529 0.5311117
## 1993-Clinton 0.5159960 0.5558569 0.5037529 1.0000000 0.5961274
## 1997-Clinton 0.5181002 0.6074780 0.5311117 0.5961274 1.0000000
ggcorrplot::ggcorrplot(simmat, hc.order = TRUE, type = "lower")
corrplot::corrplot.mixed(simmat, order = "hclust", tl.col = "black")
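The diagonal of exact 1.0 self-similarities carries no information and stretches the colour scale, so one option is to drop it before plotting (just a sketch; corrplot's diag = FALSE and masking the diagonal with NA before ggcorrplot are two of several possible choices):
# skip the diagonal entirely in corrplot
corrplot::corrplot(simmat, order = "hclust", tl.col = "black", diag = FALSE)

# or mask the 1.0 self-similarities so ggcorrplot's scale reflects only off-diagonal values
simmat_nodiag <- simmat
diag(simmat_nodiag) <- NA
ggcorrplot::ggcorrplot(simmat_nodiag, hc.order = TRUE, type = "lower")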
Hope Dr. Benoit can offer some guidance on the following feedback --
install.packages("quanteda.textstat") Installing package into ‘C:/Users/jwang/AppData/Local/R/win-library/4.2’ (as ‘lib’ is unspecified) Warning in install.packages : package ‘quanteda.textstat’ is not available for this version of R
A version of this package for your version of R might be available elsewhere, see the ideas at https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
library(quanteda.textstat) Error in library(quanteda.textstat) : there is no package called ‘quanteda.textstat’
I tried different versions of R. None of the attempts worked. Thank you!
My apologies -- I missed the "s" in install.packages("quanteda.textstats").
Regarding the following warning messages --
1: remove_punct, remove_numbers arguments are not used.
2: 'remove' is deprecated; use dfm_remove() instead
I am wondering if Dr. Benoit could provide an example of how to use dfm_remove() to replace remove.
Thank you!
example("dfm_remove", package = "quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> dfm_rm> dfmat <- tokens(c("My Christmas was ruined by your opposition tax plan.",
#> dfm_rm+ "Does the United_States or Sweden have more progressive taxation?")) %>%
#> dfm_rm+ dfm(tolower = FALSE)
#>
#> dfm_rm> dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
#> dfm_rm+ wordsEndingInY = c("by", "my"),
#> dfm_rm+ notintext = "blahblah"))
#>
#> dfm_rm> dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
#> features
#> docs My by United_States Sweden
#> text1 1 1 0 0
#> text2 0 0 1 1
#>
#> dfm_rm> dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.00% sparse) and 0 docvars.
#> features
#> docs by
#> text1 1
#> text2 0
#>
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#> features
#> docs My Christmas was by Does United_States
#> text1 1 1 1 1 0 0
#> text2 0 0 0 0 1 1
#>
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.00% sparse) and 0 docvars.
#> features
#> docs ruined your opposition tax plan . the or Sweden have
#> text1 1 1 1 1 1 1 0 0 0 0
#> text2 0 0 0 0 0 0 1 1 1 1
#> [ reached max_nfeat ... 4 more features ]
#>
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.00% sparse) and 0 docvars.
#> features
#> docs My was by your Does the or have more
#> text1 1 1 1 1 0 0 0 0 0
#> text2 0 0 0 0 1 1 1 1 1
#>
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#> features
#> docs Christmas ruined opposition tax plan . United_States Sweden progressive
#> text1 1 1 1 1 1 1 0 0 0
#> text2 0 0 0 0 0 0 1 1 1
#> features
#> docs taxation
#> text1 0
#> text2 1
#> [ reached max_nfeat ... 1 more feature ]
#>
#> dfm_rm> # select based on character length
#> dfm_rm> dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.00% sparse) and 0 docvars.
#> features
#> docs Christmas ruined opposition United_States Sweden progressive taxation
#> text1 1 1 1 0 0 0 0
#> text2 0 0 0 1 1 1 1
#>
#> dfm_rm> dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
#> dfm_rm+ "No if, and, or but about it: lots of stopwords.")))
#>
#> dfm_rm> dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.89% sparse) and 0 docvars.
#> features
#> docs this is a document with lots of stopwords . no
#> text1 1 1 1 1 1 1 1 1 1 0
#> text2 0 0 0 0 0 1 1 1 1 1
#> [ reached max_nfeat ... 8 more features ]
#>
#> dfm_rm> dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.00% sparse) and 0 docvars.
#> features
#> docs document lots stopwords . , :
#> text1 1 1 1 1 0 0
#> text2 0 1 1 1 2 1
#>
#> dfm_rm> toks <- tokens(c("this contains lots of stopwords",
#> dfm_rm+ "no if, and, or but about it: lots"),
#> dfm_rm+ remove_punct = TRUE)
#>
#> dfm_rm> fcmat <- fcm(toks)
#>
#> dfm_rm> fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#> features
#> features this contains lots of stopwords no if and or but
#> this 0 1 1 1 1 0 0 0 0 0
#> contains 0 0 1 1 1 0 0 0 0 0
#> lots 0 0 0 1 1 1 1 1 1 1
#> of 0 0 0 0 1 0 0 0 0 0
#> stopwords 0 0 0 0 0 0 0 0 0 0
#> no 0 0 0 0 0 0 1 1 1 1
#> if 0 0 0 0 0 0 0 1 1 1
#> and 0 0 0 0 0 0 0 0 1 1
#> or 0 0 0 0 0 0 0 0 0 1
#> but 0 0 0 0 0 0 0 0 0 0
#> [ reached max_feat ... 2 more features, reached max_nfeat ... 2 more features ]
#>
#> dfm_rm> fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#> features
#> features contains lots stopwords
#> contains 0 1 1
#> lots 0 0 1
#> stopwords 0 0 0
Created on 2023-09-17 with reprex v2.0.2
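Applied to the pipeline from earlier in this thread, the current form would look roughly like this (a sketch assuming quanteda 3.x, with textstat_simil() now provided by quanteda.textstats; tokens() takes over the punctuation/number handling and dfm_remove() replaces the deprecated remove argument):
library("quanteda")
library("quanteda.textstats")

simmat <- data_corpus_inaugural %>%
  corpus_subset(Year > 1980) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%  # handled at the tokens stage now
  dfm() %>%
  dfm_remove(stopwords("en")) %>%                         # replaces remove = stopwords("en")
  textstat_simil() %>%
  as.matrix()
Equivalently, the stopwords could be dropped before the dfm step with tokens_remove(stopwords("en")).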
Thank you so much for your timely guidance, Dr. Benoit!
Gratefully,
JJ Wang, Ph.D.