quanteda / quanteda.textplots

Plotting and visualisation for quanteda
GNU General Public License v3.0

Plot similarity matrix using textplot_network() #7

Open koheiw opened 5 years ago

koheiw commented 5 years ago

If we treat a similarity matrix as a type of adjacency matrix, we can plot a semantic network using textplot_network() in a few steps. Why don't we make this an official function?

require(quanteda)

mt <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords())
mt <- dfm_trim(mt, min_termfreq = 100)
sim <- textstat_proxy(mt, margin = "features")
textplot_network(quanteda:::as.fcm(as(sim, "dgTMatrix")), min_freq = 0.95)

[rplot: network plot of the similarity matrix produced by the code above]

kbenoit commented 5 years ago

Networks typically reflect co-occurrence rather than similarity: are there examples of networks used to show similarity? Either way, it would be easy to write a method for textplot_network() that accepts the return objects from textstat_simil() as inputs.

A more common plot would be a heatmap of similarity, called textplot_simil(), that takes a similarity object as input. This is a pretty common way to plot a matrix of correlations/similarities, and is easily done in ggplot2.
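To make the suggestion concrete, here is a minimal sketch of such a heatmap in ggplot2 (the proposed textplot_simil() itself does not exist yet; this assumes quanteda and ggplot2 are installed, and uses the same corpus as above):

```r
library(quanteda)
library(ggplot2)

# similarity matrix of inaugural addresses
simmat <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords()) %>%
  textstat_simil() %>%
  as.matrix()

# melt the matrix to long format by hand (as.vector() is column-major,
# so row labels vary fastest)
df <- data.frame(
  doc1  = rep(rownames(simmat), times = ncol(simmat)),
  doc2  = rep(colnames(simmat), each = nrow(simmat)),
  simil = as.vector(simmat)
)

ggplot(df, aes(doc1, doc2, fill = simil)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```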


jiongweilua commented 5 years ago

Hi, I wanted to chime in on this. I agree with Ken that when thinking about similarity, a heatmap-like visualisation is more intuitive to me.

However, I think a collocation_network function could be a useful complement to textstat_collocations. The concept of collocation lends itself naturally to a spatial expression. From a function-design point of view, though, there are some non-trivial challenges, e.g. scalability, replicability of visualisations, and interactivity. Page 37 of this article has a nice overview: Towards Interactive Multidimensional Visualisations for Corpus Linguistics

Is this something we would like to explore?

randomgambit commented 5 years ago

@kbenoit is there any hacky way to do what you are referring to?

A more common plot would be a heatmap of similarity called textplot_simil() that took a similarity input. This is a pretty common way to plot a matrix of correlations/similarities and easily available in ggplot2.

kbenoit commented 5 years ago

I wouldn't call it hacky; the code below works. The simil measures yield only positive values, so ideally we would figure out a way to remove the values < 1.0.

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

simmat <- corpus_subset(data_corpus_inaugural, Year > 1980) %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("en")) %>%
  textstat_simil() %>%
  as.matrix()
simmat[1:5, 1:5]
##              1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
## 1981-Reagan    1.0000000   0.6503200 0.4750618    0.5159960    0.5181002
## 1985-Reagan    0.6503200   1.0000000 0.5043065    0.5558569    0.6074780
## 1989-Bush      0.4750618   0.5043065 1.0000000    0.5037529    0.5311117
## 1993-Clinton   0.5159960   0.5558569 0.5037529    1.0000000    0.5961274
## 1997-Clinton   0.5181002   0.6074780 0.5311117    0.5961274    1.0000000

ggcorrplot::ggcorrplot(simmat, hc.order = TRUE, type = "lower")


corrplot::corrplot.mixed(simmat, order = "hclust", tl.col = "black")
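Since the similarity values are all positive, one possible tweak (a sketch using corrplot's is.corr argument, not something tested in this thread) is to tell corrplot the input is not a correlation matrix, so the colour scale spans the observed range rather than [-1, 1]:

```r
# treat simmat as a general matrix: colour scale covers the observed
# positive range instead of the default [-1, 1] correlation scale
corrplot::corrplot(simmat, is.corr = FALSE, order = "hclust", tl.col = "black")
```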

wang93312 commented 9 months ago

Hope Dr. Benoit can offer some guidance on the following feedback --

install.packages("quanteda.textstat")
## Installing package into 'C:/Users/jwang/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## Warning in install.packages :
##   package 'quanteda.textstat' is not available for this version of R
##
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

library(quanteda.textstat)
## Error in library(quanteda.textstat) : there is no package called 'quanteda.textstat'

I tried different versions of R. None of the attempts worked. Thank you!

wang93312 commented 9 months ago

My apologies -- I missed the "s" in install.packages("quanteda.textstats").

Regarding the following warning messages:

1: remove_punct, remove_numbers arguments are not used.
2: 'remove' is deprecated; use dfm_remove() instead

I am wondering if Dr. Benoit could provide an example on how to use dfm_remove() to replace remove.

Thank you!

kbenoit commented 9 months ago
example("dfm_remove", package = "quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> dfm_rm> dfmat <- tokens(c("My Christmas was ruined by your opposition tax plan.",
#> dfm_rm+                "Does the United_States or Sweden have more progressive taxation?")) %>%
#> dfm_rm+     dfm(tolower = FALSE)
#> 
#> dfm_rm> dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
#> dfm_rm+                         wordsEndingInY = c("by", "my"),
#> dfm_rm+                         notintext = "blahblah"))
#> 
#> dfm_rm> dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.00% sparse) and 0 docvars.
#>        features
#> docs    by
#>   text1  1
#>   text2  0
#> 
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have
#>   text1      1    1          1   1    1 1   0  0      0    0
#>   text2      0    0          0   0    0 0   1  1      1    1
#> [ reached max_nfeat ... 4 more features ]
#> 
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation
#>   text1        0
#>   text2        1
#> [ reached max_nfeat ... 1 more feature ]
#> 
#> dfm_rm> # select based on character length
#> dfm_rm> dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1
#> 
#> dfm_rm> dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
#> dfm_rm+                       "No if, and, or but about it: lots of stopwords.")))
#> 
#> dfm_rm> dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.89% sparse) and 0 docvars.
#>        features
#> docs    this is a document with lots of stopwords . no
#>   text1    1  1 1        1    1    1  1         1 1  0
#>   text2    0  0 0        0    0    1  1         1 1  1
#> [ reached max_nfeat ... 8 more features ]
#> 
#> dfm_rm> dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.00% sparse) and 0 docvars.
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1
#> 
#> dfm_rm> toks <- tokens(c("this contains lots of stopwords",
#> dfm_rm+                  "no if, and, or but about it: lots"),
#> dfm_rm+                remove_punct = TRUE)
#> 
#> dfm_rm> fcmat <- fcm(toks)
#> 
#> dfm_rm> fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#>            features
#> features    this contains lots of stopwords no if and or but
#>   this         0        1    1  1         1  0  0   0  0   0
#>   contains     0        0    1  1         1  0  0   0  0   0
#>   lots         0        0    0  1         1  1  1   1  1   1
#>   of           0        0    0  0         1  0  0   0  0   0
#>   stopwords    0        0    0  0         0  0  0   0  0   0
#>   no           0        0    0  0         0  0  1   1  1   1
#>   if           0        0    0  0         0  0  0   1  1   1
#>   and          0        0    0  0         0  0  0   0  1   1
#>   or           0        0    0  0         0  0  0   0  0   1
#>   but          0        0    0  0         0  0  0   0  0   0
#> [ reached max_feat ... 2 more features, reached max_nfeat ... 2 more features ]
#> 
#> dfm_rm> fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#>            features
#> features    contains lots stopwords
#>   contains         0    1         1
#>   lots             0    0         1
#>   stopwords        0    0         0

Created on 2023-09-17 with reprex v2.0.2
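In other words, the deprecated dfm(remove = ...) call can be rewritten along these lines (a sketch of the current idiom, assuming quanteda >= 3): tokenise first, build the dfm, then drop stopwords with dfm_remove().

```r
library(quanteda)

# tokenise, build the dfm, then remove stopwords as a separate step
dfmat <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  dfm() %>%
  dfm_remove(stopwords("en"))
```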

wang93312 commented 9 months ago

Thank you so much for your timely guidance, Dr. Benoit!

Gratefully,

JJ Wang

