njtierney / syn

syn - the thesaurus
http://syn.njtierney.com/

add vignette demonstrating application of syn in text analysis #22

Open njtierney opened 5 years ago

njtierney commented 5 years ago

I imagine that there would be some application of deriving synonyms of words to assist in parts of text analysis, but I do not work in this area. Perhaps @juliasilge @dgrtwo or @kbenoit might have an idea?

kbenoit commented 5 years ago

It would be a really interesting application to have a "dictionary" of synonyms and then to use a function such as quanteda::tokens_lookup() to convert the synonym matches to their "key". With such a dictionary, that function would work out of the box.

However, to apply this to a series of tokens, there would need to be some priority rules about conversion to avoid cycling or indeterminacy, maybe based on frequency. So great -> good, terrific -> good, but a match for good does not become great. There would also need to be a way to choose which word to select from a list of multiple synonyms. Frequency is probably the best criterion.
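
A minimal base-R sketch of such a frequency-based priority rule. The synonym groups and the frequencies are made-up toy data, and `canonical_word()` is a hypothetical helper, not part of syn or quanteda. Each word is replaced by the most frequent member of its synonym set, so replacements only ever move toward more frequent words:

```r
# Hypothetical toy data: synonym groups and corpus frequencies (made up).
synonyms <- list(
  good     = c("great", "terrific"),
  great    = c("good", "terrific"),
  terrific = c("good", "great")
)
word_freq <- c(good = 120, great = 45, terrific = 3)

# Pick the most frequent member of {word, its synonyms}.
# A word with a known frequency is only replaced by a strictly more
# frequent word (ties keep the word itself), so replacements cannot cycle.
canonical_word <- function(word, synonyms, word_freq) {
  candidates <- unique(c(word, synonyms[[word]]))
  candidates <- candidates[candidates %in% names(word_freq)]
  if (length(candidates) == 0) return(word)  # unknown word: leave as-is
  candidates[which.max(word_freq[candidates])]
}

tokens <- c("great", "terrific", "good", "weather")
converted <- vapply(tokens, canonical_word, character(1),
                    synonyms = synonyms, word_freq = word_freq,
                    USE.NAMES = FALSE)
print(converted)
```

With this toy data, "great" and "terrific" both become "good", while "good" and the unknown word "weather" are left unchanged. In practice the frequencies would have to come from a reference corpus, which is an open design choice here.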

Package looks great!

njtierney commented 5 years ago

Thanks for your thoughts, @kbenoit :)

I'm not quite sure how to avoid the cycling you described, so that, for example, good doesn't become great. I'm also not sure how to select from a list of multiple synonyms: there are about 600 synonyms for "cool", and some of them are pretty wacky, like "Buddha-like composure".

Now, on to your note about a "dictionary" of synonyms, I had a fiddle with the dictionary function in quanteda below. I have to preface this with the fact that I don't really know what I'm doing with text analysis, but I did manage to get it to work without errors. But it might not really make sense here!

library(syn)
library(quanteda)
#> Package version: 1.3.14
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View

mycorpus <- corpus_subset(data_corpus_inaugural, Year>1900)
mydict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxing = "taxing",
                          taxation = "taxation",
                          taxregex = "tax*",
                          country = "america"))

mydict_syn <- dictionary(syns(c("christmas",
                                "opposition",
                                "taxing",
                                "taxation",
                                "country")))

head(dfm(mycorpus, dictionary = mydict))
#> Document-feature matrix of: 6 documents, 6 features (63.9% sparse).
#> 6 x 6 sparse Matrix of class "dfm"
#>                 features
#> docs             christmas opposition taxing taxation taxregex country
#>   1901-McKinley          0          2      0        1        1       0
#>   1905-Roosevelt         0          0      0        0        0       0
#>   1909-Taft              0          1      0        4        6       4
#>   1913-Wilson            0          0      0        1        1       0
#>   1917-Wilson            0          0      0        0        0       2
#>   1921-Harding           0          0      0        1        2      15

head(dfm(mycorpus, dictionary = mydict_syn))
#> Document-feature matrix of: 6 documents, 5 features (23.3% sparse).
#> 6 x 5 sparse Matrix of class "dfm"
#>                 features
#> docs             christmas opposition taxing taxation country
#>   1901-McKinley          0          2      7        1      17
#>   1905-Roosevelt         0          0      3        1      12
#>   1909-Taft              0         12     22        9      32
#>   1913-Wilson            0          3      7        4      14
#>   1917-Wilson            0         12      9        4      17
#>   1921-Harding           0         14     25        5      25

# subset a dictionary
mydict[1:2]
#> Dictionary object with 2 key entries.
#> - [christmas]:
#>   - christmas, santa, holiday
#> - [opposition]:
#>   - opposition, reject, notincorpus
mydict[c("christmas", "opposition")]
#> Dictionary object with 2 key entries.
#> - [christmas]:
#>   - christmas, santa, holiday
#> - [opposition]:
#>   - opposition, reject, notincorpus
mydict[["opposition"]]
#> [1] "opposition"  "reject"      "notincorpus"

# subset the synonym dictionary
mydict_syn[1:2]
#> Dictionary object with 2 key entries.
#> - [christmas]:
#> - [opposition]:
#>   - adversary, adversity, agreement to disagree, alienation, allegory, analogy, antagonism, antagonist, antagonistic, anteposition, antipathy, antithesis, antithetical, apostasy, argumentation, arrest, arrestation, arrestment, assailant, at daggers drawn, averseness, aversion, backlash, backwardness, balancing, ban, blackball, blackballing, blockage, blocking, challenge, check, clashing, clogging, closing up, closure, collision, combatant, combative reaction, comparative anatomy, comparative degree, comparative grammar, comparative judgment, comparative linguistics, comparative literature, comparative method, compare, comparing, comparison, competing, competition, competitive, competitor, complaint, con, conflict, conflicting, confrontation, confrontment, confutation, constriction, contention, contradiction, contradistinction, contraindication, contraposition, contrariety, contrast, contrastiveness, controversy, correlation, counter-culture, counteraction, counterposition, counterworking, cramp, crankiness, cross-purposes, crotchetiness, cursoriness, defiance, delay, demur, departure, detainment, detention, deviation, difference, dim view, disaccord, disaccordance, disagreement, disappointment, disapprobation, disapproval, disconformity, discongruity, discontent, discontentedness, discontentment, discord, discordance, discordancy, discrepancy, discreteness, disenchantment, disesteem, disfavor, disgruntlement, disharmony, disillusion, disillusionment, disinclination, disobedience, disparity, displeasure, dispute, disrelish, disrespect, dissatisfaction, dissension, dissent, dissentience, dissidence, dissimilarity, dissonance, distaste, distinction, distinctiveness, distinctness, disunion, disunity, divergence, divergency, diversity, dropping out, enemy, exclusion, faction, far cry, fixation, flak, foe, foeman, foot-dragging, fractiousness, friction, grudging consent, grudgingness, hampering, heterogeneity, hindering, hindrance, holdback, holdup, hostile, 
hostility, impediment, in opposition, inaccordance, incompatibility, incongruity, inconsistency, inconsonance, indignation, indisposedness, indisposition, indocility, inequality, inharmoniousness, inharmony, inhibition, inimicalness, interference, interruption, intractableness, irreconcilability, jarring, kick, lack of enthusiasm, lack of zeal, let, likening, low estimation, low opinion, matching, metaphor, minority opinion, mixture, mutinousness, negation, negativism, nolition, nonagreement, nonassent, nonconcurrence, nonconformity, nonconsent, noncooperation, nuisance value, objection, obstinacy, obstruction, obstructionism, occlusion, odds, opponent, opposed, opposing, opposing party, opposite camp, oppositeness, opposition, opposure, oppugnance, oppugnancy, ostracism, other side, otherness, parallelism, passive resistance, perfunctoriness, perverseness, perversity, polar opposition, polarity, polarization, posing against, proportion, protest, reaction, rebuff, recalcitrance, recalcitrancy, recalcitration, recoil, recusance, recusancy, refractoriness, refusal, rejection, relation, reluctance, renitence, renitency, repellence, repellency, repercussion, repression, repudiation, repugnance, repulse, repulsion, resistance, restraint, restriction, retardation, retardment, revolt, rival, secession, separateness, setback, showdown, simile, similitude, slowness, squeeze, stand, stranglehold, stricture, stubbornness, sulk, sulkiness, sulks, sullenness, suppression, swimming upstream, the loyal opposition, the opposition, thumbs-down, trope of comparison, unconformity, uncooperativeness, underground, unenthusiasm, unfriendliness, unhappiness, unharmoniousness, unlikeness, unorthodoxy, unwillingness, variance, variation, variegation, variety, weighing, withdrawal, withstanding
mydict_syn[c("christmas", "opposition")]
#> Dictionary object with 2 key entries.
#> - [christmas]:
#> - [opposition]:
#>   - adversary, adversity, agreement to disagree, alienation, allegory, analogy, antagonism, antagonist, antagonistic, anteposition, antipathy, antithesis, antithetical, apostasy, argumentation, arrest, arrestation, arrestment, assailant, at daggers drawn, averseness, aversion, backlash, backwardness, balancing, ban, blackball, blackballing, blockage, blocking, challenge, check, clashing, clogging, closing up, closure, collision, combatant, combative reaction, comparative anatomy, comparative degree, comparative grammar, comparative judgment, comparative linguistics, comparative literature, comparative method, compare, comparing, comparison, competing, competition, competitive, competitor, complaint, con, conflict, conflicting, confrontation, confrontment, confutation, constriction, contention, contradiction, contradistinction, contraindication, contraposition, contrariety, contrast, contrastiveness, controversy, correlation, counter-culture, counteraction, counterposition, counterworking, cramp, crankiness, cross-purposes, crotchetiness, cursoriness, defiance, delay, demur, departure, detainment, detention, deviation, difference, dim view, disaccord, disaccordance, disagreement, disappointment, disapprobation, disapproval, disconformity, discongruity, discontent, discontentedness, discontentment, discord, discordance, discordancy, discrepancy, discreteness, disenchantment, disesteem, disfavor, disgruntlement, disharmony, disillusion, disillusionment, disinclination, disobedience, disparity, displeasure, dispute, disrelish, disrespect, dissatisfaction, dissension, dissent, dissentience, dissidence, dissimilarity, dissonance, distaste, distinction, distinctiveness, distinctness, disunion, disunity, divergence, divergency, diversity, dropping out, enemy, exclusion, faction, far cry, fixation, flak, foe, foeman, foot-dragging, fractiousness, friction, grudging consent, grudgingness, hampering, heterogeneity, hindering, hindrance, holdback, holdup, hostile, 
hostility, impediment, in opposition, inaccordance, incompatibility, incongruity, inconsistency, inconsonance, indignation, indisposedness, indisposition, indocility, inequality, inharmoniousness, inharmony, inhibition, inimicalness, interference, interruption, intractableness, irreconcilability, jarring, kick, lack of enthusiasm, lack of zeal, let, likening, low estimation, low opinion, matching, metaphor, minority opinion, mixture, mutinousness, negation, negativism, nolition, nonagreement, nonassent, nonconcurrence, nonconformity, nonconsent, noncooperation, nuisance value, objection, obstinacy, obstruction, obstructionism, occlusion, odds, opponent, opposed, opposing, opposing party, opposite camp, oppositeness, opposition, opposure, oppugnance, oppugnancy, ostracism, other side, otherness, parallelism, passive resistance, perfunctoriness, perverseness, perversity, polar opposition, polarity, polarization, posing against, proportion, protest, reaction, rebuff, recalcitrance, recalcitrancy, recalcitration, recoil, recusance, recusancy, refractoriness, refusal, rejection, relation, reluctance, renitence, renitency, repellence, repellency, repercussion, repression, repudiation, repugnance, repulse, repulsion, resistance, restraint, restriction, retardation, retardment, revolt, rival, secession, separateness, setback, showdown, simile, similitude, slowness, squeeze, stand, stranglehold, stricture, stubbornness, sulk, sulkiness, sulks, sullenness, suppression, swimming upstream, the loyal opposition, the opposition, thumbs-down, trope of comparison, unconformity, uncooperativeness, underground, unenthusiasm, unfriendliness, unhappiness, unharmoniousness, unlikeness, unorthodoxy, unwillingness, variance, variation, variegation, variety, weighing, withdrawal, withstanding
head(mydict_syn[["opposition"]])
#> [1] "adversary"             "adversity"             "agreement to disagree"
#> [4] "alienation"            "allegory"              "analogy"
tail(mydict_syn[["opposition"]])
#> [1] "variation"    "variegation"  "variety"      "weighing"    
#> [5] "withdrawal"   "withstanding"

Created on 2018-11-28 by the reprex package (v0.2.1)

dgrtwo commented 5 years ago

I don't know too much about this either, but my first instinct upon seeing the package was that to work with tidytext I'd first arrange the synonyms as a tidy dataset.

library(dplyr)
library(tibble)
library(tidyr)

word_synonyms <- tibble::enframe(syn:::words_idx) %>%
  unnest(value) %>%
  transmute(word = name,
            synonym = syn:::all_words[value])

For instance, this allows us to find the most common synonym for each word (common defined as "being a synonym to many words"). This helps solve Nick's question about choosing one synonym for each word for a dictionary.

# Find the most common synonym for each
most_common_synonyms <- word_synonyms %>%
  add_count(synonym) %>%
  arrange(desc(n)) %>%
  distinct(word, .keep_all = TRUE) %>%
  arrange(word) %>%
  select(-n)

If we wanted to be a bit silly, we could then replace every word in a text with the most common synonym, turning "Sense and Sensibility" into "Point and Note".

library(janeaustenr)
library(tidytext)

austen_books() %>%
  unnest_tokens(word, text) %>%
  left_join(most_common_synonyms, by = "word") %>%
  mutate(synonym = coalesce(synonym, word))
# A tibble: 725,055 x 3
   book                word        synonym
   <fct>               <chr>       <chr>  
 1 Sense & Sensibility sense       point  
 2 Sense & Sensibility and         and    
 3 Sense & Sensibility sensibility note   
 4 Sense & Sensibility by          round  
 5 Sense & Sensibility jane        jane   
 6 Sense & Sensibility austen      austen 
 7 Sense & Sensibility 1811        1811   
 8 Sense & Sensibility chapter     point  
 9 Sense & Sensibility 1           1      
10 Sense & Sensibility the         the    
# ... with 725,045 more rows

I don't know if there's a vignette to be made here; this would just be the direction I'd go in tidying syn.

njtierney commented 5 years ago

Thanks for that @dgrtwo ! :)

kbenoit commented 5 years ago

Following @njtierney 's example above, I experimented a bit myself and found a few issues that I illustrate below. I use here the language I've adopted for text analysis dictionaries, in terms of the key (the target word whose synonyms are retrieved) and its values (the synonyms retrieved).

If the use case is to convert value matches to a key, say in order to simplify vocabulary by reducing synonyms to their canonical concepts - like "good" or "bad" - then there are going to be a lot of overlapping matches. For instance:

library("syn")
library("quanteda")
## Package version: 1.3.16
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

syndict <- dictionary(list(
  good = syn("good"),
  excellent = syn("excellent")
))

# show the overlap
lapply(syndict, head, 20)
## $good
##  [1] "able to pay"        "absolutely"         "acceptable"        
##  [4] "accomplished"       "according to hoyle" "ace"               
##  [7] "actual"             "adept"              "adequate"          
## [10] "admirable"          "admissible"         "adroit"            
## [13] "advantage"          "advantageous"       "advisable"         
## [16] "affable"            "affectionate"       "agreeable"         
## [19] "all right"          "all-knowing"       
## 
## $excellent
##  [1] "a cut above"   "above"         "adept"         "admirable"    
##  [5] "adroit"        "advantageous"  "aesthetic"     "aggrandized"  
##  [9] "ahead"         "al"            "apotheosized"  "apt"          
## [13] "artistic"      "ascendant"     "attic"         "auspicious"   
## [17] "authoritative" "awesome"       "bang-up"       "banner"
syndict[["good"]][180:190]
##  [1] "even"          "evenhanded"    "everlasting"   "exactly"      
##  [5] "excellent"     "exemplary"     "expedient"     "expert"       
##  [9] "exquisite"     "extensive"     "extraordinary"
syndict[["excellent"]][80:100]
##  [1] "extraordinary" "fair"          "famous"        "fancy"        
##  [5] "fantastic"     "favorable"     "fine"          "finer"        
##  [9] "first-class"   "first-rate"    "first-string"  "glorified"    
## [13] "good"          "goodish"       "goodly"        "graceful"     
## [17] "grade a"       "grand"         "great"         "greater"      
## [21] "handy"

Here, we see that "good" is a synonym of "excellent", and vice versa. In general, these are very long entries in the thesaurus, so we have lots of overlaps and linkages. Almost certainly, this thesaurus is too inclusive. "according to hoyle" is a synonym of "good"? 🤔

This means that if we use the entries as a dictionary, we will get multiple matches. Here, the token "Good" (the first sentence of the text) is replaced by both keys, "GOOD" and "EXCELLENT", and most of the other terms produce similar double matches. Some priority rule is needed.

txt <- "Good? It's fantastic, great, awesome, even excellent!"
toks <- tokens(txt)

tokens_lookup(toks, syndict, exclusive = FALSE, capkeys = TRUE)
## tokens from 1 document.
## text1 :
##  [1] "GOOD"      "EXCELLENT" "?"         "It's"      "GOOD"     
##  [6] "EXCELLENT" ","         "GOOD"      "EXCELLENT" ","        
## [11] "EXCELLENT" ","         "GOOD"      "GOOD"      "EXCELLENT"
## [16] "!"

There are no doubt other use cases, but at least this illustrates a problem to be solved, as well as showing how the current thesaurus is too inclusive (but even a more restricted one will not solve the first problem, since there will always be overlap). But resolving #1 might improve things a lot.
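
One possible priority rule, sketched in base R under stated assumptions: the dictionary below is a made-up toy list (not real syn output), and `resolve_overlap()` is a hypothetical helper. Each value is kept only under the first key, in list order, that claims it, so every token maps to at most one key:

```r
# Hypothetical toy dictionary with overlapping values (not real syn output).
toy_dict <- list(
  good      = c("good", "fine", "great", "excellent"),
  excellent = c("excellent", "great", "awesome", "good")
)

# Keep each value only under the first key (in list order) that claims it.
resolve_overlap <- function(dict) {
  seen <- character(0)
  for (key in names(dict)) {
    dict[[key]] <- setdiff(dict[[key]], seen)
    seen <- union(seen, dict[[key]])
  }
  dict
}

resolved <- resolve_overlap(toy_dict)

# Build a value -> key lookup and apply it to some lower-cased tokens.
lookup <- setNames(rep(names(resolved), lengths(resolved)),
                   unlist(resolved, use.names = FALSE))
tokens <- c("good", "it's", "great", "awesome", "excellent")
matched <- tokens %in% names(lookup)
converted <- tokens
converted[matched] <- toupper(unname(lookup[tokens[matched]]))
print(converted)
```

Here the order of the keys acts as the priority, so "excellent" itself ends up under "good". A frequency-based ordering of the keys would be another option, and keys left with no values would probably need to be dropped before building a dictionary from the result.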

Great project!