koheiw / seededlda

LDA for semisupervised topic modeling
https://koheiw.github.io/seededlda/

tokens or tokens_ngrams for seededlda, and export to LDAvis #62

Closed hahoangnhan closed 1 year ago

hahoangnhan commented 1 year ago

Hi @koheiw ,

I have two issues on which I hope to get your advice and suggestions.

I - tokens or tokens_ngrams

This is just to make sure I set up my case correctly. Please give me any advice or suggestions.

I have a list of seed keywords that includes both unigrams and n-grams. So, when tokenizing my corpus, I am not sure whether tokens or tokens_ngrams is the more suitable choice. Q1: What should I do if I expect:

  1. Input: seeded unigrams and bigrams. Output: newly discovered keywords that are both unigrams and bigrams.
  2. Input: seeded unigrams and bigrams. Output: newly discovered keywords that are unigrams only?

Q2: If I have n-grams, should I use tokens_compound?

Q3: Do I need to remove stop words when using n-grams?

Please see my example code below.

# load the required packages
library(quanteda)
library(seededlda)
library(tibble) # for as_tibble() used below

# Create a corpus from your text data
my_corpus <- corpus(c("The field of artificial intelligence is expanding with new breakthroughs.",
                      "Data analysis and machine learning are important in extracting insights.",
                      "Cybersecurity measures are crucial to protect sensitive information.",
                      "Health and fitness play a significant role in overall well-being.",
                      "The economy is affected by global market trends and trade policies.",
                      "Education is essential for personal growth and career opportunities.",
                      "Climate change and environmental sustainability are pressing issues.",
                      "Social media platforms are shaping communication and information sharing.",
                      "The entertainment industry is evolving with new digital platforms.",
                      "Urbanization and infrastructure development are transforming cities."))

# Define the seed words for each topic
seed_list <- list(topic1 = c("artificial intelligence", "machine learning", "data analysis"),
                  topic2 = c("cybersecurity", "privacy", "data protection"),
                  topic3 = c("climate change", "sustainability", "environmental impact")) 

dict <- quanteda::dictionary(seed_list)

# Create a document-feature matrix with unigram and bigram features
toks <- tokens(my_corpus, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) |>  
    tokens_select(min_nchar = 2) |> 
    tokens() |> ## I'M NOT SURE HERE ###
    #tokens_ngrams(n = 1:2) |> ## I'M NOT SURE HERE ###
    tokens_compound(dict) # for multi-word expressions

dfmt <- dfm(toks) |>   
    dfm_remove(stopwords('en')) |>   
    dfm_trim(min_termfreq = 0.5, termfreq_type = "quantile", 
             max_docfreq = 0.5, docfreq_type = "prop")

# Run seeded LDA
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)

# extract topic classification for each document 
topics <- topics(slda)

# get most frequent keywords of each topic

keywords <- terms(slda, 20) |> as_tibble()

II - Use LDAvis to visualize and modify keywords, topics

I need to run several rounds of seeded LDA to get clearly separated topics and to combine topics that are too close to each other, so I would like to use LDAvis to inspect the results more intuitively. I tried it and got the error below:

> # Create the interactive visualization using LDAvis
> lda_vis <- createJSON(phi = slda$phi, 
+                       theta = slda$theta,
+                       doc_lengths = slda$doc_lengths,
+                       vocab = slda$vocab,
+                       term_frequency = slda$term_frequency)
Error in createJSON(phi = slda$phi, theta = slda$theta, doc_lengths = slda$doc_lengths,  : 
  Length of doc.length not equal 
      to the number of rows in theta; both should be equal to the number of 
      documents in the data.

Q4: Is it because LDAvis does not work with seededlda output? If it does work, what should I do differently? Are there any alternative visualization tools?

Thank you so much for your time and consideration.

Best,

HHN

koheiw commented 1 year ago

Thanks for the nice example code. Please see my inline comments.

# load the required packages
library(quanteda)
library(seededlda)

# Create a corpus from your text data
my_corpus <- corpus(c("The field of artificial intelligence is expanding with new breakthroughs.",
                      "Data analysis and machine learning are important in extracting insights.",
                      "Cybersecurity measures are crucial to protect sensitive information.",
                      "Health and fitness play a significant role in overall well-being.",
                      "The economy is affected by global market trends and trade policies.",
                      "Education is essential for personal growth and career opportunities.",
                      "Climate change and environmental sustainability are pressing issues.",
                      "Social media platforms are shaping communication and information sharing.",
                      "The entertainment industry is evolving with new digital platforms.",
                      "Urbanization and infrastructure development are transforming cities."))

# Define the seed words for each topic
seed_list <- list(topic1 = c("artificial intelligence", "machine learning", "data analysis"),
                  topic2 = c("cybersecurity", "privacy", "data protection"),
                  topic3 = c("climate change", "sustainability", "environmental impact")) 

dict <- dictionary(seed_list)

# Create a document-feature matrix with unigram and bigram features
toks <- tokens(my_corpus, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) |>  
  tokens_remove(stopwords("en"), min_nchar = 2) |> # remove common words using stopwords 
  #tokens_ngrams(n = 1:2) |> # unigrams of phrases usually appear in the same topic
  tokens_compound(dict) # for multi-word expressions

dfmt <- dfm(toks) |>   
  dfm_remove(stopwords('en')) |>   
  dfm_trim(min_termfreq = 0.5, termfreq_type = "quantile") # max_docfreq might remove too many words

# Run seeded LDA
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE) # consider using batch_size and auto_iter when data is large

# extract topic classification for each document 
topics <- topics(slda)

terms(slda)
#>       topic1                    topic2          topic3          
#>  [1,] "artificial_intelligence" "cybersecurity" "climate_change"
#>  [2,] "data_analysis"           "information"   "sustainability"
#>  [3,] "machine_learning"        "expanding"     "play"          
#>  [4,] "field"                   "new"           "education"     
#>  [5,] "new"                     "crucial"       "essential"     
#>  [6,] "breakthroughs"           "protect"       "personal"      
#>  [7,] "important"               "affected"      "growth"        
#>  [8,] "measures"                "market"        "environmental" 
#>  [9,] "sensitive"               "trade"         "pressing"      
#> [10,] "significant"             "career"        "issues"        
#>       other                    
#>  [1,] "platforms"              
#>  [2,] "extracting"             
#>  [3,] "insights"               
#>  [4,] "health"                 
#>  [5,] "fitness"                
#>  [6,] "economy"                
#>  [7,] "trends"                 
#>  [8,] "policies"               
#>  [9,] "field"                  
#> [10,] "artificial_intelligence"

I have never used LDAvis seriously, but please see if this works. I could add a method for this to my package.


library(LDAvis)
#> Loading required package: LDAvis
lda_vis <- createJSON(phi = slda$phi, 
                      theta = slda$theta,
                      doc.length = ntoken(slda$data),
                      vocab = featnames(slda$data),
                      term.frequency = featfreq(slda$data))
serVis(lda_vis)

Created on 2023-06-06 with reprex v2.0.2
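
For larger corpora, here is a rough sketch of the batch_size and auto_iter options mentioned in the code comment above (the values are only illustrative; see ?textmodel_seededlda for details):

# illustrative settings only: fit on batches of ~1% of the documents in parallel
# and stop the Gibbs sampler automatically once the model has converged
slda_large <- textmodel_seededlda(dfmt, dict, residual = TRUE,
                                  batch_size = 0.01, auto_iter = TRUE)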

hahoangnhan commented 1 year ago

Hi,

Thanks for your comments. It would be great if you could give me some additional advice.

Q1: About bigrams and tokens (# unigrams of phrases usually appear in the same topic) ---> I understand your point. However, sometimes I need to discover new bigrams for other purposes (e.g., to build a dictionary for out-of-sample testing). Can I do that with tokens_ngrams(n = 1:2)?

And if that is the case, is removing stopwords a good choice? We would get some bigrams that did not appear in the original corpus. For example, with a sequence "A stopword B", removing the stopword creates a new bigram "A_B" after tokenizing.
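
To illustrate what I mean, here is a minimal sketch using a sentence from the example corpus (the padding = TRUE variant follows the approach shown later in this thread):

library(quanteda)

# without padding, removing the stopword "of" makes "field" and "artificial"
# adjacent, so tokens_ngrams() produces the artificial bigram "field_artificial"
toks_demo <- tokens("The field of artificial intelligence")
tokens_remove(toks_demo, stopwords("en"), padding = FALSE) |>
    tokens_ngrams(n = 2)

# with padding, an empty token marks each removed word, so downstream functions
# such as textstat_collocations() do not link words across the gap
tokens_remove(toks_demo, stopwords("en"), padding = TRUE)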

I would very much appreciate hearing about your experience.

Q2: About LDAvis, it still doesn't work for me. Do you happen to know of any visualization we can use to understand how good the classification is? For example, the distance between topics would allow us to (1) combine close topics or (2) split dispersed topics.
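
To make Q2 concrete, this is the kind of diagnostic I have in mind; a rough sketch computed directly from the fitted model (just an illustration, not a function of the package):

# cosine similarity between the rows of slda$phi (topic-word distributions);
# highly similar topics are candidates for merging
phi <- slda$phi
phi_norm <- phi / sqrt(rowSums(phi^2)) # normalize each topic row to unit length
round(tcrossprod(phi_norm), 2)         # topics x topics similarity matrix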

Best, HHN

koheiw commented 1 year ago

LDAvis actually worked on my laptop.

[screenshot: LDAvis visualization]

koheiw commented 1 year ago

I almost never use tokens_ngrams() because it creates a lot of junk tokens. Instead, I identify statistically associated bigrams using textstat_collocations().

# load the required packages
library(quanteda)
library(quanteda.textstats)
library(seededlda)

# Create a corpus from your text data
my_corpus <- corpus(c("The field of artificial intelligence is expanding with new breakthroughs.",
                      "Data analysis and machine learning are important in extracting insights.",
                      "Cybersecurity measures are crucial to protect sensitive information.",
                      "Health and fitness play a significant role in overall well-being.",
                      "The economy is affected by global market trends and trade policies.",
                      "Education is essential for personal growth and career opportunities.",
                      "Climate change and environmental sustainability are pressing issues.",
                      "Social media platforms are shaping communication and information sharing.",
                      "The entertainment industry is evolving with new digital platforms.",
                      "Urbanization and infrastructure development are transforming cities."))

# Define the seed words for each topic
seed_list <- list(topic1 = c("artificial intelligence", "machine learning", "data analysis"),
                  topic2 = c("cybersecurity", "privacy", "data protection"),
                  topic3 = c("climate change", "sustainability", "environmental impact")) 

dict <- dictionary(seed_list)

# Create a document-feature matrix with unigram and bigram features
toks <- tokens(my_corpus, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) |>  
  tokens_remove(stopwords("en"), min_nchar = 2, padding = TRUE) # keep padding to avoid finding artificial ngrams

col <- textstat_collocations(toks, min_count = 1) # min_count should be much larger with large data
print(col)
#>                     collocation count count_nested length   lambda        z
#> 1       artificial intelligence     1            0      2 6.180017 2.856992
#> 2          career opportunities     1            0      2 6.180017 2.856992
#> 3                climate change     1            0      2 6.180017 2.856992
#> 4        cybersecurity measures     1            0      2 6.180017 2.856992
#> 5                 data analysis     1            0      2 6.180017 2.856992
#> 6        entertainment industry     1            0      2 6.180017 2.856992
#> 7  environmental sustainability     1            0      2 6.180017 2.856992
#> 8           extracting insights     1            0      2 6.180017 2.856992
#> 9                  fitness play     1            0      2 6.180017 2.856992
#> 10                global market     1            0      2 6.180017 2.856992
#> 11          information sharing     1            0      2 6.180017 2.856992
#> 12   infrastructure development     1            0      2 6.180017 2.856992
#> 13             machine learning     1            0      2 6.180017 2.856992
#> 14                market trends     1            0      2 6.180017 2.856992
#> 15           overall well-being     1            0      2 6.180017 2.856992
#> 16              personal growth     1            0      2 6.180017 2.856992
#> 17              pressing issues     1            0      2 6.180017 2.856992
#> 18            protect sensitive     1            0      2 6.180017 2.856992
#> 19        shaping communication     1            0      2 6.180017 2.856992
#> 20             significant role     1            0      2 6.180017 2.856992
#> 21                 social media     1            0      2 6.180017 2.856992
#> 22               trade policies     1            0      2 6.180017 2.856992
#> 23          transforming cities     1            0      2 6.180017 2.856992
#> 24            digital platforms     1            0      2 5.068904 2.771130
#> 25              media platforms     1            0      2 5.068904 2.771130
#> 26            new breakthroughs     1            0      2 5.068904 2.771130
#> 27                  new digital     1            0      2 5.068904 2.771130
#> 28        sensitive information     1            0      2 5.068904 2.771130

toks <- toks |>
  tokens_compound(col[col$z > 2, ]) |> # z > 2: keep only statistically significant collocations
  tokens_compound(dict) # for multi-word expressions

dfmt <- dmf(toks, remove_padding = TRUE) # do not include paddings

Created on 2023-06-07 with reprex v2.0.2

hahoangnhan commented 1 year ago

Hi @koheiw,

Thanks for the great clarification.

  1. Q1: How can I adjust this line to get both unigrams and bigram collocations, for example with size = c(1:2)? Your col shows only collocations with size = 2.

col <- textstat_collocations(toks, min_count = 1) # min_count should be much larger with large data

  2. Q2: Is this line correct? I think dfm is the right function: dfmt <- dmf(toks, remove_padding = TRUE) # do not include paddings

  3. Q3: What is a good setting for dfm_trim to make sure that words/collocations are typical of each topic and less likely to appear in other topics? I mean min_termfreq, max_termfreq, etc., to add to the chunk below:

    toks <- toks |>
    tokens_compound(col[col$z > 2, ]) |> # z > 2: keep only statistically significant collocations
    tokens_compound(dict) # for multi-word expressions

    LDAvis works well now after the corrections in the code below.

  4. Q4: But there are some collocations with 3 words (3-grams). I have checked and there are no collocations in col with ntoken > 2, yet the seededlda output looks odd (e.g., "protect_sensitive_information"), and in my real data there are even some 4- and 5-grams. I am not sure if I did something wrong?

My corrected code:

# load the required packages
library(quanteda)
library(quanteda.textstats)
library(seededlda)
library(LDAvis)

# Create a corpus from your text data
my_corpus <- corpus(c("The field of artificial intelligence is expanding with new breakthroughs.",
                      "Data analysis and machine learning are important in extracting insights.",
                      "Cybersecurity measures are crucial to protect sensitive information.",
                      "Health and fitness play a significant role in overall well-being.",
                      "The economy is affected by global market trends and trade policies.",
                      "Education is essential for personal growth and career opportunities.",
                      "Climate change and environmental sustainability are pressing issues.",
                      "Social media platforms are shaping communication and information sharing.",
                      "The entertainment industry is evolving with new digital platforms.",
                      "Urbanization and infrastructure development are transforming cities."))

# Define the seed words for each topic
seed_list <- list(topic1 = c("artificial intelligence", "machine learning", "data analysis"),
                  topic2 = c("cybersecurity", "privacy", "data protection"),
                  topic3 = c("climate change", "sustainability", "environmental impact")) 

dict <- dictionary(seed_list)

# Create a document-feature matrix with unigram and bigram features
toks <- tokens(my_corpus, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) |>  
    tokens_remove(stopwords("en"), min_nchar = 2, padding = TRUE) # keep padding to avoid finding artificial ngrams

col <- textstat_collocations(toks, min_count = 1) # min_count should be much larger with large data
print(col)

 # > print(col)
# collocation count count_nested length   lambda        z
# 1       artificial intelligence     1            0      2 6.180017 2.856992
# 2          career opportunities     1            0      2 6.180017 2.856992
# 3                climate change     1            0      2 6.180017 2.856992
# 4        cybersecurity measures     1            0      2 6.180017 2.856992
# 5                 data analysis     1            0      2 6.180017 2.856992
# 6        entertainment industry     1            0      2 6.180017 2.856992
# 7  environmental sustainability     1            0      2 6.180017 2.856992
# 8           extracting insights     1            0      2 6.180017 2.856992
# 9                  fitness play     1            0      2 6.180017 2.856992
# 10                global market     1            0      2 6.180017 2.856992
# 11          information sharing     1            0      2 6.180017 2.856992
# 12   infrastructure development     1            0      2 6.180017 2.856992
# 13             machine learning     1            0      2 6.180017 2.856992
# 14                market trends     1            0      2 6.180017 2.856992
# 15           overall well-being     1            0      2 6.180017 2.856992
# 16              personal growth     1            0      2 6.180017 2.856992
# 17              pressing issues     1            0      2 6.180017 2.856992
# 18            protect sensitive     1            0      2 6.180017 2.856992
# 19        shaping communication     1            0      2 6.180017 2.856992
# 20             significant role     1            0      2 6.180017 2.856992
# 21                 social media     1            0      2 6.180017 2.856992
# 22               trade policies     1            0      2 6.180017 2.856992
# 23          transforming cities     1            0      2 6.180017 2.856992
# 24            digital platforms     1            0      2 5.068904 2.771130
# 25              media platforms     1            0      2 5.068904 2.771130
# 26            new breakthroughs     1            0      2 5.068904 2.771130
# 27                  new digital     1            0      2 5.068904 2.771130
# 28        sensitive information     1            0      2 5.068904 2.771130

toks <- toks |>
    tokens_compound(col[col$z > 2, ]) |> # z > 2: keep only statistically significant collocations
    tokens_compound(dict) # for multi-word expressions

dfmt <- dfm(toks, remove_padding = TRUE) |> # do not include paddings
    dfm_trim(min_termfreq = 0.3, termfreq_type = "quantile") # max_docfreq might remove too many words

# Run seeded LDA
slda <- textmodel_seededlda(dfmt, dict, residual = FALSE) # consider using batch_size and auto_iter when data is large

# extract topic classification for each document 
topics <- topics(slda)

terms(slda)

 # topic1                          topic2                        
# [1,] "artificial_intelligence"       "field"                       
# [2,] "data_analysis"                 "cybersecurity_measures"      
# [3,] "machine_learning"              "health"                      
# [4,] "expanding"                     "economy"                     
# [5,] "new_breakthroughs"             "global_market_trends"        
# [6,] "important"                     "education"                   
# [7,] "protect_sensitive_information" "personal_growth"             
# [8,] "significant_role"              "career_opportunities"        
# [9,] "overall_well-being"            "environmental_sustainability"
# [10,] "shaping_communication"         "pressing_issues"             
# topic3                  
# [1,] "climate_change"        
# [2,] "extracting_insights"   
# [3,] "crucial"               
# [4,] "fitness_play"          
# [5,] "affected"              
# [6,] "trade_policies"        
# [7,] "essential"             
# [8,] "entertainment_industry"
# [9,] "evolving"              
# [10,] "new_digital_platforms" 
lda_vis <- createJSON(phi = slda$phi, 
                      theta = slda$theta,
                      doc.length = ntoken(slda$data),
                      vocab = featnames(slda$data),
                      term.frequency = featfreq(slda$data))
serVis(lda_vis)

The quanteda family offers several great functions (that's why I love it), so more examples illustrating how they work in different contexts would help users get the most out of them.

Thanks, HHN

hahoangnhan commented 1 year ago

Sorry to bother you again; I would like to understand how textstat_collocations works correctly. The code below is from my real data:

# Create a document-feature matrix with unigram and bigram features
toks <- tokens(corpus_sentence_triple, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, remove_url = TRUE) |>  
    tokens_remove(stopwords("en"), min_nchar = 2, padding = TRUE) # keep padding to avoid finding artificial ngrams

col <- textstat_collocations(toks, size = 2, min_count = 1) # min_count should be much larger with large data
print(col)

toks <- toks |>
    tokens_compound(col[col$z > 2, ]) |> # z > 2: keep only statistically significant collocations
    tokens_compound(commitment_top_dict) # for multi-word expressions

dfmt <- dfm(toks, remove_padding = TRUE) |>  # do not include paddings
    dfm_trim(min_termfreq = 0.2, 
             max_termfreq = 0.8,
             termfreq_type = "quantile") 

And I see that the collocations returned by seededlda include n-grams (n = 2, 3, 4), even though I seeded only some unigrams and bigrams and set size = 2 in textstat_collocations(toks, size = 2, min_count = 1). I have checked and there are no collocations with ntoken > 2.

col[ntoken(col$collocation) > 2,]
#[1] collocation  count        count_nested length       lambda       z           
#<0 rows> (or 0-length row.names)

I am not sure why the output looks like that, with some lengthy topic collocations of 3, 4, or even 5 tokens.

Hope to hear from you soon!

Nhan

[screenshot: LDAvis visualization]

koheiw commented 1 year ago

textstat_collocations() is not a function in this package, so please post your question to StackOverflow with the "quanteda" tag.

hahoangnhan commented 1 year ago

Hi @koheiw,

Can you recheck my example? It is not a problem with textstat_collocations, as it generates only collocations with length = 2 (see my check below). But your textmodel_seededlda returns phrases of length 3; for example, "protect_sensitive_information" and "global_market_trends" are returned by terms(slda).

My check:

col[ntoken(col$collocation) > 2,]
#[1] collocation  count        count_nested length       lambda       z           
#<0 rows> (or 0-length row.names)

Thank you in advance!

HHN

# load the required packages
library(quanteda)
library(quanteda.textstats)
library(seededlda)
library(LDAvis)

# Create a corpus from your text data
my_corpus <- corpus(c("The field of artificial intelligence is expanding with new breakthroughs.",
                      "Data analysis and machine learning are important in extracting insights.",
                      "Cybersecurity measures are crucial to protect sensitive information.",
                      "Health and fitness play a significant role in overall well-being.",
                      "The economy is affected by global market trends and trade policies.",
                      "Education is essential for personal growth and career opportunities.",
                      "Climate change and environmental sustainability are pressing issues.",
                      "Social media platforms are shaping communication and information sharing.",
                      "The entertainment industry is evolving with new digital platforms.",
                      "Urbanization and infrastructure development are transforming cities."))

# Define the seed words for each topic
seed_list <- list(topic1 = c("artificial intelligence", "machine learning", "data analysis"),
                  topic2 = c("cybersecurity", "privacy", "data protection"),
                  topic3 = c("climate change", "sustainability", "environmental impact")) 

dict <- dictionary(seed_list)

# Create a document-feature matrix with unigram and bigram features
toks <- tokens(my_corpus, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) |>  
    tokens_remove(stopwords("en"), min_nchar = 2, padding = TRUE) # keep padding to avoid finding artificial ngrams

col <- textstat_collocations(toks, min_count = 1) # min_count should be much larger with large data
print(col)

 # > print(col)
# collocation count count_nested length   lambda        z
# 1       artificial intelligence     1            0      2 6.180017 2.856992
# 2          career opportunities     1            0      2 6.180017 2.856992
# 3                climate change     1            0      2 6.180017 2.856992
# 4        cybersecurity measures     1            0      2 6.180017 2.856992
# 5                 data analysis     1            0      2 6.180017 2.856992
# 6        entertainment industry     1            0      2 6.180017 2.856992
# 7  environmental sustainability     1            0      2 6.180017 2.856992
# 8           extracting insights     1            0      2 6.180017 2.856992
# 9                  fitness play     1            0      2 6.180017 2.856992
# 10                global market     1            0      2 6.180017 2.856992
# 11          information sharing     1            0      2 6.180017 2.856992
# 12   infrastructure development     1            0      2 6.180017 2.856992
# 13             machine learning     1            0      2 6.180017 2.856992
# 14                market trends     1            0      2 6.180017 2.856992
# 15           overall well-being     1            0      2 6.180017 2.856992
# 16              personal growth     1            0      2 6.180017 2.856992
# 17              pressing issues     1            0      2 6.180017 2.856992
# 18            protect sensitive     1            0      2 6.180017 2.856992
# 19        shaping communication     1            0      2 6.180017 2.856992
# 20             significant role     1            0      2 6.180017 2.856992
# 21                 social media     1            0      2 6.180017 2.856992
# 22               trade policies     1            0      2 6.180017 2.856992
# 23          transforming cities     1            0      2 6.180017 2.856992
# 24            digital platforms     1            0      2 5.068904 2.771130
# 25              media platforms     1            0      2 5.068904 2.771130
# 26            new breakthroughs     1            0      2 5.068904 2.771130
# 27                  new digital     1            0      2 5.068904 2.771130
# 28        sensitive information     1            0      2 5.068904 2.771130

toks <- toks |>
    tokens_compound(col[col$z > 2, ]) |> # z > 2: keep only statistically significant collocations
    tokens_compound(dict) # for multi-word expressions

dfmt <- dfm(toks, remove_padding = TRUE) |> # do not include paddings
    dfm_trim(min_termfreq = 0.3, termfreq_type = "quantile") # max_docfreq might remove too many words

# Run seeded LDA
slda <- textmodel_seededlda(dfmt, dict, residual = FALSE) # consider using batch_size and auto_iter when data is large

# extract topic classification for each document 
topics <- topics(slda)

terms(slda)

 # topic1                          topic2                        
# [1,] "artificial_intelligence"       "field"                       
# [2,] "data_analysis"                 "cybersecurity_measures"      
# [3,] "machine_learning"              "health"                      
# [4,] "expanding"                     "economy"                     
# [5,] "new_breakthroughs"             **"global_market_trends"**        
# [6,] "important"                     "education"                   
# [7,] "protect_sensitive_information" "personal_growth"             
# [8,] "significant_role"              "career_opportunities"        
# [9,] "overall_well-being"            "environmental_sustainability"
# [10,] "shaping_communication"         "pressing_issues"             
# topic3                  
# [1,] "climate_change"        
# [2,] "extracting_insights"   
# [3,] "crucial"               
# [4,] "fitness_play"          
# [5,] "affected"              
# [6,] "trade_policies"        
# [7,] "essential"             
# [8,] "entertainment_industry"
# [9,] "evolving"              
# [10,] "new_digital_platforms" 
lda_vis <- createJSON(phi = slda$phi, 
                      theta = slda$theta,
                      doc.length = ntoken(slda$data),
                      vocab = featnames(slda$data),
                      term.frequency = featfreq(slda$data))
serVis(lda_vis)

koheiw commented 1 year ago

Again, this is not about seededlda. You need to set join = FALSE in tokens_compound().

toks <- toks |>
  tokens_compound(col[col$z > 2, ], join = FALSE) |> # join = FALSE: do not merge overlapping collocations into longer compounds
  tokens_compound(dict) # for multi-word expressions
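
For example, a minimal sketch of why the 3-gram appears, using one sentence from the example corpus (the patterns are chosen to mimic the overlapping collocations found above):

library(quanteda)

toks_demo <- tokens("Cybersecurity measures are crucial to protect sensitive information.",
                    remove_punct = TRUE)
pats <- phrase(c("protect sensitive", "sensitive information"))

# the default join = TRUE merges the two overlapping matches into the
# 3-gram "protect_sensitive_information"
tokens_compound(toks_demo, pats)

# join = FALSE compounds the matches without merging overlapping ones
tokens_compound(toks_demo, pats, join = FALSE)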