juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:
https://juliasilge.github.io/tidytext/

error on tidy() for STM estimateEffect object #166

Closed kathy-j-lee closed 4 years ago

kathy-j-lee commented 4 years ago

Thanks for this package!

Posting in hopes there might be a quick answer to a problem I've been running into. I've done a fair amount of searching but haven't come across an answer (I tried deleting .Rhistory and .RData, but it didn't help).

I consistently get the following error when running tidy() on the estimateEffect object: Error in object$parameters[[topic]] : subscript out of bounds

Apologies that I don't have a fully reproducible example, but any tips or pointers would be greatly appreciated! Thanks!

m.p <- stm(out.p$documents,
           out.p$vocab,
           data = out.p$meta,
           seed = 94114,
           K = 0,  # with spectral init, K = 0 lets stm choose the number of topics
           init.type = "Spectral")

# topics is a vector of topic numbers (ints), length 14
ee.s <- estimateEffect(topics[1] ~ reviewer.question,
                       m.p,
                       metadata = out.p$meta,
                       uncertainty = "Global")

tidy(ee.s)
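A quick sanity check, since K = 0 lets the spectral algorithm choose the number of topics: confirm that topics[1] actually indexes a fitted topic before tidying (a minimal sketch, using the theta matrix that stm models carry):

ncol(m.p$theta)               # number of topics stm actually chose
topics[1] <= ncol(m.p$theta)  # should be TRUE before estimateEffect()/tidy()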
juliasilge commented 4 years ago

Hello @kathy-j-lee! πŸ‘‹ I am sorry you are having trouble; that sounds frustrating. I just worked up this example, and I believe things are working as expected:

library(tidyverse)
library(tidytext)
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com
library(janeaustenr)

books <- austen_books() %>%
  group_by(book) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, book, chapter, remove = FALSE)

austen_sparse <- books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"

topic_model <- stm(
  austen_sparse, 
  K = 6,
  init.type = "Spectral",
  verbose = FALSE
)

summary(topic_model)
#> A topic model with 6 topics, 269 documents and a 13908 word dictionary.
#> Topic 1 Top Words:
#>       Highest Prob: elizabeth, darcy, bennet, jane, miss, bingley, time 
#>       FREX: darcy, bennet, bingley, wickham, collins, lydia, lizzy 
#>       Lift: offenses, entailed, ponies, bennets, corps, deigned, phillips's 
#>       Score: darcy, elizabeth, bennet, bingley, jane, wickham, lydia 
#> Topic 2 Top Words:
#>       Highest Prob: emma, miss, harriet, weston, knightley, elton, jane 
#>       FREX: emma, weston, knightley, elton, woodhouse, fairfax, churchill 
#>       Lift: bangs, broadway, brunswick, cleverer, curtseys, delicately, drizzle 
#>       Score: emma, weston, knightley, elton, woodhouse, harriet, fairfax 
#> Topic 3 Top Words:
#>       Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland 
#>       FREX: tilney, thorpe, morland, allen, eleanor, northanger, fullerton 
#>       Lift: heroic, france, gloucestershire, lid, thorpes, tilney, tilneys 
#>       Score: tilney, catherine, thorpe, morland, allen, isabella, eleanor 
#> Topic 4 Top Words:
#>       Highest Prob: elinor, marianne, time, sister, dashwood, mother, edward 
#>       FREX: elinor, marianne, dashwood, jennings, willoughby, brandon, ferrars 
#>       Lift: tumbling, waistcoat, westward, resembled, riches, enquire, margaret's 
#>       Score: elinor, marianne, dashwood, jennings, willoughby, lucy, brandon 
#> Topic 5 Top Words:
#>       Highest Prob: anne, captain, elliot, lady, wentworth, sir, charles 
#>       FREX: elliot, wentworth, walter, russell, musgrove, uppercross, kellynch 
#>       Lift: 1760, 1784, 1785, 1787, 1789, 1791, 1800 
#>       Score: elliot, wentworth, anne, walter, russell, musgrove, louisa 
#> Topic 6 Top Words:
#>       Highest Prob: fanny, crawford, miss, sir, edmund, time, thomas 
#>       FREX: crawford, edmund, bertram, norris, rushworth, mansfield, julia 
#>       Lift: bertrams, edmund's, _daughters_, _miss, adequately, attic, baronet's 
#>       Score: crawford, fanny, edmund, thomas, bertram, rushworth, norris

chapters <- books %>%
  group_by(document) %>% 
  summarize(text = str_c(text, collapse = " ")) %>%
  ungroup() %>%
  inner_join(books %>%
               distinct(document, book))
#> Joining, by = "document"

chapters
#> # A tibble: 269 x 3
#>    document text                                                           book 
#>    <chr>    <chr>                                                          <fct>
#>  1 Emma_1   "CHAPTER I   Emma Woodhouse, handsome, clever, and rich, with… Emma 
#>  2 Emma_10  "CHAPTER X   Though now the middle of December, there had yet… Emma 
#>  3 Emma_11  "CHAPTER XI   Mr. Elton must now be left to himself. It was n… Emma 
#>  4 Emma_12  "CHAPTER XII   Mr. Knightley was to dine with them--rather ag… Emma 
#>  5 Emma_13  "CHAPTER XIII   There could hardly be a happier creature in t… Emma 
#>  6 Emma_14  "CHAPTER XIV   Some change of countenance was necessary for e… Emma 
#>  7 Emma_15  "CHAPTER XV   Mr. Woodhouse was soon ready for his tea; and w… Emma 
#>  8 Emma_16  "CHAPTER XVI   The hair was curled, and the maid sent away, a… Emma 
#>  9 Emma_17  "CHAPTER XVII   Mr. and Mrs. John Knightley were not detained… Emma 
#> 10 Emma_18  "CHAPTER XVIII   Mr. Frank Churchill did not come. When the t… Emma 
#> # … with 259 more rows

effects <- estimateEffect(1:3 ~ book, topic_model, chapters)

summary(effects)
#> 
#> Call:
#> estimateEffect(formula = 1:3 ~ book, stmobj = topic_model, metadata = chapters)
#> 
#> 
#> Topic 1:
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            0.017611   0.024589   0.716    0.475    
#> bookPride & Prejudice  0.792547   0.042627  18.593   <2e-16 ***
#> bookMansfield Park    -0.002047   0.035036  -0.058    0.953    
#> bookEmma               0.009849   0.035917   0.274    0.784    
#> bookNorthanger Abbey   0.002644   0.039941   0.066    0.947    
#> bookPersuasion         0.026252   0.045608   0.576    0.565    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Topic 2:
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            0.019051   0.020107   0.948    0.344    
#> bookPride & Prejudice  0.001541   0.026667   0.058    0.954    
#> bookMansfield Park    -0.004227   0.026910  -0.157    0.875    
#> bookEmma               0.880291   0.033919  25.953   <2e-16 ***
#> bookNorthanger Abbey   0.003975   0.032122   0.124    0.902    
#> bookPersuasion        -0.006641   0.030269  -0.219    0.827    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Topic 3:
#> 
#> Coefficients:
#>                       Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)           0.018046   0.021297   0.847    0.398    
#> bookPride & Prejudice 0.003139   0.028206   0.111    0.911    
#> bookMansfield Park    0.017119   0.033818   0.506    0.613    
#> bookEmma              0.001544   0.031178   0.050    0.961    
#> bookNorthanger Abbey  0.870453   0.047399  18.365   <2e-16 ***
#> bookPersuasion        0.079599   0.050629   1.572    0.117    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tidy(effects)
#> # A tibble: 18 x 6
#>    topic term                  estimate std.error statistic  p.value
#>    <int> <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
#>  1     1 (Intercept)            0.0176     0.0243    0.722  4.71e- 1
#>  2     1 bookPride & Prejudice  0.793      0.0428   18.5    1.33e-49
#>  3     1 bookMansfield Park    -0.00215    0.0351   -0.0612 9.51e- 1
#>  4     1 bookEmma               0.00973    0.0358    0.272  7.86e- 1
#>  5     1 bookNorthanger Abbey   0.00273    0.0402    0.0680 9.46e- 1
#>  6     1 bookPersuasion         0.0262     0.0455    0.575  5.66e- 1
#>  7     2 (Intercept)            0.0189     0.0202    0.938  3.49e- 1
#>  8     2 bookPride & Prejudice  0.00185    0.0267    0.0692 9.45e- 1
#>  9     2 bookMansfield Park    -0.00424    0.0268   -0.158  8.74e- 1
#> 10     2 bookEmma               0.880      0.0345   25.6    3.16e-73
#> 11     2 bookNorthanger Abbey   0.00464    0.0324    0.143  8.86e- 1
#> 12     2 bookPersuasion        -0.00617    0.0302   -0.205  8.38e- 1
#> 13     3 (Intercept)            0.0178     0.0214    0.833  4.06e- 1
#> 14     3 bookPride & Prejudice  0.00370    0.0285    0.130  8.97e- 1
#> 15     3 bookMansfield Park     0.0172     0.0339    0.509  6.11e- 1
#> 16     3 bookEmma               0.00166    0.0313    0.0529 9.58e- 1
#> 17     3 bookNorthanger Abbey   0.871      0.0473   18.4    3.01e-49
#> 18     3 bookPersuasion         0.0804     0.0505    1.59   1.12e- 1

Created on 2020-03-07 by the reprex package (v0.3.0)

Can you try to put together a small reproducible example to demonstrate what is leading to this problem?

kathy-j-lee commented 4 years ago

Thank you for the example, and apologies for the delay in responding. I ended up wrapping up the project using tidystm::extract.estimateEffect(), but I suspect it's a bug(?) in stm's summary method for estimateEffect objects; summary.estimateEffect() can't seem to process estimateEffect objects with more than 10 topics...
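In case it helps others, a sketch of that workaround, assuming tidystm's documented interface (the package lives on GitHub at mikajoh/tidystm rather than CRAN):

# remotes::install_github("mikajoh/tidystm")
library(tidystm)
extract.estimateEffect(ee.s, covariate = "reviewer.question",
                       model = m.p, method = "pointestimate")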

juliasilge commented 4 years ago

Hmmm, it looks to me like everything works fine for models with more than 10 topics:

library(tidyverse)
library(tidytext)
library(janeaustenr)
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com

books <- austen_books() %>%
    group_by(book) %>%
    mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
    ungroup() %>%
    filter(chapter > 0) %>%
    unite(document, book, chapter, remove = FALSE)

austen_sparse <- books %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    count(document, word) %>%
    cast_sparse(document, word, n)
#> Joining, by = "word"

topic_model <- stm(
    austen_sparse, 
    K = 12,
    init.type = "Spectral",
    verbose = FALSE
)

summary(topic_model)
#> A topic model with 12 topics, 269 documents and a 13908 word dictionary.
#> Topic 1 Top Words:
#>       Highest Prob: bennet, elizabeth, bingley, jane, miss, dear, darcy 
#>       FREX: bennet, bingley, lizzy, lucas, collins, netherfield, daughters 
#>       Lift: develop, morris, rightful, vexing, adhered, bass, crushing 
#>       Score: bennet, bingley, elizabeth, darcy, collins, jane, lizzy 
#> Topic 2 Top Words:
#>       Highest Prob: emma, weston, knightley, miss, harriet, time, elton 
#>       FREX: weston, knightley, martin, hartfield, randalls, emma, weston's 
#>       Lift: bangs, broadway, cleverer, curtseys, delicately, drizzle, hannah 
#>       Score: emma, weston, knightley, elton, harriet, woodhouse, hartfield 
#> Topic 3 Top Words:
#>       Highest Prob: catherine, tilney, miss, time, isabella, thorpe, morland 
#>       FREX: northanger, thorpe, tilney, eleanor, morland, isabella, allen 
#>       Lift: cursed, blaize, edifice, average, putney, convent, cabinet 
#>       Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor 
#> Topic 4 Top Words:
#>       Highest Prob: fanny, time, sir, mansfield, edmund, thomas, house 
#>       FREX: susan, betsey, mansfield, portsmouth, price, norris, thomas 
#>       Lift: wimpole, _daughters_, _miss, adequately, bertram_, bewailing, depressing 
#>       Score: fanny, edmund, mansfield, susan, bertram, crawford, thomas 
#> Topic 5 Top Words:
#>       Highest Prob: elizabeth, darcy, miss, lady, bingley, collins, catherine 
#>       FREX: rosings, darcy, fitzwilliam, collins, darcy's, de, ladyship 
#>       Lift: _persuasion_, appertain, blots, cheating, converting, expostulation, grantley's 
#>       Score: darcy, elizabeth, bingley, collins, bennet, wickham, catherine 
#> Topic 6 Top Words:
#>       Highest Prob: fanny, crawford, miss, sir, edmund, thomas, time 
#>       FREX: grant, julia, thomas, crawford, sotherton, yates, rushworth 
#>       Lift: _rencontre_, accidents, ague, cheeses, chit, circuitous, coop 
#>       Score: crawford, fanny, edmund, bertram, thomas, rushworth, norris 
#> Topic 7 Top Words:
#>       Highest Prob: anne, captain, elliot, lady, wentworth, sir, charles 
#>       FREX: elliot, walter, russell, uppercross, kellynch, lyme, henrietta 
#>       Lift: 1760, 1784, 1785, 1787, 1789, 1791, 1800 
#>       Score: elliot, wentworth, anne, walter, russell, captain, musgrove 
#> Topic 8 Top Words:
#>       Highest Prob: miss, emma, jane, fairfax, weston, elton, knightley 
#>       FREX: fairfax, bates, campbell, cole, dixon, weston, churchill 
#>       Lift: patty, _joint_, 7th, baly, beaufet, checker, craig 
#>       Score: emma, fairfax, jane, weston, elton, knightley, woodhouse 
#> Topic 9 Top Words:
#>       Highest Prob: elinor, marianne, time, sister, edward, dashwood, miss 
#>       FREX: marianne, willoughby, brandon, palmer, elinor, jennings, ferrars 
#>       Lift: allenham, assigned, authors, gloominess, hardily, mohrs, mosquitoes 
#>       Score: elinor, marianne, jennings, dashwood, willoughby, lucy, edward 
#> Topic 10 Top Words:
#>       Highest Prob: harriet, emma, miss, elton, woodhouse, dear, time 
#>       FREX: harriet, charade, harriet's, highbury, woodhouse, elton, martin 
#>       Lift: charade, gipsies, ahead, ajar, angles, beet, blinder 
#>       Score: harriet, emma, elton, woodhouse, harriet's, knightley, charade 
#> Topic 11 Top Words:
#>       Highest Prob: elizabeth, jane, wickham, lydia, time, letter, sister 
#>       FREX: lydia, gardiner, wickham, lydia's, forster, brighton, longbourn 
#>       Lift: regiment's, hackney, gardiners, abound, achieving, bewailed, caroline's 
#>       Score: wickham, elizabeth, lydia, darcy, bennet, jane, gardiner 
#> Topic 12 Top Words:
#>       Highest Prob: fanny, miss, crawford, edmund, time, catherine, read 
#>       FREX: chapel, reading, udolpho, allen, chain, clergyman, journal 
#>       Lift: pulpit, assembling, mysteries, _if_, _possible_, cave, champion 
#>       Score: fanny, crawford, edmund, catherine, allen, rushworth, morland

chapters <- books %>%
    group_by(document) %>% 
    summarize(text = str_c(text, collapse = " ")) %>%
    ungroup() %>%
    inner_join(books %>%
                   distinct(document, book))
#> Joining, by = "document"

chapters
#> # A tibble: 269 x 3
#>    document text                                                           book 
#>    <chr>    <chr>                                                          <fct>
#>  1 Emma_1   "CHAPTER I   Emma Woodhouse, handsome, clever, and rich, with… Emma 
#>  2 Emma_10  "CHAPTER X   Though now the middle of December, there had yet… Emma 
#>  3 Emma_11  "CHAPTER XI   Mr. Elton must now be left to himself. It was n… Emma 
#>  4 Emma_12  "CHAPTER XII   Mr. Knightley was to dine with them--rather ag… Emma 
#>  5 Emma_13  "CHAPTER XIII   There could hardly be a happier creature in t… Emma 
#>  6 Emma_14  "CHAPTER XIV   Some change of countenance was necessary for e… Emma 
#>  7 Emma_15  "CHAPTER XV   Mr. Woodhouse was soon ready for his tea; and w… Emma 
#>  8 Emma_16  "CHAPTER XVI   The hair was curled, and the maid sent away, a… Emma 
#>  9 Emma_17  "CHAPTER XVII   Mr. and Mrs. John Knightley were not detained… Emma 
#> 10 Emma_18  "CHAPTER XVIII   Mr. Frank Churchill did not come. When the t… Emma 
#> # … with 259 more rows

effects <- estimateEffect(1:3 ~ book, topic_model, chapters)

summary(effects)
#> 
#> Call:
#> estimateEffect(formula = 1:3 ~ book, stmobj = topic_model, metadata = chapters)
#> 
#> 
#> Topic 1:
#> 
#> Coefficients:
#>                         Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            0.0131655  0.0312294   0.422    0.674    
#> bookPride & Prejudice  0.2753304  0.0461246   5.969 7.68e-09 ***
#> bookMansfield Park    -0.0005447  0.0451641  -0.012    0.990    
#> bookEmma               0.0017084  0.0434592   0.039    0.969    
#> bookNorthanger Abbey   0.0039354  0.0494124   0.080    0.937    
#> bookPersuasion         0.0138348  0.0561217   0.247    0.805    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Topic 2:
#> 
#> Coefficients:
#>                         Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            0.0156841  0.0315116   0.498    0.619    
#> bookPride & Prejudice  0.0003968  0.0419536   0.009    0.992    
#> bookMansfield Park    -0.0027083  0.0446814  -0.061    0.952    
#> bookEmma               0.4149668  0.0519175   7.993 4.18e-14 ***
#> bookNorthanger Abbey  -0.0038303  0.0498493  -0.077    0.939    
#> bookPersuasion        -0.0055070  0.0542854  -0.101    0.919    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Topic 3:
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)           1.107e-02  2.187e-02   0.506    0.613    
#> bookPride & Prejudice 4.297e-03  2.962e-02   0.145    0.885    
#> bookMansfield Park    2.513e-03  3.109e-02   0.081    0.936    
#> bookEmma              3.151e-03  3.153e-02   0.100    0.920    
#> bookNorthanger Abbey  7.011e-01  5.973e-02  11.738   <2e-16 ***
#> bookPersuasion        1.204e-05  3.670e-02   0.000    1.000    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tidy(effects)
#> # A tibble: 18 x 6
#>    topic term                   estimate std.error statistic  p.value
#>    <int> <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
#>  1     1 (Intercept)            0.0133      0.0308    0.430  6.67e- 1
#>  2     1 bookPride & Prejudice  0.275       0.0456    6.04   5.30e- 9
#>  3     1 bookMansfield Park    -0.00120     0.0442   -0.0272 9.78e- 1
#>  4     1 bookEmma               0.00111     0.0433    0.0257 9.80e- 1
#>  5     1 bookNorthanger Abbey   0.00354     0.0496    0.0713 9.43e- 1
#>  6     1 bookPersuasion         0.0141      0.0558    0.253  8.01e- 1
#>  7     2 (Intercept)            0.0156      0.0317    0.493  6.22e- 1
#>  8     2 bookPride & Prejudice  0.000426    0.0417    0.0102 9.92e- 1
#>  9     2 bookMansfield Park    -0.00287     0.0449   -0.0639 9.49e- 1
#> 10     2 bookEmma               0.415       0.0526    7.90   7.87e-14
#> 11     2 bookNorthanger Abbey  -0.00325     0.0502   -0.0647 9.48e- 1
#> 12     2 bookPersuasion        -0.00574     0.0539   -0.106  9.15e- 1
#> 13     3 (Intercept)            0.0114      0.0219    0.519  6.04e- 1
#> 14     3 bookPride & Prejudice  0.00388     0.0297    0.131  8.96e- 1
#> 15     3 bookMansfield Park     0.00229     0.0309    0.0742 9.41e- 1
#> 16     3 bookEmma               0.00310     0.0314    0.0988 9.21e- 1
#> 17     3 bookNorthanger Abbey   0.701       0.0601   11.7    1.37e-25
#> 18     3 bookPersuasion         0.000420    0.0370    0.0113 9.91e- 1

Created on 2020-03-26 by the reprex package (v0.3.0)

But I am glad to hear that you found another way to solve your problem!

ghkoo commented 2 years ago

Hi Julia,

I followed your code but encountered this error: Error in qr.lm(thetasims[, k], qx) : number of covariate observations does not match number of docs.

In my dataset (full_data), I have a column named partisan_media (my covariate), where 1 = conservative and 2 = liberal. I grouped the full_article column into these two groups (liberal/conservative). I have 7 topics, so I put 1:7. The error occurred when running estimateEffect().

My code is:

partisan_media1 <- full_data %>%
    group_by(partisan_media, full_article) %>%
    summarize(full_article = str_c(full_article)) %>%
    ungroup() %>%
    inner_join(full_data %>%
                   distinct(partisan_media))

partisan_media1

effects <- estimateEffect(1:7 ~ partisan_media, news_stm, partisan_media1)

Do you have any idea why this error is occurring? I would appreciate any comments or suggestions. Thank you!

juliasilge commented 2 years ago

@kgh21 Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. I also recommend opening a new issue, and not commenting on an old, closed issue.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")
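Once it is installed, the typical pattern (a minimal sketch) is to copy your problem code to the clipboard and then render it with:

reprex::reprex()

That produces nicely formatted output you can paste straight into a GitHub comment.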

Thanks! πŸ™Œ

ghkoo commented 2 years ago

Thank you for your reply and for your help! I just installed reprex and pasted my output here:


library(tidyverse)
library(stm)

full_data <- read_csv("~/Desktop/UT/eight.csv")
#> Rows: 4042 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): publish_date, media_name, full_article
#> dbl (2): number, partisan_media
#> 
#> β„Ή Use `spec()` to retrieve the full column specification for this data.
#> β„Ή Specify the column types or set `show_col_types = FALSE` to quiet this message.

options(scipen = 999)
news_processed <- textProcessor(
    full_data$full_article,
    metadata = full_data,
    customstopwords = c("said", "don't", "will", "like", "use", "can", "'re",
                        "one", "get", "know", "new", "told", "accord", "don’t",
                        "’re", "according", "show", "say", "people", "report",
                        "vaccine", "vaccinate", "vaccination", "coronavirus",
                        "covid-19", "COVID", "just", "want", "think", "now",
                        "make", "time", "come", "back", "say", "see", "read"),
    lowercase = TRUE,
    striphtml = TRUE
)
#> Building corpus... 
#> Converting to Lower Case... 
#> Removing punctuation... 
#> Removing stopwords... 
#> Remove Custom Stopwords...
#> Removing numbers... 
#> Stemming... 
#> Creating Output...

# lower.thresh = 20: any word appearing in fewer than 20 documents is
# automatically excluded from the analysis
out <- prepDocuments(news_processed$documents, news_processed$vocab,
                     news_processed$meta, lower.thresh = 20)
#> Removing 77084 of 81563 terms (158955 of 844653 tokens) due to frequency 
#> Your corpus now has 3206 documents, 4479 terms and 685698 tokens.
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

news_stm <- stm(documents = out$documents, vocab = out$vocab,
                K = 7,  # the number of topics we want
                max.em.its = 80,
                data = out$meta,
                init.type = "Spectral",
                seed = 100)

colnames(full_data)
#> [1] "number"         "publish_date"   "media_name"     "partisan_media"
#> [5] "full_article"
table(full_data$partisan_media)
#> 
#>    1    2 
#> 1866 2174

partisan_media1 <- full_data %>%
    group_by(partisan_media, full_article) %>% 
    summarize(full_article = str_c(full_article)) %>%
    ungroup() %>%
    inner_join(full_data %>%
                   distinct(partisan_media))
#> `summarise()` has grouped output by 'partisan_media', 'full_article'. You can override using the `.groups` argument.
#> Joining, by = "partisan_media"

summary(partisan_media1)
#>  partisan_media  full_article      
#>  Min.   :1.000   Length:4042       
#>  1st Qu.:1.000   Class :character  
#>  Median :2.000   Mode  :character  
#>  Mean   :1.538                     
#>  3rd Qu.:2.000                     
#>  Max.   :2.000                     
#>  NA's   :2

effects <- estimateEffect(1:7 ~ partisan_media, news_stm, partisan_media1)
#> Error in qr.lm(thetasims[, k], qx): number of covariate observations does not match number of docs

Created on 2021-11-30 by the reprex package (v2.0.1)

juliasilge commented 2 years ago

@kgh21 Unfortunately we can't tell much from what you have pasted here, because we don't have access to your data; all we have to go on is the error itself: "number of covariate observations does not match number of docs". It sounds like the metadata you are passing in as covariates doesn't cover the same number of documents as the model was trained on.
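Looking at your reprex, one likely source of the mismatch: prepDocuments() reduced your corpus to 3206 documents, but partisan_media1 still has 4042 rows (including 2 NAs). As a minimal sketch, assuming the objects from your reprex, passing the metadata that prepDocuments() returned keeps the covariates aligned with the fitted model:

# out$meta stays row-aligned with the documents the model was fit on
effects <- estimateEffect(1:7 ~ partisan_media, news_stm, metadata = out$meta)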

ghkoo commented 2 years ago

Aha, thanks for the tip. I will make sure to open a new issue. I think I just solved my problem by taking a different approach, using the topicLasso() function to include a predictor. Thank you!
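For reference, a rough sketch of that alternative with the objects from the reprex above (the exact arguments are approximate; see ?stm::topicLasso): topicLasso() works in the reverse direction of estimateEffect(), treating the topic proportions as predictors of a document-level variable.

# illustrative only; topicLasso() fits a lasso (via glmnet) of the outcome
# on the fitted topic proportions
topicLasso(partisan_media ~ 1, data = out$meta, stmobj = news_stm)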

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.