kbenoit / wordshoal

quanteda implementation of the Lauderdale and Herzog (2016) "Wordshoal" model

Wordshoal robustness to arguments #5

Open kbenoit opened 6 years ago

kbenoit commented 6 years ago

Issue by kwainfan Monday Jul 10, 2017 at 22:47 GMT Originally opened as https://github.com/kbenoit/quanteda/issues/845


I am getting an error when I try to run a wordshoal model.

> t88<-readtext(file="~/hoc.corpus88.csv",text_field = "text",docvarsfrom = "filenames")
Read 63956 rows and 12 (of 12) columns from 0.073 GB file in 00:00:05
> 
> corp88<- corpus(t88)
> 
> dfm88<-dfm(corp88,remove=c(stopwords("english"),removePunct=TRUE),stem = TRUE)
> 
> shoaltm<- textmodel_wordshoal(dfm88,groups=docvars(dfm88,c("party","country")),authors=docvars(dfm88,"name"))
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

data: hoc.corpus88.zip

kbenoit commented 6 years ago

Comment by kbenoit Tuesday Jul 11, 2017 at 07:46 GMT


You have two problems here: a syntax error in the dfm() call, and an issue with your groups.

t88 <- readtext::readtext(file = "~/Downloads/hoc.corpus88.zip",
                          text_field = "text",
                          docvarsfrom = "filenames")
corp88 <- corpus(t88)
dfm88 <- dfm(corp88, 
             remove = c(stopwords("english")), 
             remove_punct = TRUE, 
             stem = TRUE)

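The syntax problem in your original call was a misplaced parenthesis: removePunct=TRUE ended up inside c() together with the stopwords, instead of being passed to dfm() as its own argument (written remove_punct in the corrected call above).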
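To see what the original dfm() call actually passed as `remove`, here is a minimal base R sketch. The two-word `stops` vector is a made-up stand-in for stopwords("english"); everything else is plain c() behaviour.

```r
# Hypothetical stand-in for stopwords("english"), kept tiny for illustration.
stops <- c("the", "and")

# What the original call built: the logical TRUE is swallowed by c(),
# coerced to the string "TRUE", and carried along as a named element.
bad <- c(stops, removePunct = TRUE)

# What was intended: the stopwords alone, with punctuation removal
# passed separately as an argument of dfm().
good <- c(stops)

print(bad)
print(good)
```

So the punctuation option never reached dfm() at all; it became one more entry in the list of features to remove.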

Note also that in your wordshoal call you need to combine the two docvars into one, as per below, for the groups, and you need to use the correct variable name for the author field.

head(docvars(dfm88))
#                    V1 X       date session speechnumber      speaker party chair terms parliament country
# hoc.corpus88.csv.1  1 1 1988-11-22 1988-89            1 HOUSESPEAKER other  TRUE  1068     UK-HoC    <NA>
# hoc.corpus88.csv.2  2 2 1988-11-22 1988-89            2 HOUSESPEAKER other  TRUE    61     UK-HoC    <NA>
# hoc.corpus88.csv.3  3 3 1988-11-22 1988-89            3   Giles Shaw   Con FALSE  2514     UK-HoC    <NA>
# hoc.corpus88.csv.4  4 4 1988-11-22 1988-89            4  John Maples   Con FALSE  1490     UK-HoC England
# hoc.corpus88.csv.5  5 5 1988-11-22 1988-89            5 Neil Kinnock   Lab FALSE  2775     UK-HoC   Wales
# hoc.corpus88.csv.6  6 6 1988-11-22 1988-89            6 David Harris   Con FALSE     1     UK-HoC    <NA>
#                         docvar1
# hoc.corpus88.csv.1 hoc.corpus88
# hoc.corpus88.csv.2 hoc.corpus88
# hoc.corpus88.csv.3 hoc.corpus88
# hoc.corpus88.csv.4 hoc.corpus88
# hoc.corpus88.csv.5 hoc.corpus88
# hoc.corpus88.csv.6 hoc.corpus88

shoaltm <- textmodel_wordshoal(dfm88, 
                               groups = interaction(docvars(dfm88, c("party", "country"))), 
                               authors = docvars(dfm88, "speaker"))
# Error in textmodel_wordshoal.dfm(dfm88, groups = interaction(docvars(dfm88,  : 
#   only a single case for the following groups: 
# DUP.England
# SDLP.England
# SNP.England
# UPUP.England
# UUP.England
# Con.Northern Ireland
# Lab.Northern Ireland
# LibDem.Northern Ireland
# other.Northern Ireland
# PlaidCymru.Northern Ireland
# SDP.Northern Ireland
# SNP.Northern Ireland
# DUP.Scotland
# PlaidCymru.Scotland
# SDLP.Scotland
# SDP.Scotland
# UPUP.Scotland
# UUP.Scotland
# DUP.Wales
# PlaidCymru.Wales
# SDLP.Wales
# SDP.Wales
# SNP.Wales
# UPUP.Wales
# UUP.Wales 

Unfortunately, some of these groups have too few authors (the error lists the ones with only a single case), so you need to drop them. You can do this before creating the dfm using corpus_subset(), or afterwards by index slicing the dfm.
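One way to find the groups to drop can be sketched in base R. The `dv` data frame below is an invented stand-in for docvars(dfm88), and the threshold of two cases per group is an assumption based on the wording of the error message, not on the package source.

```r
# Toy stand-in for docvars(dfm88); the values are invented for illustration.
dv <- data.frame(
  speaker = c("A", "B", "C", "D", "E"),
  party   = c("Con", "Con", "Lab", "Lab", "SNP"),
  country = c("England", "England", "Wales", "Wales", "England"),
  stringsAsFactors = FALSE
)

# Combine the two docvars into a single grouping factor;
# drop = TRUE discards party/country combinations that never occur.
grp <- interaction(dv$party, dv$country, drop = TRUE)

# Count documents per group and keep only groups with at least two cases
# (assumed threshold, see the note above).
n_per_group <- table(grp)
keep <- grp %in% names(n_per_group)[n_per_group >= 2]

# Subset the docvars and the grouping factor with the same logical vector.
dv_kept  <- dv[keep, ]
grp_kept <- droplevels(grp[keep])

print(n_per_group)
print(dv_kept)
```

With the real objects, the same logical vector could be used for the index slicing mentioned above, for example dfm88[keep, ], passing grp_kept as the groups and the matching subset of speakers as the authors. Rows whose party or country is NA get an NA group and are dropped by this filter.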

The person to make textmodel_wordshoal() more robust to these issues is @lauderdale and I am hoping he'll get to them this summer at some point.

kbenoit commented 6 years ago

Comment by methodds Wednesday Aug 09, 2017 at 17:24 GMT


Did you change anything in wordshoal during the last few quanteda versions? I'm getting a lot of warnings for a corpus that did not produce them before:

......Warning: The algorithm did not converge..............Warning: The algorithm did not converge.
20 .................Warning: The algorithm did not converge..
.40 ...................60 ...................80 .....
Warning: The algorithm did not converge...............
100 .....Warning: The algorithm did not converge..Warning: 
The algorithm did not converge..............120 ............Warning: The algorithm did not converge........

...

If you didn't change anything: is it possible that changes to dfm() affect wordshoal in an unintended way?

kbenoit commented 6 years ago

Comment by kbenoit Wednesday Aug 09, 2017 at 20:23 GMT


We added a warning when the algorithm reached the iteration limit in the Wordfish routine that it calls, but otherwise the behaviour should be the same. textmodel_wordshoal() remains an experimental function, but I am hoping that @lauderdale will devote some time soon to making it more robust.