quanteda / quanteda.tidy

Tidyverse extensions for quanteda

How to left join docvars with those in an existing corpus #7

Open conjugateprior opened 7 years ago

conjugateprior commented 7 years ago

This works

> str(inaugCorpus) # but deprecated
List of 4
 $ documents:'data.frame':  58 obs. of  1 variable:
  ..$ texts: chr [1:58] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the o"| __truncated__ "When it was first perceived, in early times, that no middle course for America remained between unlimited submission to a forei"| __truncated__ "Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the first executive office of our country, I avail mysel"| __truncated__ ...
 $ metadata :'data.frame':  58 obs. of  1 variable:
  ..$ Year: num [1:58] 1789 1793 1797 1801 1805 ...
 $ settings :'data.frame':  58 obs. of  1 variable:
  ..$ President: chr [1:58] "Washington" "Washington" "Adams" "Jefferson" ...
 $ tokens   :'data.frame':  58 obs. of  1 variable:
  ..$ FirstName: chr [1:58] "George" "George" "John" "Thomas" ...
 - attr(*, "class")= chr [1:2] "corpus" "list"

but this doesn't

> str(corpus(data_char_inaugural))
Error in `[[.corpus`(object, 1L) : 
  cannot index docvars this way because none exist

apparently because there are no docvars

> str(corpus(data_char_inaugural, docvars = docvars(inaugCorpus)))
List of 4
 $ documents:'data.frame':  58 obs. of  1 variable:
  ..$ texts: chr [1:58] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the o"| __truncated__ "When it was first perceived, in early times, that no middle course for America remained between unlimited submission to a forei"| __truncated__ "Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the first executive office of our country, I avail mysel"| __truncated__ ...
 $ metadata :'data.frame':  58 obs. of  1 variable:
  ..$ Year: num [1:58] 1789 1793 1797 1801 1805 ...
 $ settings :'data.frame':  58 obs. of  1 variable:
  ..$ President: chr [1:58] "Washington" "Washington" "Adams" "Jefferson" ...
 $ tokens   :'data.frame':  58 obs. of  1 variable:
  ..$ FirstName: chr [1:58] "George" "George" "John" "Thomas" ...
 - attr(*, "class")= chr [1:2] "corpus" "list"

Seems like it should be possible to make a docvar-free corpus though.

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readtext_0.2.9000 quanteda_0.9.9-24

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9         lattice_0.20-34     deldir_0.1-12      
 [4] png_0.1-7           class_7.3-14        gtools_3.5.0       
 [7] digest_0.6.12       foreach_1.4.3       V8_1.2             
[10] R6_2.2.0            plyr_1.8.4          tmap_1.8-1         
[13] stats4_3.3.2        coda_0.19-1         e1071_1.6-8        
[16] httr_1.2.1          spdep_0.6-9         curl_2.3           
[19] data.table_1.10.0   gdata_2.17.0        geosphere_1.5-5    
[22] raster_2.5-8        gmodels_2.16.2      R.utils_2.5.0      
[25] R.oo_1.21.0         Matrix_1.2-7.1      splines_3.3.2      
[28] webshot_0.4.0       rgdal_1.2-5         htmlwidgets_0.8    
[31] RCurl_1.95-4.8      munsell_0.4.3       rmapshaper_0.1.0   
[34] tmaptools_1.2       rgeos_0.3-22        htmltools_0.3.5    
[37] codetools_0.2-15    mapview_1.2.0       XML_3.98-1.5       
[40] viridisLite_0.1.3   MASS_7.3-45         bitops_1.0-6       
[43] R.methodsS3_1.7.1   grid_3.3.2          nlme_3.1-128       
[46] jsonlite_1.2        satellite_0.2.0     magrittr_1.5       
[49] scales_0.4.1        RcppParallel_4.3.20 KernSmooth_2.23-15 
[52] stringi_1.1.2       LearnBayes_2.15     leaflet_1.0.1      
[55] sp_1.2-4            ca_0.64             latticeExtra_0.6-28
[58] boot_1.3-18         fastmatch_1.1-0     osmar_1.1-7        
[61] RColorBrewer_1.1-2  iterators_1.0.8     tools_3.3.2        
[64] gdalUtils_2.0.1.7   dichromat_2.0-0     colorspace_1.3-2   
[67] classInt_0.1-23    

kbenoit commented 7 years ago

Thanks. More generally (and basically):

str(corpus("this is my single document"))
## Error in `[[.corpus`(object, 1L) : 
##  cannot index docvars this way because none exist 

kbenoit commented 7 years ago

But keep in mind this, from ?corpus:

A warning on accessing corpus elements

A corpus currently consists of an S3 specially classed list of elements, but you should not access these elements directly. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change (as it inevitably will as we continue to develop the package, including moving corpus objects to the S4 class system).

😉

kbenoit commented 7 years ago

@conjugateprior Refresh with the latest GitHub version and try it now.

conjugateprior commented 7 years ago

FYI I was calling str() in the first place so that I could sketch out a corpus_merge_docvars function, something I have now needed several times and hacked around (something like tmaptools::append_data). If you're planning such a function, let me know and I won't duplicate the work.

In the meantime I'll wait until the innards settle down.

kbenoit commented 7 years ago

Have you seen the + and c methods for the corpus class? Might be what you are after.

corpus1 <- corpus_subset(data_corpus_inaugural, President == "Bush")
corpus2 <- corpus_subset(data_corpus_inaugural, President == "Clinton")
docvars(corpus2, "newvar") <- "Added to Clinton"
corpus3 <- corpus_subset(data_corpus_inaugural, President == "Obama")
docvars(corpus3, "newvar") <- "Added to Obama"

docvars(c(corpus1, corpus2, corpus3))
##              Year President FirstName           newvar
## 1989-Bush    1989      Bush    George             <NA>
## 2001-Bush    2001      Bush George W.             <NA>
## 2005-Bush    2005      Bush George W.             <NA>
## 1993-Clinton 1993   Clinton      Bill Added to Clinton
## 1997-Clinton 1997   Clinton      Bill Added to Clinton
## 2009-Obama   2009     Obama    Barack   Added to Obama
## 2013-Obama   2013     Obama    Barack   Added to Obama

docvars(corpus2 + corpus3)
##              Year President FirstName           newvar
## 1993-Clinton 1993   Clinton      Bill Added to Clinton
## 1997-Clinton 1997   Clinton      Bill Added to Clinton
## 2009-Obama   2009     Obama    Barack   Added to Obama
## 2013-Obama   2013     Obama    Barack   Added to Obama

If not, consider a PR that operates using accessor functions (try methods(class = "corpus") for a list), or just describe what you are looking for and we could add it.

conjugateprior commented 7 years ago

Definitely not + or c.

As with the SpatialPolygonsDataFrame function I linked to above, the problem is this: I have hand-constructed document metadata in a data.frame, possibly incomplete or overcomplete, and I want to (left) join it with a corpus object via a key that is a corpus docvar on the left side and a regular data.frame column on the right side.

Currently it seems one must have the external metadata go in column by column and hope it lines up with the exact ordering of documents in the corpus. This has bitten me several times already. Hence the desire for a merge-like function rather than a cbind-like function to do that.
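The ordering hazard described here can be reproduced with plain data frames. What follows is a minimal base R sketch, not quanteda code: `docs` stands in for a corpus's docvars and `meta` for hand-built metadata, and both objects, the `id` key, and all values are made up for illustration.

```r
# Stand-in for corpus docvars, in corpus order (hypothetical data)
docs <- data.frame(id = c("d1", "d2", "d3"),
                   year = c(1989, 1993, 2009),
                   stringsAsFactors = FALSE)

# Hand-built metadata: different order, one document missing, one extra row
meta <- data.frame(id = c("d3", "d1", "d9"),
                   label = c("third", "first", "stray"),
                   stringsAsFactors = FALSE)

# cbind-like: positional assignment runs without error but misaligns values
positional <- docs
positional$label <- meta$label

# merge-like: look up each document's key in the metadata instead
keyed <- docs
keyed$label <- meta$label[match(docs$id, meta$id)]

print(positional)  # d1 is labelled "third", and "stray" lands on d3
print(keyed)       # d1 gets "first", d2 gets NA, d3 gets "third"
```

The positional version only works when both sides happen to share length and order. The keyed version leaves unmatched documents as NA and drops metadata rows that have no document.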

kbenoit commented 7 years ago

Well, we could modify + for signature corpus, data.frame so that it performs a left join automatically based on the docname as a key. But first let me make sure I have understood.

You want the following:

> docvars(data_corpus_irishbudget2010)
                                      year debate number      foren     name party
2010_BUDGET_01_Brian_Lenihan_FF       2010 BUDGET     01      Brian  Lenihan    FF
2010_BUDGET_02_Richard_Bruton_FG      2010 BUDGET     02    Richard   Bruton    FG
2010_BUDGET_03_Joan_Burton_LAB        2010 BUDGET     03       Joan   Burton   LAB
2010_BUDGET_04_Arthur_Morgan_SF       2010 BUDGET     04     Arthur   Morgan    SF
2010_BUDGET_05_Brian_Cowen_FF         2010 BUDGET     05      Brian    Cowen    FF
2010_BUDGET_06_Enda_Kenny_FG          2010 BUDGET     06       Enda    Kenny    FG
2010_BUDGET_07_Kieran_ODonnell_FG     2010 BUDGET     07     Kieran ODonnell    FG
2010_BUDGET_08_Eamon_Gilmore_LAB      2010 BUDGET     08      Eamon  Gilmore   LAB
2010_BUDGET_09_Michael_Higgins_LAB    2010 BUDGET     09    Michael  Higgins   LAB
2010_BUDGET_10_Ruairi_Quinn_LAB       2010 BUDGET     10     Ruairi    Quinn   LAB
2010_BUDGET_11_John_Gormley_Green     2010 BUDGET     11       John  Gormley Green
2010_BUDGET_12_Eamon_Ryan_Green       2010 BUDGET     12      Eamon     Ryan Green
2010_BUDGET_13_Ciaran_Cuffe_Green     2010 BUDGET     13     Ciaran    Cuffe Green
2010_BUDGET_14_Caoimhghin_OCaolain_SF 2010 BUDGET     14 Caoimhghin OCaolain    SF

> (df_tomerge <- data.frame(minister = c(1, 1), row.names = c("2010_BUDGET_01_Brian_Lenihan_FF", "2010_BUDGET_11_John_Gormley_Green")))
                                  minister
2010_BUDGET_01_Brian_Lenihan_FF          1
2010_BUDGET_11_John_Gormley_Green        1

## MERGE COMMAND

## RESULT:
                                      year debate number      foren     name party minister
2010_BUDGET_01_Brian_Lenihan_FF       2010 BUDGET     01      Brian  Lenihan    FF        1
2010_BUDGET_02_Richard_Bruton_FG      2010 BUDGET     02    Richard   Bruton    FG       NA
2010_BUDGET_03_Joan_Burton_LAB        2010 BUDGET     03       Joan   Burton   LAB       NA
2010_BUDGET_04_Arthur_Morgan_SF       2010 BUDGET     04     Arthur   Morgan    SF       NA
2010_BUDGET_05_Brian_Cowen_FF         2010 BUDGET     05      Brian    Cowen    FF       NA
2010_BUDGET_06_Enda_Kenny_FG          2010 BUDGET     06       Enda    Kenny    FG       NA
2010_BUDGET_07_Kieran_ODonnell_FG     2010 BUDGET     07     Kieran ODonnell    FG       NA
2010_BUDGET_08_Eamon_Gilmore_LAB      2010 BUDGET     08      Eamon  Gilmore   LAB       NA
2010_BUDGET_09_Michael_Higgins_LAB    2010 BUDGET     09    Michael  Higgins   LAB       NA
2010_BUDGET_10_Ruairi_Quinn_LAB       2010 BUDGET     10     Ruairi    Quinn   LAB       NA
2010_BUDGET_11_John_Gormley_Green     2010 BUDGET     11       John  Gormley Green        1
2010_BUDGET_12_Eamon_Ryan_Green       2010 BUDGET     12      Eamon     Ryan Green       NA
2010_BUDGET_13_Ciaran_Cuffe_Green     2010 BUDGET     13     Ciaran    Cuffe Green       NA
2010_BUDGET_14_Caoimhghin_OCaolain_SF 2010 BUDGET     14 Caoimhghin OCaolain    SF       NA
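One way the placeholder merge step could be filled in, sketched in base R on a plain data.frame rather than on a real corpus. The `dv` object below is a hand-typed stand-in for the first three rows of the docvars shown above, and keying on row names is an assumption taken from this comment, not a confirmed quanteda behaviour.

```r
# Stand-in for the first three rows of the docvars table above
dv <- data.frame(
  name  = c("Lenihan", "Bruton", "Burton"),
  party = c("FF", "FG", "LAB"),
  row.names = c("2010_BUDGET_01_Brian_Lenihan_FF",
                "2010_BUDGET_02_Richard_Bruton_FG",
                "2010_BUDGET_03_Joan_Burton_LAB"),
  stringsAsFactors = FALSE
)

df_tomerge <- data.frame(
  minister = c(1, 1),
  row.names = c("2010_BUDGET_01_Brian_Lenihan_FF",
                "2010_BUDGET_11_John_Gormley_Green")
)

# Left join keyed on row names: for every document, find its row in df_tomerge
idx <- match(rownames(dv), rownames(df_tomerge))

merged <- dv
for (v in names(df_tomerge)) {
  merged[[v]] <- df_tomerge[[v]][idx]  # NA where the document has no match
}

print(merged)
```

Because the Gormley row has no counterpart in this three-document stand-in, it is dropped, which is the left-join behaviour requested in the thread.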

conjugateprior commented 7 years ago

Yes, that would do it.

Two small caveats.

  1. Seems awkward to be required to key on rownames, but that's a minor thing. I guess it ensures they're unique :-)
  2. Making + non-commutative looks like trouble, unless you're thinking of type-distinguished (corpus, data.frame) and (data.frame, corpus) implementations of it.

kbenoit commented 7 years ago

OK, thinking about options for syntax:

  1. It could qualify for the corpus_something() grammar since it takes a corpus as the main argument, and returns a modified corpus. Something like:

    corpus_joinvars(thecorpus, newdocvars_data.frame, by = NULL)

    where the default is to join by docnames() (and row.names for the data.frame), but can be set in the same way that dplyr::left_join() works.

  2. Since it sets docvars for a corpus (through a left join), it might be more appropriate to be a variant of the docvars() command. For instance:

    docvars(thecorpus, merge_source = newdocvars_data.frame, by = NULL)

    or maybe some clever adaptation of the docvars<-() replacement function?
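To make the `by = NULL` idea concrete, here is a rough base R sketch of what such a function might do. `join_docvars()` is a hypothetical name that works on plain data.frames; it is not an existing quanteda function, and same-named columns are simply replaced here, leaving the overwrite question aside.

```r
# Hypothetical helper, not part of quanteda: left-joins `newvars` onto `dv`.
# by = NULL keys on row names; otherwise `by` names a column present in both.
join_docvars <- function(dv, newvars, by = NULL) {
  if (is.null(by)) {
    left_key  <- rownames(dv)
    right_key <- rownames(newvars)
  } else {
    stopifnot(by %in% names(dv), by %in% names(newvars))
    left_key  <- dv[[by]]
    right_key <- newvars[[by]]
  }
  if (anyDuplicated(right_key) > 0) stop("join key is not unique in `newvars`")

  idx <- match(left_key, right_key)      # row order of `dv` is preserved
  for (v in setdiff(names(newvars), by)) {
    dv[[v]] <- newvars[[v]][idx]
  }
  dv
}

dv <- data.frame(President = c("Bush", "Clinton", "Obama"),
                 Year = c(1989, 1993, 2009),
                 row.names = c("1989-Bush", "1993-Clinton", "2009-Obama"),
                 stringsAsFactors = FALSE)

# Join on a shared column
by_name <- data.frame(President = c("Obama", "Bush"),
                      Party = c("Democratic", "Republican"),
                      stringsAsFactors = FALSE)
out_col <- join_docvars(dv, by_name, by = "President")

# Join on row names (the default)
by_row <- data.frame(flag = c(TRUE, FALSE),
                     row.names = c("2009-Obama", "1989-Bush"))
out_row <- join_docvars(dv, by_row)
```

Requiring a unique key on the right side keeps the result at one row per document, which a corpus needs.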

kbenoit commented 7 years ago

How about this:

Using S4 methods with multiple dispatch will allow us to distinguish these two methods (even with S3 objects). Order from chaos.
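A small sketch of the multiple-dispatch idea, under some assumptions: `toy_corpus` is a made-up S3 class standing in for a corpus, and the methods are attached to a named generic (`join_vars`, also hypothetical) rather than to `+`, to sidestep any question of how operator dispatch treats pure S3 objects.

```r
library(methods)

# Register the toy S3 class so that S4 signatures can refer to it
setOldClass("toy_corpus")

new_toy_corpus <- function(docvars) {
  structure(list(docvars = docvars), class = "toy_corpus")
}

setGeneric("join_vars", function(x, y) standardGeneric("join_vars"))

# (corpus, data.frame): left join the data.frame onto the corpus docvars
setMethod("join_vars", signature("toy_corpus", "data.frame"), function(x, y) {
  idx <- match(rownames(x$docvars), rownames(y))
  for (v in names(y)) x$docvars[[v]] <- y[[v]][idx]
  x
})

# (data.frame, corpus): the same join with the arguments swapped
setMethod("join_vars", signature("data.frame", "toy_corpus"), function(x, y) {
  join_vars(y, x)
})

corp  <- new_toy_corpus(data.frame(party = c("FF", "FG"),
                                   row.names = c("doc1", "doc2"),
                                   stringsAsFactors = FALSE))
extra <- data.frame(minister = 1, row.names = "doc1")

a <- join_vars(corp, extra)
b <- join_vars(extra, corp)
```

Both argument orders reach a method of their own, so the result is the same corpus either way, which addresses the commutativity worry raised earlier.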

conjugateprior commented 7 years ago

Four questions and proposed answers for the semantics of + with corpus 'corp' and data.frame 'newdocvars'.

  1. Are matches determined exclusively by (keyed on) the rownames of the corpus and the data.frame?
  2. Does + left join, ignoring docvars for which there is no corpus document?
  3. Does + overwrite values or create a new renamed variable for the intersection of colnames(docvars(corp)) and colnames(newdocvars)?
  4. Does + maintain docvars(corp) variable classes? (factor is the only hard case)

Proposal:

  1. Yes (since you've decided to do this elsewhere)
  2. Yes
  3. Yes. Specifically, option iii of the following:
    i. making a new docvar column with an adjusted name (messy and prevents building up docvar values in stages or partially)
    ii. only overwriting when the old corpus docvar has an NA in place (silent and confusing)
    iii. overwriting all elements of the old corpus docvar (simplest semantics, allows building up docvar values in stages or partially)
    iv. overwriting all elements of the old corpus docvar unless the newdocvar value is NA, in which case keeping the old corpus var (asymmetric semantics but otherwise like iii)
  4. Yes. Cases to consider are all combinations of numeric, character, and factor types for old corpus docvars and newdocvars. Abbreviate them N, C, and F, so <N,F> is an originally numeric corpus docvar meeting the factor in newdocvars that shares its name.
    i. <N,N> overwrite as above, keep class N
    ii. <N,F> complain and stop
    iii. <N,C> complain and stop
    iv. <C,N> complain and stop
    v. <C,F> convert F to C and overwrite as above, keeping class C
    vi. <C,C> overwrite as above, keep class C
    vii. <F,N> complain and stop
    viii. <F,F> convert both to character, overwrite as above, convert result to F (possibly creating and destroying labels)
    ix. <F,C> convert F to C, overwrite as above, convert result to F (possibly creating and destroying labels)
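The class rules above can be sketched as a single function. This is one reading of the proposal, not an agreed implementation: `overwrite_docvar()` is a hypothetical name, `matched` marks the documents that had a row in the new data, and "overwrite" is taken to mean replacing every matched element (option iii), including with NA.

```r
# Hypothetical helper: apply the proposed <old,new> class rules to one docvar.
# `old` and `new` have one element per document; `matched` flags documents
# that were found in the new data (unmatched documents keep their old value).
overwrite_docvar <- function(old, new, matched = rep(TRUE, length(old))) {
  kind <- function(x) {
    if (is.factor(x)) return("F")
    if (is.character(x)) return("C")
    if (is.numeric(x)) return("N")
    stop("unsupported docvar class: ", class(x)[1])
  }
  pair <- paste0(kind(old), kind(new))
  if (!pair %in% c("NN", "CF", "CC", "FF", "FC")) {
    stop("incompatible docvar classes <", kind(old), ",", kind(new), ">")
  }
  # Factors are handled as character and converted back at the end
  out <- if (is.numeric(old)) old else as.character(old)
  val <- if (is.numeric(new)) new else as.character(new)
  out[matched] <- val[matched]           # option iii: overwrite matched elements
  if (is.factor(old)) factor(out) else out
}

hit <- c(TRUE, FALSE, TRUE)
num <- overwrite_docvar(c(1, 2, 3), c(9, NA, 7), matched = hit)
chr <- overwrite_docvar(c("a", "b"), factor(c("x", "y")))
fac <- overwrite_docvar(factor(c("a", "b", "c")), c("z", NA, "c"), matched = hit)
bad <- try(overwrite_docvar(c(1, 2, 3), c("a", "b", "c")), silent = TRUE)
```

Changing the assignment to use `matched & !is.na(val)` would give option iv instead. In the factor case the level "a" disappears and "z" appears, which is the label creation and destruction mentioned in the list.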

Some discussion of the semantics of factor conversion would be useful.
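As a starting point for that discussion, a short base R illustration of what happens to levels under a few conversion strategies. The party labels are only sample values, and none of this reflects a decision made in the thread.

```r
old <- factor(c("FF", "FG", "LAB"), levels = c("FF", "FG", "LAB", "SF"))

# Assigning an unknown label directly to a factor yields NA (with a warning)
direct <- suppressWarnings(replace(old, 1, "Green"))

# Round trip through character, as in the <F,C> rule
vals <- as.character(old)
vals[1] <- "Green"

# Strategy A: rebuild the factor from the data alone.
# Unused levels ("FF", "SF") vanish and the rest are re-sorted.
refit <- factor(vals)

# Strategy B: keep the old levels in their old order and append new ones
kept <- factor(vals, levels = union(levels(old), vals))

print(levels(refit))
print(levels(kept))
```

Strategy A is what a plain character round trip does and is where labels get created and destroyed. Strategy B preserves the existing level set, at the cost of keeping levels that may no longer occur.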

conjugateprior commented 7 years ago

Second suggestion: All this goes into an augmented docvars command instead: docvars(corp) <- newdocvars. All the same questions would need answering for this, so it seems to be an orthogonal question.

conjugateprior commented 7 years ago

@kbenoit Thoughts on these semantics or should I assume they're fine and send a PR?

kbenoit commented 7 years ago

Insofar as I understood it fully, let's implement your answers to the scheme above. I'd say that the docvars class should be the left side, i.e. the existing variable, and if this is not compatible in the ways you list, then complain and stop.

You mention a PR - great if you code this!

kbenoit commented 6 years ago

Update: The solution to this could be part of quanteda/quanteda#1214. It could also be solved by the idea of creating a quanteda.dplyr extension package as described in quanteda/quanteda#1171, quanteda/quanteda#529.

kbenoit commented 4 years ago

@conjugateprior with the new package this should be pretty easy to implement now. I'm adding it to the list.

mpazpiroz commented 2 years ago

@kbenoit I was wondering if there is a solution to the question in this thread? I have been unsuccessful in trying to add external variables to a corpus object. Thanks!