Open conjugateprior opened 7 years ago
Thanks. More generally (and basically):
str(corpus("this is my single document"))
## Error in `[[.corpus`(object, 1L) :
## cannot index docvars this way because none exist
But keep in mind this, from ?corpus:
A warning on accessing corpus elements
A corpus currently consists of an S3 specially classed list of elements, but you should not access these elements directly. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change (as it inevitably will as we continue to develop the package, including moving corpus objects to the S4 class system).
😉
@conjugateprior Refresh with the latest GitHub version and try it now.
FYI I was str-ing in the first place so I could sketch out a corpus_merge_docvars function that I have now needed several times and hacked around. (Something like tmaptools::append_data). If you're planning such a function, let me know and I won't duplicate the work.
In the meantime I'll wait until the innards settle down.
Have you seen the + and c methods for the corpus class? Might be what you are after.
corpus1 <- corpus_subset(data_corpus_inaugural, President == "Bush")
corpus2 <- corpus_subset(data_corpus_inaugural, President == "Clinton")
docvars(corpus2, "newvar") <- "Added to Clinton"
corpus3 <- corpus_subset(data_corpus_inaugural, President == "Obama")
docvars(corpus3, "newvar") <- "Added to Obama"
docvars(c(corpus1, corpus2, corpus3))
## Year President FirstName newvar
## 1989-Bush 1989 Bush George <NA>
## 2001-Bush 2001 Bush George W. <NA>
## 2005-Bush 2005 Bush George W. <NA>
## 1993-Clinton 1993 Clinton Bill Added to Clinton
## 1997-Clinton 1997 Clinton Bill Added to Clinton
## 2009-Obama 2009 Obama Barack Added to Obama
## 2013-Obama 2013 Obama Barack Added to Obama
docvars(corpus2 + corpus3)
## Year President FirstName newvar
## 1993-Clinton 1993 Clinton Bill Added to Clinton
## 1997-Clinton 1997 Clinton Bill Added to Clinton
## 2009-Obama 2009 Obama Barack Added to Obama
## 2013-Obama 2013 Obama Barack Added to Obama
If not, consider a PR that operates using accessor functions (try methods(class = "corpus") for a list), or just describe what you are looking for and we could add it.
Definitely not + or c.
As in the SpatialPolygonDataFrame function I linked to above, it's about having possibly incomplete or overcomplete hand-constructed document metadata in a data.frame and (left) joining it with a corpus object via a key that is a corpus docvar on the left side and a regular data.frame column on the right side.
Currently it seems one must have the external metadata go in column by column and hope it lines up with the exact ordering of documents in the corpus. This has bitten me several times already. Hence the desire for a merge-like function rather than a cbind-like function to do that.
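The ordering problem described here can be illustrated without any corpus at all. What follows is a minimal base R sketch, assuming that doc_ids stands in for the output of docnames() and that meta is a hand-built metadata table. Both are made-up objects, not anything from the thread.

```r
# Hypothetical stand-ins: document names in corpus order, and hand-built
# metadata that is incomplete and sorted differently.
doc_ids <- c("doc_a", "doc_b", "doc_c")
meta <- data.frame(id = c("doc_c", "doc_a"),
                   minister = c(1, 0),
                   stringsAsFactors = FALSE)

# A cbind-like assignment needs one value per document, already in corpus
# order. A merge-like assignment looks each document up by key instead.
idx <- match(doc_ids, meta$id)   # NA where a document has no metadata row
aligned <- meta$minister[idx]
names(aligned) <- doc_ids
aligned
```

Documents with no matching metadata row come back as NA rather than silently shifting the remaining values, which is the behaviour a left join would give.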
Well, we could modify + for signature corpus, data.frame so that it performs a left join automatically based on the docname as a key. But first let me make sure I have understood.
You want the following:
> docvars(data_corpus_irishbudget2010)
year debate number foren name party
2010_BUDGET_01_Brian_Lenihan_FF 2010 BUDGET 01 Brian Lenihan FF
2010_BUDGET_02_Richard_Bruton_FG 2010 BUDGET 02 Richard Bruton FG
2010_BUDGET_03_Joan_Burton_LAB 2010 BUDGET 03 Joan Burton LAB
2010_BUDGET_04_Arthur_Morgan_SF 2010 BUDGET 04 Arthur Morgan SF
2010_BUDGET_05_Brian_Cowen_FF 2010 BUDGET 05 Brian Cowen FF
2010_BUDGET_06_Enda_Kenny_FG 2010 BUDGET 06 Enda Kenny FG
2010_BUDGET_07_Kieran_ODonnell_FG 2010 BUDGET 07 Kieran ODonnell FG
2010_BUDGET_08_Eamon_Gilmore_LAB 2010 BUDGET 08 Eamon Gilmore LAB
2010_BUDGET_09_Michael_Higgins_LAB 2010 BUDGET 09 Michael Higgins LAB
2010_BUDGET_10_Ruairi_Quinn_LAB 2010 BUDGET 10 Ruairi Quinn LAB
2010_BUDGET_11_John_Gormley_Green 2010 BUDGET 11 John Gormley Green
2010_BUDGET_12_Eamon_Ryan_Green 2010 BUDGET 12 Eamon Ryan Green
2010_BUDGET_13_Ciaran_Cuffe_Green 2010 BUDGET 13 Ciaran Cuffe Green
2010_BUDGET_14_Caoimhghin_OCaolain_SF 2010 BUDGET 14 Caoimhghin OCaolain SF
> (df_tomerge <- data.frame(minister = c(1, 1), row.names = c("2010_BUDGET_01_Brian_Lenihan_FF", "2010_BUDGET_11_John_Gormley_Green")))
minister
2010_BUDGET_01_Brian_Lenihan_FF 1
2010_BUDGET_11_John_Gormley_Green 1
## MERGE COMMAND
## RESULT:
year debate number foren name party minister
2010_BUDGET_01_Brian_Lenihan_FF 2010 BUDGET 01 Brian Lenihan FF 1
2010_BUDGET_02_Richard_Bruton_FG 2010 BUDGET 02 Richard Bruton FG NA
2010_BUDGET_03_Joan_Burton_LAB 2010 BUDGET 03 Joan Burton LAB NA
2010_BUDGET_04_Arthur_Morgan_SF 2010 BUDGET 04 Arthur Morgan SF NA
2010_BUDGET_05_Brian_Cowen_FF 2010 BUDGET 05 Brian Cowen FF NA
2010_BUDGET_06_Enda_Kenny_FG 2010 BUDGET 06 Enda Kenny FG NA
2010_BUDGET_07_Kieran_ODonnell_FG 2010 BUDGET 07 Kieran ODonnell FG NA
2010_BUDGET_08_Eamon_Gilmore_LAB 2010 BUDGET 08 Eamon Gilmore LAB NA
2010_BUDGET_09_Michael_Higgins_LAB 2010 BUDGET 09 Michael Higgins LAB NA
2010_BUDGET_10_Ruairi_Quinn_LAB 2010 BUDGET 10 Ruairi Quinn LAB NA
2010_BUDGET_11_John_Gormley_Green 2010 BUDGET 11 John Gormley Green 1
2010_BUDGET_12_Eamon_Ryan_Green 2010 BUDGET 12 Eamon Ryan Green NA
2010_BUDGET_13_Ciaran_Cuffe_Green 2010 BUDGET 13 Ciaran Cuffe Green NA
2010_BUDGET_14_Caoimhghin_OCaolain_SF 2010 BUDGET 14 Caoimhghin OCaolain SF NA
Yes, that would do it.
Two small caveats.
A non-commutative + looks like trouble, unless you're thinking of type-distinguished (corpus, data.frame) and (data.frame, corpus) implementations of it.
OK, thinking about options for syntax:
It could qualify for the corpus_something() grammar since it takes a corpus as the main argument, and returns a modified corpus. Something like:
corpus_joinvars(thecorpus, newdocvars_data.frame, by = NULL)
where the default is to join by docnames() (and row.names for the data.frame), but can be set in the same way that dplyr::left_join() works.
Since it sets docvars for a corpus (through a left join), it might be more appropriate to be a variant of the docvars() command. For instance:
docvars(thecorpus, merge_source = newdocvars_data.frame, by = NULL)
or maybe some clever adaptation of the <-.docvars() function?
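Either syntax would need the same join logic underneath. Here is a rough sketch of that logic in base R, assuming a helper called join_docvars. The name and signature are hypothetical; neither it nor corpus_joinvars is an existing quanteda function. It works on plain data frames, so it runs without the package.

```r
# Hypothetical helper (not part of quanteda): left-join the columns of
# `newvars` onto the document variables `dv`.
#   by = NULL : match `docnames` against row.names(newvars)
#   by = "x"  : match column x of dv against column x of newvars
join_docvars <- function(dv, docnames, newvars, by = NULL) {
  stopifnot(nrow(dv) == length(docnames))
  stopifnot(is.null(by) || (by %in% names(dv) && by %in% names(newvars)))
  left_key  <- if (is.null(by)) docnames else dv[[by]]
  right_key <- if (is.null(by)) row.names(newvars) else newvars[[by]]
  if (anyDuplicated(right_key)) stop("join key is not unique in newvars")

  new_cols <- setdiff(names(newvars), by)
  clash <- intersect(new_cols, names(dv))
  if (length(clash) > 0) {
    stop("columns already present: ", paste(clash, collapse = ", "))
  }

  idx <- match(left_key, right_key)   # NA for documents without a match
  for (nm in new_cols) dv[[nm]] <- newvars[[nm]][idx]
  dv
}

# Three of the document names from the example above.
ids <- c("2010_BUDGET_01_Brian_Lenihan_FF",
         "2010_BUDGET_02_Richard_Bruton_FG",
         "2010_BUDGET_11_John_Gormley_Green")
dv <- data.frame(party = c("FF", "FG", "Green"), stringsAsFactors = FALSE)
newvars <- data.frame(minister = c(1, 1), row.names = ids[c(1, 3)])

joined <- join_docvars(dv, ids, newvars)
joined
```

With quanteda loaded, the result could presumably be written back through the accessors, along the lines of docvars(thecorpus) <- join_docvars(docvars(thecorpus), docnames(thecorpus), newvars). That line is untested here. In this sketch a name clash simply stops; how clashes should really be handled is an open design question.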
How about this:
+ functionality for corpus objects.)
Using S4 methods with multiple dispatch will allow us to distinguish these two methods (even with S3 objects). Order from chaos.
Four questions and proposed answers for the semantics of + with corpus 'corp' and data.frame 'newdocvars'.
rownames of corpus and data.frame?
+ left join, ignoring docvars for which there is no corpus document?
+ overwrite values or create a new renamed variable for the intersection of colnames(docvars(corp)) and colnames(newdocvars)?
+ maintain docvars(corp) variable classes? (factor is the only hard case)
Proposal:
numeric, character, and factor types for old corpus docvars and newdocvars. Abbreviate them N, F, and S so <N,F> is an originally numeric corpus docvar meeting the factor in newdocvars that shares its name.
Some discussion of the semantics of factor conversion would be useful.
Second suggestion: All this goes into an augmented docvars command instead: docvars(corp) <- newdocvars. All the same questions would need answering for this, so it seems to be an orthogonal question.
@kbenoit Thoughts on these semantics or should I assume they're fine and send a PR?
Insofar as I understood it fully, let's implement your answers to the scheme above. I'd say that the docvars class should be the left side, i.e. the existing variable, and if this is not compatible in the ways you list, then complain and stop.
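The "left side wins, otherwise complain and stop" rule could look something like the sketch below. The compatibility rules used here are assumptions made for illustration, since the thread's own list of cases is not shown, and coerce_to_existing is a hypothetical name.

```r
# Hypothetical rule set (an assumption, not the thread's agreed semantics):
# the existing docvar's class wins, and any coercion that would lose
# information is an error.
coerce_to_existing <- function(old, new) {
  if (is.factor(old)) {
    # Assumed rule: new values must already be levels of the old factor.
    bad <- !is.na(new) & !(as.character(new) %in% levels(old))
    if (any(bad)) stop("new values are not levels of the existing factor")
    return(factor(as.character(new), levels = levels(old)))
  }
  if (is.numeric(old)) {
    # Assumed rule: every non-missing new value must parse as a number.
    out <- suppressWarnings(as.numeric(as.character(new)))
    if (any(is.na(out) & !is.na(new))) stop("new values are not numeric")
    return(out)
  }
  as.character(new)   # an existing character docvar accepts anything
}

num_res <- coerce_to_existing(c(1, 2), c("3", "4"))
fac_res <- coerce_to_existing(factor(c("a", "b")), c("b", NA))
chr_res <- coerce_to_existing(c("x", "y"), factor(c("u", "v")))
```

A call such as coerce_to_existing(1, "x") stops with an error instead of quietly producing NA, which is one reading of "complain and stop".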
You mention a PR - great if you code this!
Update: The solution to this could be part of quanteda/quanteda#1214. It could also be solved by the idea of creating a quanteda.dplyr extension package as described in quanteda/quanteda#1171, quanteda/quanteda#529.
@conjugateprior with the new package this should be pretty easy to implement now. I'm adding it to the list.
@kbenoit I was wondering if there is a solution to the question in this thread? I have been unsuccessful in trying to add external variables to a corpus object. Thanks!
This works
but this doesn't
apparently because there are no docvars
Seems like it should be possible to make a docvar-free corpus though.