lmullen opened this issue 7 years ago
i second that, hope to maybe get some inspiration for tm.plugin.koRpus.
I also think this would be great, we need to decide on a canonical sparse matrix format :)
Following up on @matthewjdenny, the integration of document metadata with that sparse matrix format is the issue for me.
I've been doing a lot of spatial work lately, and I think the discussion in that community contrasting the older sp::SpatialPolygonsDataFrame style of data representation with the newer sf style could be very useful to us.
I agree with both of @conjugateprior's points. The move from list/S3/S4 style objects to data frame-centric objects in spatial analysis is a good model. The added difficulty will be finding ways to tie a data frame with metadata to a sparse matrix.
I think this is a great topic to discuss as well, and I agree the right approach is setting up data frame type objects rather than complex S3/S4 objects. I don't think using sparse matrices would be the right base format; there should be functions to convert dense representations to sparse matrices when they are needed.
I actually wrote a paper about my attempts to do this with the Stanford CoreNLP annotations, wrapped up in the package cleanNLP, which you can access here if interested: http://taylorarnold.net/cleanNLP_paper.pdf. I think it works well for CoreNLP, but it really needs some revising to work as a data model for other package outputs.
OK, I see two sub-threads developing here, one related to @conjugateprior's request for meta-data (document-level) in a document-term/feature matrix. Since it's inevitable that we start pointing out features in our own packages, I might as well plunge in.
We added that to the current dev version of quanteda a few weeks ago. Unless you disable it (e.g. for space reasons), tokens and dfm objects both carry document variables, and subsetting and other operations that select on documents will trim these as appropriate.
require(quanteda)
# Loading required package: quanteda
# quanteda version 0.9.9.42
# Using 7 of 8 cores for parallel computing
#
# Attaching package: ‘quanteda’
#
# The following object is masked from ‘package:utils’:
#
# View
tail(docvars(data_corpus_inaugural))
# Year President FirstName
# 1997-Clinton 1997 Clinton Bill
# 2001-Bush 2001 Bush George W.
# 2005-Bush 2005 Bush George W.
# 2009-Obama 2009 Obama Barack
# 2013-Obama 2013 Obama Barack
# 2017-Trump 2017 Trump Donald J.
data_dfm_inaugural <- dfm(data_corpus_inaugural)
head(docvars(data_dfm_inaugural))
# Year President FirstName
# 1789-Washington 1789 Washington George
# 1793-Washington 1793 Washington George
# 1797-Adams 1797 Adams John
# 1801-Jefferson 1801 Jefferson Thomas
# 1805-Jefferson 1805 Jefferson Thomas
# 1809-Madison 1809 Madison James
My comment on the other sub-thread is below.
Other sub-thread: What should be a canonical data format?
Here I think there will only ever be consensus answers on what is the core data object. Specific packages will want to wrap around these as needed. Authors should write coercion functions to make it easy to switch between their higher-level objects and the core data objects, to promote inter-operability.
Good luck with that one.
- readtext: a data.frame.
- quanteda: (currently) a list including a data.frame similar to that in readtext.
- tm: a complex nested list.
- koRpus: another list format.
A data.frame is probably the core object, maybe with the text field called text?
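To make that concrete, here is a minimal sketch of what a data.frame-based corpus might look like (the doc_id/text column names follow readtext's convention; the metadata column is just an illustration):

corpus_df <- data.frame(
  doc_id = c("doc1", "doc2"),                   # one row per document
  text   = c("This is the first document.",    # the text itself lives in a column
             "And this is the second."),
  year   = c(1789, 1793),                       # arbitrary document-level metadata
  stringsAsFactors = FALSE
)
str(corpus_df)
# 'data.frame': 2 obs. of  3 variables:
#  $ doc_id: chr  "doc1" "doc2"
#  $ text  : chr  "This is the first document." "And this is the second."
#  $ year  : num  1789 1793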
The basic structure should be a named list of character vectors, as per the tokenizers package. If you need a wrapper, then always write a coercion function to get back to the named list of character vectors. Example:
require(quanteda)
toks <- tokens(c(doc1 = "This is a sample: of tokens.",
                 doc2 = "Another sentence, to demonstrate how tokens works."))
str(toks)
# List of 2
# $ doc1: chr [1:8] "This" "is" "a" "sample" ...
# $ doc2: chr [1:9] "Another" "sentence" "," "to" ...
# - attr(*, "class")= chr [1:2] "tokens" "tokenizedTexts"
# - attr(*, "types")= chr [1:15] "This" "is" "a" "sample" ...
# - attr(*, "what")= chr "word"
# - attr(*, "ngrams")= int 1
# - attr(*, "concatenator")= chr ""
# - attr(*, "padding")= logi FALSE
str(unclass(toks))
# List of 2
# $ doc1: int [1:8] 1 2 3 4 5 6 7 8
# $ doc2: int [1:9] 9 10 11 12 13 14 7 15 8
# - attr(*, "types")= chr [1:15] "This" "is" "a" "sample" ...
# - attr(*, "what")= chr "word"
# - attr(*, "ngrams")= int 1
# - attr(*, "concatenator")= chr ""
# - attr(*, "padding")= logi FALSE
str(as.list(toks))
# List of 2
# $ doc1: chr [1:8] "This" "is" "a" "sample" ...
# $ doc2: chr [1:9] "Another" "sentence" "," "to" ...
and you can go back to tokens from a list of characters:
as.tokens(as.list(toks))
# tokens from 2 documents.
# doc1 :
# [1] "This" "is" "a" "sample" ":" "of" "tokens" "."
#
# doc2 :
# [1] "Another" "sentence" "," "to" "demonstrate" "how" "tokens" "works" "."
Here I vote for using the excellent Matrix package, with dgCMatrix as the default format. It would be great if there were a diCMatrix (integer format), but that seems to be only a development aspiration at the moment.
But it is also easy to coerce this to a dense matrix:
as.matrix(data_dfm_LBGexample[1:5, 1:10])
features
docs A B C D E F G H I J
R1 2 3 10 22 45 78 115 146 158 146
R2 0 0 0 0 0 2 3 10 22 45
R3 0 0 0 0 0 0 0 0 0 0
R4 0 0 0 0 0 0 0 0 0 0
R5 0 0 0 0 0 0 0 0 0 0
and the more explicit convert() can convert a quanteda dfm to other formats, e.g. for the lda, tm, stm, austin, topicmodels, and lsa packages.
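For illustration, outside of any particular package, a small document-term matrix can be built directly as a dgCMatrix with the Matrix package (the documents, features, and counts here are made up):

library(Matrix)
# triplet-style input: document index i, feature index j, count x
dtm <- sparseMatrix(i = c(1, 1, 2),
                    j = c(1, 2, 2),
                    x = c(2, 1, 3),
                    dims = c(2, 3),
                    dimnames = list(c("doc1", "doc2"),
                                    c("alpha", "beta", "gamma")))
class(dtm)
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"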
I like the idea of making these as close to existing R models, e.g. lm() and glm(), as possible: with methods for predict, coefficients, etc. wherever possible.
See
> methods(class = "lm")
[1] add1 alias anova case.names coerce confint cooks.distance deviance dfbeta
[10] dfbetas drop1 dummy.coef effects extractAIC family formula hatvalues influence
[19] initialize kappa labels logLik model.frame model.matrix nobs plot predict
[28] print proj qr residuals rstandard rstudent show simulate slotsFromS3
[37] summary variable.names vcov
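As a rough sketch of what following those conventions could look like, here is a hypothetical class (the name textmodel_demo and the least-squares fit are made up for illustration, not taken from any package) that plays by the lm()/glm() rules:

textmodel_demo <- function(x, y) {
  # x: a (dense) document-feature matrix; y: a numeric outcome per document
  beta <- solve(crossprod(x), crossprod(x, y))   # simple least-squares fit
  structure(list(coefficients = drop(beta)), class = "textmodel_demo")
}
coef.textmodel_demo <- function(object, ...) object$coefficients
predict.textmodel_demo <- function(object, newdata, ...) {
  drop(newdata %*% object$coefficients)          # scores for new documents
}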
My vote here is for the excellent stringi package. It really ought to replace all equivalent functions in the base package, since they are smarter and faster than their base R counterparts, and fully Unicode compliant.
The regular expression syntax is slightly different and I look forward to learning a bit more about that at the workshop.
And all character objects ought to be UTF-8!
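A few small examples of why stringi is attractive here, using some of its well-known functions:

library(stringi)
x <- c("Grüße aus Wien", "ΚΑΛΗΜΕΡΑ ΚΟΣΜΕ")
stri_trans_tolower(x)   # Unicode-aware case mapping
stri_enc_isutf8(x)      # check that the strings are valid UTF-8
stri_count_words(x)     # word counts via Unicode text segmentation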
I heartily add my vote for the Matrix and stringi packages as the basis for DTMs and string handling, respectively!
I like the idea of canonical core data. There will probably always be reasons to create different corpus objects, but if we can decide on core objects that are broad enough to encompass our specific corpus implementations, then each individual package would only need to be able to convert from and to the core object (notwithstanding the possibility to add specific one-on-one conversions if these are more efficient).
So, would it be worthwhile to consider creating an R package with canonical core data objects, a nifty convert function and some good ideas for class inheritance?
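As a minimal sketch of what the conversion side of such a package could look like (all names here are hypothetical):

# a single generic that every package could provide methods for,
# converting its own classes to and from the agreed core objects
convert <- function(x, to = c("core_corpus", "core_tokens", "core_dfm"), ...) {
  UseMethod("convert")
}
convert.default <- function(x, to, ...) {
  stop("no conversion defined for objects of class ", class(x)[1L])
}
# each package would then register methods for its own classes, and the
# core package would supply constructors and validators for the core objects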
Just quick notes; I will write a detailed post later. There should be different core structures for different levels. I mostly agree with @kbenoit about the levels.
Some thoughts: dgCMatrix is the main workhorse in Matrix, but all the packages first create the matrix in coordinate format. dgCMatrix is good for computations, but conversion from the coordinate dgTMatrix to other formats will be cheaper. For example, recently I have been using dgRMatrix more actively. I have thought about this for a while, but am still not sure what the best approach is.

I generally agree with the data structures listed by @dselivanov. Some thoughts on these:
The specific variables for these objects are, I think, totally an open question, but I personally think talking about schemas rather than objects is likely the right approach.
One alternative to one-size-fits-all would be: define an interoperability format for each core object type that each package can use for I/O. Each package can provide coercion functions to that, e.g. our hashed tokens object becomes a list identical to the return from tokenizers::tokenize_words() using quanteda::as.list(quanteda_tokens_object).
Define an interoperability format for each core object type that each package can use for I/O. Each package can provide coercion functions to that.
i am all for this approach. moving away from S4 classes and methods within koRpus would result in a complete re-write of the package -- i hardly consider this an option ;-)
it is also strategically the wiser move to define intermediate formats. for example, coercion functions can be implemented by someone who is not even the original package author and released as a new package. that can be done for any kind of package that is of interest to anyone in the community.
it would be great to have a collection of example data and the expected results in such a format, so that package authors can validate their implementation.
I agree that an interoperability format would be another option. There is one extant option for representing annotated text that has a modest amount of popularity within computational linguistics: CoNLL-X/U. It's absolutely awful for actually doing an analysis directly given all the type overriding, but could work for moving data in between packages (that was its original intention anyway).
There is one extant option for representing annotated text that has a modest amount of popularity within computational linguistics: CoNLL-X/U.
looks interesting. the structure is not so different from the data.frames that koRpus already carries in its tokenized text objects.
this opens the box even further -- interoperability not only between R packages, but all kinds of applications in general (e.g., TCF).
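For reference, here is a tiny hand-made CoNLL-U fragment read into a plain data.frame, which makes the ten fixed, tab-separated columns visible:

conllu <- "1\tThis\tthis\tPRON\tDT\t_\t2\tnsubj\t_\t_
2\tworks\twork\tVERB\tVBZ\t_\t0\troot\t_\t_"
read.table(text = conllu, sep = "\t", quote = "", header = FALSE,
           col.names = c("id", "form", "lemma", "upos", "xpos",
                         "feats", "head", "deprel", "deps", "misc"))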
Ummm I was thinking more of R formats as @dselivanov was listing above. e.g. tokens = named list, dfm = Matrix object, etc. The idea is to keep it as unencumbered as possible with additional structure - that's what makes it interoperable.
On @unDocUMeantIt's point on S4: Complicated objects are supposed to make life easier, not more complicated, but this only happens if there are methods (print, str, [, [[, etc.) for getting stuff in and out as if they were much simpler objects, and as.simpleobject() methods for converting them. As per my post above of the tokens example, if you study it you'll realise there is something deep and slightly strange about a quanteda tokens object, but it's almost invisible to the user. S4 is great as long as you define methods so that the user never needs to access the slots directly.
I could add that at one point we even redefined str.corpus to prevent users from nosing around inside their metaphorical iPhone cases. We would have kept it except it led to a lot of other strange side effects from functions that rely on str.
S4 is great as long as you define methods so that the user never needs to access the slots directly.
absolutely. there are various getter and setter methods in koRpus, too. for example, you get the data.frame with the tagged text i mentioned earlier when you call taggedText() on that object. the nice thing is that the objects include their own metadata, for example the language of the text, so you can simply call hyphen() on the same object to get it hyphenated, and it will automatically pick the correct hyphenation patterns for the respective language. and you can also update the data.frame via taggedText() <-.
Ummm I was thinking more of R formats as @dselivanov was listing above. e.g. tokens = named list, dfm = Matrix object, etc. The idea is to keep it as unencumbered as possible with additional structure - that's what makes it interoperable.
I see. Minimalism is a good thing to strive for, though I am often working with text data that cannot be stored in-memory and is often distributed over a cluster. Ideally, I'm partial to having a format that can be saved as plain text (JSON, CSV, XML) so that it is easy to save state and load it back again. R can save binary objects of course, but those don't play well with distributed file systems like Hadoop nor can they be mixed well with other languages (another thing that comes up for me).
However, most basic R objects have a plain-text equivalent. Named lists are essentially in a JSON format. Data frames and sparse matrices can be stored as CSV. So I'm not sure these two things are really at odds with one another.
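A quick illustration of that point using jsonlite (this is not any package's internal format, just the plain named-list representation round-tripping through JSON):

library(jsonlite)
toks <- list(doc1 = c("This", "is", "a", "sample"),
             doc2 = c("Another", "sentence"))
cat(toJSON(toks))
# {"doc1":["This","is","a","sample"],"doc2":["Another","sentence"]}
fromJSON(toJSON(toks))   # ...and back to a named list of character vectors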
@statsmaths This would make a good, new issue: How to store, and possibly process, large corpora out-of-memory. Could include issues related to compression. @patperry and @dselivanov have experience with this. I can pull texts from a back-end database but it still needs to fit in memory. (So far 32GB has been more than enough to handle the texts I work with.)
In my own package (not public yet, but hopefully will be before the workshop), I ended up defining a new type on the C side, something like the following:
#include <stdint.h>

struct text {
    uint8_t *ptr;              // UTF-8 encoded data, possibly with JSON-style backslash (\) escapes
    unsigned int has_esc : 1;  // flag indicating whether to interpret '\' as an escape
    unsigned int is_utf8 : 1;  // flag indicating whether the text may decode to non-ASCII
    uint64_t size : 62;        // size of the encoded data, in bytes
};
The important difference from the R character type is the has_esc flag. The presence of this flag allows you to have a text object referring to data stored in a file, without decoding the string and storing it in RAM. (I mmap the file and then let the operating system deal with moving the data between the file and memory whenever necessary.) You can process a multi-gigabyte JSON file transparently, without loading the whole thing into RAM.

The drawback of using a struct text instead of an R character is that you have to convert to an R character whenever you want to interface to another package like stringi, which kills the efficiency and memory advantages. To get around this, I ended up implementing my own Unicode normalization, case folding, and segmentation.
What about is_utf8 → is_ascii?
The advantage of having the flag turn on for non-ASCII is that when you are scanning text from a JSON file, you can start with all flags off. As soon as you see an escape code, the escape flag gets set; as soon as you see a non-ASCII character, the utf8 flag gets set. So, you can do something like the following:
flags = 0
for (ch, ch_flags) in input:
    flags |= ch_flags
where the for loop iterates over all input characters, with ch being the input character and ch_flags being the flags (has_esc and is_utf8) for that character.
and you can also update the data.frame via taggedText() <-
as a side note, thanks to this thread i've added [, [<-, [[, and [[<- methods for all main text object classes in the koRpus package, so you can now treat them like a data.frame.
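For anyone wanting to do something similar, here is a generic sketch (not the actual koRpus code) of delegating [ from an S4 class to a data.frame stored in one of its slots:

setClass("taggedDemo", slots = c(tokens = "data.frame"))
setMethod("[", "taggedDemo", function(x, i, j, ..., drop = TRUE) {
  # pass indexing straight through to the underlying data.frame
  x@tokens[i, j, ..., drop = drop]
})
td <- new("taggedDemo",
          tokens = data.frame(token = c("this", "works"), tag = c("DT", "VBZ")))
td[1, "token"]   # behaves like indexing a data.frame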
I think a key issue to discuss is how to make R text packages interoperable, so that new packages extend functionality rather than compete with one another, and so that objects created in one package can be used with a minimum of conversion in other packages. Some areas to discuss:
A potential output of such a session would be a draft set of best practices, and perhaps a list of key places where objects are not currently interoperable but could be fixed by improvements to existing R packages.