ropensci / textworkshop17

Text Workshop at the London School of Economics, April 2017

Package interoperability #2

Open lmullen opened 7 years ago

lmullen commented 7 years ago

I think a key issue to discuss is how to make R text packages interoperable, so that new packages extend functionality rather than compete with one another, and so that objects created in one package can be used with a minimum of conversion in other packages. Some areas to discuss:

A potential output of such a session would be a draft set of best practices, and perhaps a list of key places where objects are not currently interoperable but could be fixed by improvements to existing R packages.

unDocUMeantIt commented 7 years ago

i second that, hope to maybe get some inspiration for tm.plugin.koRpus.

matthewjdenny commented 7 years ago

I also think this would be great; we need to decide on a canonical sparse matrix format :)

conjugateprior commented 7 years ago

Following up on @matthewjdenny, the integration of document metadata with that sparse thing is the issue for me.

I've been doing a lot of spatial work lately, and I think the discussion in that crowd contrasting the older sp::SpatialPolygonsDataFrame with the newer sf style of data representation could be very useful to us.

lmullen commented 7 years ago

I agree with both of @conjugateprior's points. The move from list/S3/S4 style objects to data frame-centric objects in spatial analysis is a good model. The added difficulty will be finding ways to tie a data frame with metadata to a sparse matrix.

statsmaths commented 7 years ago

I think this is a great topic to discuss as well, and I agree the right approach is setting up data frame type objects rather than complex S3/S4 objects. I don't think using sparse matrices would be the right base format; there should be functions to convert dense representations to sparse matrices when they are needed.

I actually wrote a paper about my attempts to do this with the Stanford CoreNLP annotations, wrapped up in the package cleanNLP, which you can access here if interested: http://taylorarnold.net/cleanNLP_paper.pdf. I think it works well for CoreNLP, but it really needs some revising to work as a data model for other package outputs.

kbenoit commented 7 years ago

OK, I see two sub-threads developing here. One relates to @conjugateprior's request for document-level metadata in a document-term/feature matrix. Since it's inevitable that we start pointing out features of our own packages, I might as well plunge in.

We added that to the current dev version of quanteda a few weeks ago. Unless you disable it (e.g. for space reasons), tokens and dfm objects both carry document-variables, and subsetting and other operations that select on documents will trim these as appropriate.

require(quanteda)
# Loading required package: quanteda
# quanteda version 0.9.9.42
# Using 7 of 8 cores for parallel computing
# 
# Attaching package: ‘quanteda’
# 
# The following object is masked from ‘package:utils’:
#     
#     View

tail(docvars(data_corpus_inaugural))
#              Year President FirstName
# 1997-Clinton 1997   Clinton      Bill
# 2001-Bush    2001      Bush George W.
# 2005-Bush    2005      Bush George W.
# 2009-Obama   2009     Obama    Barack
# 2013-Obama   2013     Obama    Barack
# 2017-Trump   2017     Trump Donald J.

data_dfm_inaugural <- dfm(data_corpus_inaugural)
head(docvars(data_dfm_inaugural))
#                 Year  President FirstName
# 1789-Washington 1789 Washington    George
# 1793-Washington 1793 Washington    George
# 1797-Adams      1797      Adams      John
# 1801-Jefferson  1801  Jefferson    Thomas
# 1805-Jefferson  1805  Jefferson    Thomas
# 1809-Madison    1809    Madison     James
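To illustrate the trimming just described, here is a minimal sketch (assuming corpus_subset() is available in the dev version shown above; the expected docvars are the rows already printed with tail()):

inaug_recent <- corpus_subset(data_corpus_inaugural, Year > 2000)
docvars(inaug_recent)
#            Year President FirstName
# 2001-Bush  2001      Bush George W.
# 2005-Bush  2005      Bush George W.
# 2009-Obama 2009     Obama    Barack
# 2013-Obama 2013     Obama    Barack
# 2017-Trump 2017     Trump Donald J.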

My comment on the other sub-thread is below.

kbenoit commented 7 years ago

Other sub-thread: What should be a canonical data format?

Here I think we will only ever reach consensus on what the core data objects are. Specific packages will want to wrap around these as needed. Authors should write coercion functions that make it easy to switch between their higher-level objects and the core data objects, to promote interoperability.

Corpus

Good luck with that one. The current formats:

  - readtext: a data.frame
  - quanteda: (currently) a list including a data.frame similar to that in readtext
  - tm: a complex nested list
  - koRpus: another list format

A data.frame is probably the core object, maybe with the text field called text?
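A hedged sketch of what such a minimal corpus-as-data.frame might look like (the doc_id and text column names are an assumption, mirroring what readtext produces):

corp_df <- data.frame(
  doc_id    = c("1789-Washington", "2017-Trump"),
  text      = c("(full text of the 1789 address)", "(full text of the 2017 address)"),
  Year      = c(1789, 2017),
  President = c("Washington", "Trump"),
  stringsAsFactors = FALSE
)
# one row per document: the text column plus any document-level metadata
str(corp_df)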

Tokens

The basic structure should be a named list of characters, as per the tokenizers package. If you need a wrapper, then always write a coercion function to get back to the named list of characters. Example:

require(quanteda)

toks <- tokens(c(doc1 = "This is a sample: of tokens.",
                 doc2 = "Another sentence, to demonstrate how tokens works."))
str(toks)
# List of 2
# $ doc1: chr [1:8] "This" "is" "a" "sample" ...
# $ doc2: chr [1:9] "Another" "sentence" "," "to" ...
# - attr(*, "class")= chr [1:2] "tokens" "tokenizedTexts"
# - attr(*, "types")= chr [1:15] "This" "is" "a" "sample" ...
# - attr(*, "what")= chr "word"
# - attr(*, "ngrams")= int 1
# - attr(*, "concatenator")= chr ""
# - attr(*, "padding")= logi FALSE

str(unclass(toks))
# List of 2
# $ doc1: int [1:8] 1 2 3 4 5 6 7 8
# $ doc2: int [1:9] 9 10 11 12 13 14 7 15 8
# - attr(*, "types")= chr [1:15] "This" "is" "a" "sample" ...
# - attr(*, "what")= chr "word"
# - attr(*, "ngrams")= int 1
# - attr(*, "concatenator")= chr ""
# - attr(*, "padding")= logi FALSE

str(as.list(toks))
# List of 2
# $ doc1: chr [1:8] "This" "is" "a" "sample" ...
# $ doc2: chr [1:9] "Another" "sentence" "," "to" ...

and you can go back to tokens from a list of characters:

as.tokens(as.list(toks))
# tokens from 2 documents.
# doc1 :
# [1] "This"   "is"     "a"      "sample" ":"      "of"     "tokens" "."     
# 
# doc2 :
# [1] "Another"     "sentence"    ","           "to"          "demonstrate" "how"         "tokens"      "works"       "."          

Document-term matrix format

Here I vote for using the excellent Matrix package, with the dgCMatrix as the default format. It would be great if there were a diCMatrix (integer format) but that seems to be only a development aspiration at the moment.

But it is also easy to coerce this:

as.matrix(data_dfm_LBGexample[1:5, 1:10])
#     features
# docs A B  C  D  E  F   G   H   I   J
#   R1 2 3 10 22 45 78 115 146 158 146
#   R2 0 0  0  0  0  2   3  10  22  45
#   R3 0 0  0  0  0  0   0   0   0   0
#   R4 0 0  0  0  0  0   0   0   0   0
#   R5 0 0  0  0  0  0   0   0   0   0

and the more explicit convert() can convert a quanteda dfm to other formats, e.g. for the lda, tm, stm, austin, topicmodels, and lsa packages.
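For example, a sketch against the dev version shown above (the exact set of conversion targets may differ by version):

# coerce the quanteda dfm into the format expected by the tm package ...
dtm_tm <- convert(data_dfm_inaugural, to = "tm")

# ... and into the input format expected by the topicmodels package
dtm_topicmodels <- convert(data_dfm_inaugural, to = "topicmodels")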

kbenoit commented 7 years ago

Analytic functions

I like the idea of making these as close as possible to existing R model objects, e.g. lm() and glm(), with methods for predict, coefficients, etc. wherever appropriate.

See

> methods(class = "lm")
 [1] add1           alias          anova          case.names     coerce         confint        cooks.distance deviance       dfbeta        
[10] dfbetas        drop1          dummy.coef     effects        extractAIC     family         formula        hatvalues      influence     
[19] initialize     kappa          labels         logLik         model.frame    model.matrix   nobs           plot           predict       
[28] print          proj           qr             residuals      rstandard      rstudent       show           simulate       slotsFromS3   
[37] summary        variable.names vcov 
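A minimal sketch of that pattern for an invented text model class (every name below is made up for illustration, not an existing API):

# a toy "text model" constructor following the lm()-style conventions
textmodel_demo <- function(x, ...) {
  fit <- list(coefficients = colSums(as.matrix(x)), call = match.call())
  class(fit) <- "textmodel_demo"
  fit
}

# the familiar extractor and prediction methods
coef.textmodel_demo    <- function(object, ...) object$coefficients
predict.textmodel_demo <- function(object, newdata, ...)
  as.matrix(newdata) %*% object$coefficients
print.textmodel_demo   <- function(x, ...) {
  cat("Demo text model with", length(coef(x)), "feature coefficients\n")
  invisible(x)
}

methods(class = "textmodel_demo") would then list coef, predict, and print, just as methods(class = "lm") does above.
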
kbenoit commented 7 years ago

String handling

My vote here is for the excellent stringi package. Its functions really ought to replace the equivalent functions in base R, since they are smarter and faster than their base counterparts, and fully Unicode compliant.

The regular expression syntax is slightly different and I look forward to learning a bit more about that at the workshop.

And all character objects ought to be UTF-8!
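A small sketch of the sort of Unicode-aware handling meant here, using existing stringi functions (stri_trans_tolower(), stri_split_boundaries(), stri_enc_toutf8()):

require(stringi)

# Unicode-aware case mapping, not just ASCII a-z
stri_trans_tolower("GROSSE STRASSE, NAÏVE")
# [1] "grosse strasse, naïve"

# word segmentation via ICU break iterators rather than ad hoc regexes
stri_split_boundaries("Text interoperability, naïvely tested.",
                      type = "word", skip_word_none = TRUE)

# normalise incoming character data to UTF-8
txt <- stri_enc_toutf8("text that may have arrived in another encoding")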

kasperwelbers commented 7 years ago

I heartily add my vote for the Matrix and stringi packages as the basis for DTMs and string handling, respectively!

I like the idea of canonical core data. There will probably always be reasons to create different corpus objects, but if we can decide on core objects that are broad enough to encompass our specific corpus implementations, then each individual package would only need to be able to convert from and to the core object (notwithstanding the possibility of adding specific one-to-one conversions where these are more efficient).

So, would it be worthwhile to consider creating an R package with canonical core data objects, a nifty convert function and some good ideas for class inheritance?

dselivanov commented 7 years ago

Just some quick notes; I will write a detailed post later.

There should be different core structures for different levels. I mostly agree with @kbenoit about levels:

  1. raw input
  2. tokens
  3. document-term matrix and other matrices for vector space

Some thoughts :

  1. Raw input should not necessarily contain the text itself. This is important for out-of-core (streaming) computations. I believe the core data structure for this stage should mainly contain metadata, with the content itself optional.
  2. Tokens could be extended to capture annotations.
  3. While dgCMatrix is the main workhorse in Matrix, all the packages first create the matrix in coordinate (triplet) format. dgCMatrix is good for computations, but conversion from the coordinate dgTMatrix to other formats is cheaper. For example, I have recently been using dgRMatrix more actively (see the sketch below). I have thought about this for a while, but I am still not sure what the best approach is.
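A sketch of those coercions with the Matrix package (TsparseMatrix, CsparseMatrix, and RsparseMatrix are the virtual classes Matrix uses as coercion targets):

require(Matrix)

# a tiny document-term matrix; sparseMatrix() returns the column-compressed
# dgCMatrix by default
m <- sparseMatrix(i = c(1, 1, 2, 3), j = c(1, 3, 2, 3), x = c(2, 1, 5, 1),
                  dimnames = list(paste0("doc", 1:3), c("text", "data", "r")))

# coordinate (triplet) form: cheap to build and to convert onward
m_triplet <- as(m, "TsparseMatrix")   # a dgTMatrix

# row-compressed form, handy for row-wise (per-document) operations
m_row <- as(m, "RsparseMatrix")       # a dgRMatrix

# and back to the computation-friendly column-compressed form
m_col <- as(m_triplet, "CsparseMatrix")
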
statsmaths commented 7 years ago

I generally agree with the data structures listed by @dselivanov. Some thoughts on these:

The specific variables for these objects are, I think, still an open question, but I personally think that talking about schemas rather than objects is likely the right approach.

kbenoit commented 7 years ago

One alternative to one-size-fits-all would be:

Define an interoperability format for each core object type that each package can use for I/O. Each package can provide coercion functions to that format, e.g. our hashed tokens object becomes a list identical to the return value of tokenizers::tokenize_words() via quanteda::as.list(quanteda_tokens_object).
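A sketch of that round trip between tokenizers and quanteda (assuming the dev-version behaviour described above):

require(tokenizers)
require(quanteda)

txt <- c(doc1 = "This is a sample: of tokens.",
         doc2 = "Another sentence, to demonstrate how tokens works.")

# the interchange format: a plain named list of character vectors
toks_list <- tokenize_words(txt)
names(toks_list) <- names(txt)   # make sure document names travel with the list

# into quanteda's hashed tokens object, and back out to the plain list
toks_q <- as.tokens(toks_list)
str(as.list(toks_q))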

unDocUMeantIt commented 7 years ago

Define an interoperability format for each core object type that each package can use for I/O. Each package can provide coercion functions to that format.

i am all for this approach. moving away from S4 classes and methods within koRpus would result in a complete re-write of the package -- i hardly consider this an option ;-)

it is also strategically the wiser move to define intermediate formats. for example, coercion functions can be implemented by someone who is not even the original package author and released as a new package. that can be done for any kind of package that is of interest to anyone in the community.

it would be great to have a collection of example data and the expected results in such a format, so that package authors can validate their implementation.

statsmaths commented 7 years ago

I agree that an interoperability format would be another option. There is one extant option for representing annotated text that has a modest amount of popularity within computational linguistics: CoNLL-X/U. It's absolutely awful for actually doing an analysis directly given all the type overriding, but could work for moving data in between packages (that was its original intention anyway).
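For reference, a few CoNLL-U-style lines and a sketch of pulling them into a plain data.frame (the ten column names follow the CoNLL-U specification; the annotation itself is made up for illustration, not the output of any particular parser):

conllu <- "1\tThis\tthis\tPRON\tDT\t_\t4\tnsubj\t_\t_
2\tis\tbe\tAUX\tVBZ\t_\t4\tcop\t_\t_
3\ta\ta\tDET\tDT\t_\t4\tdet\t_\t_
4\tsample\tsample\tNOUN\tNN\t_\t0\troot\t_\t_"

ann <- read.delim(text = conllu, header = FALSE,
                  col.names = c("id", "form", "lemma", "upos", "xpos",
                                "feats", "head", "deprel", "deps", "misc"),
                  stringsAsFactors = FALSE)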

unDocUMeantIt commented 7 years ago

There is one extant option for representing annotated text that has a modest amount of popularity within computational linguistics: CoNLL-X/U.

looks interesting. the structure is not so different from the data.frames that koRpus already carries in its tokenized text objects.

this opens the box even further -- interoperability not only between R packages, but all kinds of applications in general (e.g., TCF).

kbenoit commented 7 years ago

Ummm I was thinking more of R formats as @dselivanov was listing above. e.g. tokens = named list, dfm = Matrix object, etc. The idea is to keep it as unencumbered as possible with additional structure - that's what makes it interoperable.

On @unDocUMeantIt's point on S4: Complicated objects are supposed to make life easier, not more complicated, but this only happens if there are methods (print, str, [, [[, etc.) for getting stuff in and out as if they were much simpler objects, plus as.simpleobject() methods for converting them. As per my tokens example above, if you study it you'll realise there is something deep and slightly strange about a quanteda tokens object, but it's almost invisible to the user. S4 is great as long as you define methods so that the user never needs to access the slots directly.
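A minimal S4 sketch of that idea, with an invented class purely for illustration:

# a toy S4 tokens container that hides its slots behind familiar methods
setClass("toyTokens", slots = c(tokens = "list", lang = "character"))

# "[" subsets documents but returns the same class, metadata intact
setMethod("[", "toyTokens", function(x, i, j, ..., drop = TRUE) {
  new("toyTokens", tokens = x@tokens[i], lang = x@lang)
})

# as.list() is the coercion back to the simple core object
setMethod("as.list", "toyTokens", function(x, ...) x@tokens)

tt <- new("toyTokens",
          tokens = list(doc1 = c("This", "is", "a", "test"),
                        doc2 = c("Another", "one")),
          lang = "en")
as.list(tt[1])   # a plain named list again, with only doc1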

kbenoit commented 7 years ago

I could add that at one point we even redefined str.corpus to prevent users from nosing around inside their metaphorical iPhone cases. We would have kept it, except that it led to a lot of other strange side effects from functions that rely on str.

unDocUMeantIt commented 7 years ago

S4 is great as long as you define methods so that the user never needs to access the slots directly.

absolutely. there are various getter and setter methods in koRpus, too. for example, you get the data.frame with the tagged text i mentioned earlier when you call taggedText() on that object. the nice thing is that the objects include their own metadata, for example the language of the text, so you can simply call hyphen() on the same object to get it hyphenated, and it will automatically pick the correct hyphenation patterns for the respective language.

unDocUMeantIt commented 7 years ago

and you can also update the data.frame via taggedText() <-

statsmaths commented 7 years ago

Ummm I was thinking more of R formats as @dselivanov was listing above. e.g. tokens = named list, dfm = Matrix object, etc. The idea is to keep it as unencumbered as possible with additional structure - that's what makes it interoperable.

I see. Minimalism is a good thing to strive for, though I often work with text data that cannot be stored in memory and is distributed over a cluster. Ideally, I'd like a format that can be saved as plain text (JSON, CSV, XML), so that it is easy to save state and load it back again. R can save binary objects, of course, but those don't play well with distributed file systems like Hadoop, nor do they mix well with other languages (another thing that comes up for me).

However, most basic R objects have a plain-text equivalent. Named lists are essentially in a JSON format. Data frames and sparse matrices can be stored as CSV. So I'm not sure these two things are really at odds with one another.
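A sketch of that equivalence, assuming jsonlite for the JSON side and the triplet view from Matrix::summary() for the sparse matrix side:

require(jsonlite)
require(Matrix)

# a named list of tokens round-trips through JSON
toks <- list(doc1 = c("This", "is", "a", "test"),
             doc2 = c("Another", "sentence"))
json <- toJSON(toks)
str(fromJSON(json))   # back to a named list of character vectors

# a sparse document-term matrix as a plain-text (i, j, x) triplet table
m <- sparseMatrix(i = c(1, 2, 2), j = c(1, 2, 3), x = c(2, 1, 4))
write.csv(as.data.frame(summary(m)), "dtm_triplets.csv", row.names = FALSE)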

kbenoit commented 7 years ago

@statsmaths This would make a good new issue: how to store, and possibly process, large corpora out of memory. It could include issues related to compression. @patperry and @dselivanov have experience with this. I can pull texts from a back-end database, but they still need to fit in memory. (So far 32GB has been more than enough to handle the texts I work with.)

patperry commented 7 years ago

In my own package (not public yet, but hopefully will be before the workshop), I ended up defining a new type on the C side, something like the following:

struct text {
    uint8_t *ptr; // UTF-8 encoded data, possibly with JSON-style backslash (\) escapes
    unsigned int has_esc : 1; // flag indicating whether to interpret '\' as an escape
    unsigned int is_utf8 : 1; // flag indicating whether the text may decode to non-ASCII
    uint64_t size : 62; // size of the encoded data, in bytes
};

The important difference from the R character type is the has_esc flag. The presence of this flag allows you to have a text object referring to data stored in a file, without decoding the string and storing it in RAM. (I mmap the file and then let the operating system deal with moving the data between the file and memory whenever necessary.) You can process a multi-gigabyte JSON file transparently, without loading the whole thing into RAM.

The drawback of using a struct text instead of an R character is that you have to convert to an R character whenever you want to interface to another package like stringi, which kills the efficiency and memory advantages. To get around this, I ended up implementing my own Unicode normalization, case folding, and segmentation.

gagolews commented 7 years ago

What about is_utf8 vs. is_ascii?

patperry commented 7 years ago

The advantage of having the flag turn on for non-ASCII is that when you are scanning text from a JSON file, you can start with all flags off. As soon as you see an escape code, the escape flag gets set; as soon as you see a non-ASCII character, the utf8 flag gets set. So, you can do something like the following:

flags = 0
for (ch, ch_flags) in input:
    flags |= ch_flags

where the for loop iterates over all input characters, with ch being the input character, and ch_flags being the flags (has_esc and is_utf8) for that character.
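An R-level analogue of the same accumulate-the-flags idea (not @patperry's actual implementation; stri_detect_fixed() and stri_enc_isascii() are existing stringi functions):

require(stringi)

lines <- c("plain ascii text", "a JSON-style \\n escape", "naïve café")

# each flag switches on as soon as any element needs it, as in the loop above
has_esc <- any(stri_detect_fixed(lines, "\\"))
is_utf8 <- any(!stri_enc_isascii(lines))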

unDocUMeantIt commented 7 years ago

and you can also update the data.frame via taggedText() <-

as a side note, thanks to this thread i've added [, [<-, [[, and [[<- methods for all main text object classes in the koRpus package, so you can now treat them like a data.frame.