ropensci / textworkshop17

Text Workshop at the London School of Economics, April 2017

Proposal for Interoperability Formats #14

Open statsmaths opened 7 years ago

statsmaths commented 7 years ago

We are proposing three formats for interoperability of text data between packages.

Corpus - a normal data frame with S3 class equal to c("corpus", "data.frame"). It has no rownames and has at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding; document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each document is represented by a single string in text, i.e. one row of the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.
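
For illustration, a minimal corpus data frame meeting this spec might look like this (document contents are made up):

corpus <- data.frame(
    doc_id = c("doc1", "doc2"),              # unique document ids
    text = c("A first document.", "A second document."),
    stringsAsFactors = FALSE                  # keep doc_id and text as character vectors
)
class(corpus) <- c("corpus", "data.frame")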

Document Term Matrix - should be a sparse numeric matrix with document ids as rownames and terms as column names, both of which are character vectors. There is one element in the row names for each document and one element in the column names for each term. Document ids and terms must be unique. We suggest the dgCMatrix class from the Matrix package.
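
For illustration, such a matrix could be built with Matrix::sparseMatrix() (toy counts and made-up terms):

library(Matrix)

dtm <- sparseMatrix(
    i = c(1, 1, 2, 2), j = c(1, 2, 2, 3), x = c(2, 1, 3, 1),
    dimnames = list(c("doc1", "doc2"), c("term1", "term2", "term3"))
)
class(dtm)  # "dgCMatrix"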

Tokens/Annotation - a normal data frame with S3 class equal to c("token", "data.frame"). We define a token to be a single element of a character vector. We propose representing these as a data frame. The first column of this data frame is called doc_id and is a character vector with UTF-8 encoding. The second column is called token_index and is an integer vector, with values starting at 1. The third column is called token and is a UTF-8 encoded character vector. Additional annotations can be provided as other columns. We suggest the following names and data types for common annotations: pos (character vector; UTF-8), lemma (character vector; UTF-8), and sentence_id (integer).
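
For illustration, a minimal tokens data frame with one optional annotation column might look like this (made-up values):

tokens_df <- data.frame(
    doc_id = c("doc1", "doc1", "doc1"),
    token_index = 1:3,
    token = c("A", "first", "document"),
    sentence_id = c(1L, 1L, 1L),   # optional annotation column
    stringsAsFactors = FALSE
)
class(tokens_df) <- c("token", "data.frame")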

lmullen commented 7 years ago

Should the S3 class for the tokens be tokens not token, since there will always be plural tokens?

statsmaths commented 7 years ago

I'm okay with that. It may even have been a mis-communication about what it was supposed to be.

patperry commented 7 years ago

I'd vote for removing the S3 classes so that we don't run into havoc when two different packages implement summary.corpus or print.corpus.

statsmaths commented 7 years ago

Things mentioned at the workshop (I think the subgroup confirmed that these all make sense)

patperry commented 7 years ago

From a usability standpoint, it seems important to me that a function like tokens that is expecting a corpus argument should also accept a character vector, so that you can do something like:

tokens(c("A test sentence.", "Another test sentence."))

This might not matter for power users, but for beginners, I'd rather not force them to create a data frame and do something like

tokens(data.frame(doc_id=1:2, text=c("Sentence 1.", "Sentence 2.")))

My personal preference would be for functions like tokens that expect corpus objects to do something like the following:

tokens <- function(x) {
    if (!is.character(x)) {
        ids <- names(x)
        text <- as.character(x) # this will drop the names
    } else {
        ids <- names(x)
        text <- x
    }
    ...
}

A side benefit of this approach is that it also supports the "metadata only" corpus objects like @dselivanov described: the as.character function is an S3 generic that special objects can overload. (You can make your function memory-aware by breaking x into smaller chunks and then calling as.character once per chunk.)

I'm not crazy about having a data frame with a special "text" field. For many data sets this makes sense, but not for all. I have a product review data set where each review has three fields that could be useful as text: title, summary, and body. Should I be forced to make three separate corpus objects? I'd rather store all of the reviews in a single data frame, and then be able to make calls like

text_summary(review$title)
text_summary(review$body)

where text_summary is some function that is expecting a corpus.

Unfortunately, if you take the stance that I am arguing for, it has implications for the tokens interchange format. If you call

tokens(x)

then you need to be able to match the outputs to the inputs. The most natural way (to me) to do this is to have the output be a list with the same length as x. In cases where x does not have names, there won't be doc_ids.


My recommendation: a corpus should be any object that overloads is.character, as.character, and names; ideally, it should also support length and subsetting (x[i]), so that other functions can process the corpus in chunks rather than all at once.

kbenoit commented 7 years ago

Of course you can define methods for tokens for any class of object, but I think:

statsmaths commented 7 years ago

@patperry I totally agree that packages should aim to also accept raw character vectors as a representation of a corpus. It is a very natural way to store a corpus, both for new users and in cases like your example where the corpus has several different text components.

I did not think we were suggesting that the text interchange format be the only way that packages should accept inputs. I thought of it as the way that packages share data amongst one another... That is, packages should all (1) minimally accept these formats and (2) provide coercion methods into these formats. There is nothing stopping them from supporting other inputs nor should they feel forced to give different outputs if they so choose.

I spoke with @lmullen about the formats regarding the tokenizers package yesterday. The idea that I think we agreed on is that he would modify the package so that users could optionally input a corpus data frame and optionally request (default = FALSE) a tokens data frame.

Now, I suppose as an alternative we could instead define multiple interchange formats for a given object as long as each is easily distinguishable (i.e., one is a vector and the other a list). In this case the natural thing to do would be to let a corpus alternatively be defined as you described: anything coercible to a (possibly named) character vector. Also, tokens could alternatively be described as a (possibly named) list of character vectors. If we had these alternatives, the tif package could provide canonical interchanges between these formats along the lines of what @kbenoit was suggesting. That is, something that takes either format and returns whichever one a package developer wishes to work with.

Does having two formats seem reasonable? I can try to write those up if that seems like a good compromise. There definitely were two competing philosophies (data frames vs. list/vector) and having both should allow developers and users to use whichever format they prefer.
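
For concreteness, the two alternative (vector/list) representations would look roughly like this (made-up documents):

# corpus as a (possibly named) character vector
corpus_vec <- c(doc1 = "A first document.", doc2 = "A second document.")

# tokens as a (possibly named) list of character vectors
tokens_list <- list(doc1 = c("A", "first", "document"),
                    doc2 = c("A", "second", "document"))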

kbenoit commented 7 years ago

Sounds good. I'm proposing that if we do that, then the tokenizers methods be (e.g.):

tokenize_words <- function(x, ...) {
    UseMethod("tokenize_words")
}

tokenize_words.tif_corpus <- function(x, ...) {
    tokenize_words(tif::as.character(x))
}

tokenize_words.character <- function(x, ...) {
    # existing code for tokenize_words()
}

The alternative, which keeps things more as they currently are, is to make the user call

tokenize_words(tif::as.character(x))

This is exactly what we have done in quanteda to make it work with readtext, and what we are proposing here for interoperability is basically the same solution. See https://github.com/kbenoit/quanteda/blob/master/R/readtext-methods.R. (Here you can also see how we kept the help index and man pages for the functions in this file from filling up with ugly and repetitive S3 methods: the default function has a signature identical to those of all methods, and the method headers use @noRd.)
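
For anyone unfamiliar with that roxygen pattern, a rough sketch (illustrative only, not the actual quanteda code):

#' Tokenize words
#'
#' @param x a tif corpus or character vector
#' @param ... additional arguments passed to methods
#' @export
tokenize_words <- function(x, ...) {
    UseMethod("tokenize_words")
}

#' @noRd
#' @export
tokenize_words.character <- function(x, ...) {
    # actual tokenization code; same signature as the generic,
    # and @noRd keeps the method out of the help index
}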

statsmaths commented 7 years ago

Great, I'm glad the general idea makes sense. I think one issue with your specific suggestion is that it would force users to attach the class tif_corpus to the input, and there was a very strong vote against that yesterday (and I agree now, too).

I envisioned implementing this by defining functions tif_as_corpus_df and tif_as_corpus_character. Both would accept either format but would return the specified type. For example, the tokenize_words function would look something like this:

tokenize_words <- function(x, ...) {
  text <- tif_as_corpus_character(x)
  # existing code
}

This way users can input whichever format they want, and package maintainers can work internally in only the format they choose.
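
As a rough sketch (not a final implementation), tif_as_corpus_character might do something like:

tif_as_corpus_character <- function(x) {
    if (is.data.frame(x)) {
        # corpus data frame -> named character vector
        out <- as.character(x$text)
        names(out) <- x$doc_id
        out
    } else if (is.character(x)) {
        x
    } else {
        stop("cannot coerce object to a tif corpus")
    }
}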

kbenoit commented 7 years ago

As one of the vocal opponents of classing the interchange objects, I agree with that!

Would tif_as_corpus_character() be in package tif then?

statsmaths commented 7 years ago

Would tif_as_corpus_character() be in package tif then?

Yes, exactly. I'll write up a proposal for these conversion functions today and put them into tif.

lmullen commented 7 years ago

A little late to the discussion, but I agree with the conclusion you worked toward. If we're not going to class the corpus objects, then instead of redefining the tokenizers functions as S3 methods I will just coerce those inputs in the existing functions.

I also agree that these coercion functions should go in the tif package, with maybe any edge cases handled by individual packages. So that will necessitate a change from what I said earlier: I think that I will not provide an argument to tokenization functions to return output as a data frame in the interchange format. Users will get back a list of tokens (which is basically an interchange format already), and if they want to convert it to the interchange data frame they can do so explicitly.

kbenoit commented 7 years ago

How about:

statsmaths commented 7 years ago

I was thinking that tif would contain the three functions it currently does, but modified to include the alternative corpus and token types:

As well as these four conversion functions:

These could have a non-strict option for trying to coerce bad objects (i.e., incorrect names or encoding) with a warning. In order to be tif compliant, packages just have to do the following two things:

  1. If using user-input corpus, token, or dtm objects, accept all compliant tif input formats. With the conversion functions, the multiple alternative formats should not be a difficult hurdle to overcome. Packages are free to accept other formats as well.
  2. Either return any corpus, token, dtm object in any tif-compliant format or provide a conversion function from a native type into one of the tif-compliant types.

This gets around the issue of how to tell a corpus data frame from a tokens data frame. It also encourages packages to work entirely with unclassed, tif-compliant objects whenever possible, at least at the user level. While we should not enforce this, I think it is a good practice when there is no particular reason to require customized classes. It also relieves package maintainers of having to write a series of exported conversion functions.
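
As a very rough sketch of the kind of check one of these validators might perform (hypothetical name and rules, for illustration only):

tif_check_corpus_df <- function(x) {
    # TRUE only if x looks like a tif corpus data frame
    is.data.frame(x) &&
        ncol(x) >= 2 &&
        all(names(x)[1:2] == c("doc_id", "text")) &&
        is.character(x$doc_id) &&
        is.character(x$text) &&
        !any(duplicated(x$doc_id))
}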

kbenoit commented 7 years ago

I like the function suite proposal, with the small modification to call the first three tif_validate_corpus, tif_validate_dtm, tif_validate_tokens.

A "character-based corpus" here means just a named character object?

statsmaths commented 7 years ago

Yes, that is a good suggestion. I will change the names to be of the form tif_validate_*.

And I was thinking of a character-based corpus being just a named character object. I suppose that wording is a bit vague out of context. Would using the word vector be better? So we would have tif_as_corpus_vector and tif_as_tokens_vector instead?

lmullen commented 7 years ago

Yes, I think calling it something like a named character vector where the names are unique document IDs would be clear.

Should tif support a named list where each element is a character vector of length one containing a document and the names are the document ids? That seems like a natural analog to a named list of tokens. (Tokenizers accepts that as an input.) I don't want to proliferate formats. But at a minimum, the tif_as_corpus_* functions should accept that as an input to be converted.

patperry commented 7 years ago

I like the general idea, but I would suggest adopting the conventions of the base R functions.

For checking if something is a valid object:

These return TRUE/FALSE and they do not emit warnings or errors.

For converting:

These emit errors if the input is invalid. I think you can do without the validate functions. Just let the conversion routines do all the validation and throw an error if necessary (they will have to do this anyway).

I don't feel strongly about this, but I'd leave the tif_ prefix off the function names. Packages worried about namespace conflicts can use tif::as_corpus_vector. (I've moved to the explicit namespace convention more and more in my teaching to make clear to students which packages define which functions.)

Notice also that I've left out the dtm functions. Can't we just use as(x, "dgCMatrix") for that?

dselivanov commented 7 years ago

+1 to @patperry for naming.

kbenoit commented 7 years ago

I'd vote instead for a named character (vector) where the names are document identifiers and enforced to be unique. No loss of information relative to the list of named length-one characters, and it is simpler to work with. tokenizers and stringi functions already work fine with this structure.
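
For example, the list form collapses to that named character vector in one step:

docs <- list(doc1 = "A first document.", doc2 = "A second document.")
corpus_vec <- unlist(docs)  # named character vector; the names become the doc ids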

kbenoit commented 7 years ago

I like @patperry's naming suggestions too. I could live without the tif_ prefixes. But I suggest as_corpus_character ... instead of _vector (since everything is a vector).

We still need the functions for the dtm, however, since there should be some enforcement of the structure: document and term names must be unique, and possibly (my weak preference) the dimensions should also be named "documents" and "terms", which, if we adopt names, would be standard. The elements of these names (the doc ids and the term labels) can be anything the user sets, but we would also set a default (see below) if for some reason they are NULL.

Something like:

library(Matrix)

m <- matrix(1:20, nrow = 4)

as(Matrix(m, dimnames = list(documents = paste0("d", seq_len(nrow(m))), 
                             terms = paste0("term", seq_len(ncol(m))))), 
   "dgCMatrix")
## 4 x 5 sparse Matrix of class "dgCMatrix"
##          terms
## documents term1 term2 term3 term4 term5
##        d1     1     5     9    13    17
##        d2     2     6    10    14    18
##        d3     3     7    11    15    19
##        d4     4     8    12    16    20
patperry commented 7 years ago

Would the following interface be acceptable for the "data frames" people?

tokens <- function(x, ids = names(x)) {
    text <- as.character(x)
    ans <- stringi::stri_split_boundaries(text, type="word")
    names(ans) <- ids
    ans
}

This function will accept a few formats:

# 1. character vector without names

tokens(c("The first sentence.", "The next!"))
## [[1]]
## [1] "The"      " "        "first"    " "        "sentence" "."       
##
## [[2]]
## [1] "The"  " "    "next" "!"   

# 2. character vector with names

tokens(c(a="The first sentence.", b="The next!"))
## $a
## [1] "The"      " "        "first"    " "        "sentence" "."       
##
## $b
## [1] "The"  " "    "next" "!"  

# 3. list without names

tokens(list("The first sentence.", "The next!"))

# 4. list with names

tokens(list(a="The first sentence.", b="The next!"))

# 5. data frame (with explicit argument passing)

df <- data.frame(doc_id=c("a", "b"), text=c("The first sentence.", "The next!"))
with(df, tokens(text, doc_id))

# 6. custom object

x <- structure(c(apache2="http://www.apache.org/licenses/LICENSE-2.0.txt",
                 gpl3="https://www.gnu.org/licenses/gpl-3.0.txt"),
               class="urlcorpus")

as.character.urlcorpus <- function(x) {
    sapply(x, function(addr) paste(readLines(url(addr)), collapse="\n"))
}

tokens(x) # this will download the files and tokenize them
statsmaths commented 7 years ago

Thanks for all of the feedback. I think this has been really helpful and the format is greatly improved from our initial proposal.

I have consolidated the recent proposals and implemented them in version 0.2.0 of the tif package: https://github.com/ropensci/tif. This includes the new naming conventions, multiple tokens/corpus object types, and new coercion functions.

The only change I can think of here that I did not make (yet, at least) is removing the tif_ prefix; there was strong support for this amongst some at the workshop and it is recommended by the rOpenSci guidelines. Like @patperry, I tend to use the explicit tif:: form, so it doesn't affect me as much, but I'm just trying to play nicely with those published standards.

Unless there are any major roadblocks, can I suggest offloading this discussion to the tif repo issues page at this point? It would be easier, I think, if we could open individual issues for any particular proposed changes or bugs. We can post back here when we feel that the format has stabilized enough to start integrating it into our packages.

kbenoit commented 7 years ago

Excellent, thanks. When done with my workshop(s) this weekend I will be happy to review and contribute code.