quanteda / spacyr

R wrapper to spaCy NLP
http://spacyr.quanteda.io
251 stars 38 forks

spacyr wishlist #109

Open amatsuo opened 6 years ago

amatsuo commented 6 years ago

Hello @kbenoit and other users of spacyr

This is a comprehensive wishlist of spacyr updates, inspired by our discussion with @honnibal and @ines. We will implement some of them in the future, but is there anything you are particularly interested in?

Something likely to be implemented

Something nice to have but not sure how many users need it

Just a wish

aourednik commented 6 years ago

Having a "just tokenization" option with lemmatization would be great. Currently trying to use

parsed <- spacy_parse(my_corpus, pos=FALSE, entity=FALSE, dependency=FALSE)
parsed$token <- parsed$lemma
my_tokens <- as.tokens(parsed)

The first line causes a memory overload on a large my_corpus, while tokens(my_corpus) is fast, with no memory problem. I don't know to what extent this is due to spaCy's inherent memory use, though.

Could spacyr somehow be included as an option in the tokens() function? Something like this:


my_tokens <- tokens(txtc,
  what = "word",
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_separators = TRUE,
  remove_symbols = TRUE,
  include_docvars = TRUE,
  lemmatize = "spacy_parse"
)
kbenoit commented 6 years ago

Not a bad idea. @amatsuo maybe add:

spacy_tokenize(x, what = c("word", "sentence"),
  remove_numbers = FALSE, remove_punct = FALSE,
  remove_symbols = FALSE, remove_separators = TRUE,
  remove_twitter = FALSE, remove_hyphens = FALSE,
  remove_url = FALSE, value = c("list", "data.frame"))

where the last one returns one of the two TIF formats for tokens? This keeps the interface as close to quanteda::tokens() as possible, and spacy_tokenize(x, value = "list") %>% as.tokens() provides the option of going straight to a quanteda tokens class using the spaCy tokeniser.

We could also add a new sentence option to spacy_parse() that, when FALSE, would remove the sentence_id return field and number tokens consecutively within each document. So if all options are FALSE, it's the same as spacy_tokenize(x, what = "word", value = "data.frame") -- indeed, that function could call this version of spacy_parse().
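
For illustration, a rough sketch of how that could look from the user's side, assuming the proposed spacy_tokenize() signature above (the toy texts and the explicit magrittr import are only for this example):

library(spacyr)
library(quanteda)
library(magrittr)   # for %>%
spacy_initialize()
# toy example texts, purely illustrative
txt <- c(doc1 = "This is an example.", doc2 = "Another short document.")
# tokenize with spaCy and go straight to a quanteda tokens object
toks <- spacy_tokenize(txt, what = "word", remove_punct = TRUE, value = "list") %>%
  as.tokens()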

dmklotz commented 6 years ago

Definitely would be interested in noun phrase extractions.

amatsuo commented 6 years ago

Hi @dmklotz

I opened an issue for noun-phrase extraction (#117). Please provide your thoughts there.

amatsuo commented 6 years ago

@aourednik and @kbenoit

I have implemented spacy_tokenize in the tokenize-function branch. Please try it out and give me some feedback.

Some options are left out: remove_symbols, remove_hyphens, remove_twitter. In my opinion, these options are about text pre-processing before handing texts to spaCy NLP. At the moment, spacyr does not import stringi, and I don't see much reason to use gsub() in 2018 for potentially large-scale text processing.

cecilialee commented 6 years ago

Is it possible to train a new model with spacyr at the moment?

kbenoit commented 6 years ago

@cecilialee No, training a new language model would need to be done in Python, following the spaCy instructions. We are unlikely to add this facility to spacyr in the foreseeable future.

cecilialee commented 6 years ago

@kbenoit Sure. Then if I've trained a model with Python, how can I use (initialize) that model with spacyr?

amatsuo commented 6 years ago

@cecilialee

The model argument of spacy_initialize() is handed to the model name argument of spacy.load('**'). So you should be able to use the name of the model you saved in Python when you call spacy_initialize().
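
For example, something along these lines should work (a minimal sketch; the path below is only a placeholder for wherever you saved the model from Python):

library(spacyr)
# placeholder path: point this at the directory containing the saved model
spacy_initialize(model = "/path/to/your_saved_model")
parsed <- spacy_parse("A short sentence to check that the custom model loads.")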

aourednik commented 6 years ago

@amatsuo Is there a simple way to install the full tokenize-function branch version of spacyr in R?

kbenoit commented 6 years ago

@aourednik that would be

devtools::install_github("quanteda/spacyr", ref = "tokenize-function")
aourednik commented 6 years ago

Great, thanks for these developments! By the way, this has more to do with quanteda in general than with spacyr, but since we are speaking of lemmatization, I was wondering whether it would be feasible to implement a udpipe lemmatizer in the tokens() function? Or something like udpipe_tokenize() taking a quanteda corpus as argument and returning lemmatized tokens? UDPipe is reported to perform better, though slower, lemmatization for French, Italian and Spanish than spaCy. For now, I can get a list of lists of tokens like this (below), but having a quanteda tokens object would allow me to remain within the quanteda framework.

library("udpipe")
# dl <- udpipe_download_model(language = "french") # necessary only when not yet downloaded
udmodel_french <- udpipe_load_model(file = "french-ud-2.0-170801.udpipe")
#txtc is my quanteda corpus
txtudpipetokens <- lapply(head(texts(txtc)), function(x) {
  udp <- udpipe_annotate(udmodel_french, x)
  return(as.data.table(udp)$lemma)
  }
) 

cf. https://github.com/bnosac/udpipe @amatsuo @jwijffels

kbenoit commented 6 years ago

Glad it's working for you! We should be finished with the integration of the tokenize-function branch next week. When that's completed, it will be very easy to use spacyr for tokenisation or lemmatisation.

On integration with udpipe, that's probably better done in that package. @jwijffels we'd be happy to assist with this.

aourednik commented 6 years ago

@amatsuo @kbenoit I have tried out:

devtools::install_github("quanteda/spacyr", ref = "tokenize-function")
parsed <- spacy_tokenize(corpus_sample(txtc,10))
#Error in spacy_tokenize(corpus_sample(txtc, 10)) : 
#  could not find function "spacy_tokenize"
source("https://raw.githubusercontent.com/quanteda/spacyr/tokenize-function/R/spacy_tokenize.R")
parsed <- spacy_tokenize(corpus_sample(txtc,10))
#Error in UseMethod("spacy_tokenize") : 
#  no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"
Session info -------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.423)           
 language en_US                       
 collate  en_US.UTF-8                 
 tz       Europe/Zurich               
 date     2018-08-30                  

Packages -----------------------------------------------------------------------------------------------------
 package      * version date       source                          
 assertthat     0.2.0   2017-04-11 CRAN (R 3.4.1)                  
 backports      1.1.2   2017-12-13 CRAN (R 3.4.3)                  
 base         * 3.4.4   2018-03-16 local                           
 base64enc      0.1-3   2015-07-28 CRAN (R 3.4.2)                  
 bindr          0.1.1   2018-03-13 CRAN (R 3.4.3)                  
 bindrcpp       0.2.2   2018-03-29 CRAN (R 3.4.4)                  
 checkmate      1.8.5   2017-10-24 CRAN (R 3.4.3)                  
 codetools      0.2-15  2016-10-05 CRAN (R 3.3.1)                  
 colorspace     1.3-2   2016-12-14 CRAN (R 3.4.0)                  
 compiler       3.4.4   2018-03-16 local                           
 crayon         1.3.4   2017-09-16 CRAN (R 3.4.3)                  
 curl           3.2     2018-03-28 CRAN (R 3.4.4)                  
 data.table   * 1.11.4  2018-05-27 CRAN (R 3.4.4)                  
 datasets     * 3.4.4   2018-03-16 local                           
 devtools       1.13.6  2018-06-27 CRAN (R 3.4.4)                  
 digest         0.6.16  2018-08-22 CRAN (R 3.4.4)                  
 doMC         * 1.3.5   2017-12-12 CRAN (R 3.4.3)                  
 dplyr          0.7.6   2018-06-29 CRAN (R 3.4.4)                  
 evaluate       0.11    2018-07-17 CRAN (R 3.4.4)                  
 fastmatch      1.1-1   2017-11-21 local                           
 forcats      * 0.3.0   2018-02-19 CRAN (R 3.4.4)                  
 foreach      * 1.4.4   2017-12-12 CRAN (R 3.4.3)                  
 ggplot2      * 3.0.0   2018-07-03 CRAN (R 3.4.4)                  
 git2r          0.23.0  2018-07-17 CRAN (R 3.4.4)                  
 glue           1.3.0   2018-07-17 CRAN (R 3.4.4)                  
 graphics     * 3.4.4   2018-03-16 local                           
 grDevices    * 3.4.4   2018-03-16 local                           
 grid           3.4.4   2018-03-16 local                           
 gtable         0.2.0   2016-02-26 CRAN (R 3.4.0)                  
 htmlTable    * 1.12    2018-05-26 CRAN (R 3.4.4)                  
 htmltools      0.3.6   2017-04-28 CRAN (R 3.4.2)                  
 htmlwidgets    1.2     2018-04-19 CRAN (R 3.4.4)                  
 httr           1.3.1   2017-08-20 CRAN (R 3.4.2)                  
 igraph       * 1.1.2   2017-07-21 CRAN (R 3.4.2)                  
 iterators    * 1.0.10  2018-07-13 CRAN (R 3.4.4)                  
 jsonlite       1.5     2017-06-01 CRAN (R 3.4.2)                  
 knitr          1.20    2018-02-20 CRAN (R 3.4.3)                  
 labeling       0.3     2014-08-23 CRAN (R 3.4.0)                  
 lattice        0.20-35 2017-03-25 CRAN (R 3.3.3)                  
 lazyeval       0.2.1   2017-10-29 CRAN (R 3.4.2)                  
 lubridate      1.7.4   2018-04-11 CRAN (R 3.4.4)                  
 magrittr       1.5     2014-11-22 CRAN (R 3.4.0)                  
 Matrix         1.2-14  2018-04-09 CRAN (R 3.4.4)                  
 memoise        1.1.0   2017-04-21 CRAN (R 3.4.3)                  
 methods      * 3.4.4   2018-03-16 local                           
 munsell        0.5.0   2018-06-12 CRAN (R 3.4.4)                  
 parallel     * 3.4.4   2018-03-16 local                           
 pillar         1.3.0   2018-07-14 CRAN (R 3.4.4)                  
 pkgconfig      2.0.2   2018-08-16 CRAN (R 3.4.4)                  
 plyr           1.8.4   2016-06-08 CRAN (R 3.4.0)                  
 purrr          0.2.5   2018-05-29 CRAN (R 3.4.4)                  
 qdapRegex      0.7.2   2017-04-09 CRAN (R 3.4.2)                  
 quanteda     * 1.3.4   2018-07-15 CRAN (R 3.4.4)                  
 R2HTML       * 2.3.2   2016-06-23 CRAN (R 3.4.3)                  
 R6             2.2.2   2017-06-17 CRAN (R 3.4.1)                  
 RColorBrewer   1.1-2   2014-12-07 CRAN (R 3.4.1)                  
 Rcpp           0.12.18 2018-07-23 CRAN (R 3.4.4)                  
 RcppParallel   4.4.1   2018-07-19 CRAN (R 3.4.4)                  
 readtext     * 0.71    2018-05-10 CRAN (R 3.4.4)                  
 rlang          0.2.2   2018-08-16 CRAN (R 3.4.4)                  
 rlist        * 0.4.6.1 2016-04-04 CRAN (R 3.4.4)                  
 rmarkdown      1.10    2018-06-11 CRAN (R 3.4.4)                  
 rprojroot      1.3-2   2018-01-03 CRAN (R 3.4.3)                  
 rstudioapi     0.7     2017-09-07 CRAN (R 3.4.3)                  
 scales       * 1.0.0   2018-08-09 CRAN (R 3.4.4)                  
 spacyr         0.9.91  2018-08-30 Github (quanteda/spacyr@240b6ef)
 stats        * 3.4.4   2018-03-16 local                           
 stopwords      0.9.0   2017-12-14 CRAN (R 3.4.3)                  
 stringi        1.2.4   2018-07-20 CRAN (R 3.4.4)                  
 stringr      * 1.3.1   2018-05-10 CRAN (R 3.4.4)                  
 textclean    * 0.9.3   2018-07-23 CRAN (R 3.4.4)                  
 tibble         1.4.2   2018-01-22 CRAN (R 3.4.3)                  
 tidyselect     0.2.4   2018-02-26 CRAN (R 3.4.3)                  
 tools          3.4.4   2018-03-16 local                           
 udpipe       * 0.6.1   2018-07-30 CRAN (R 3.4.4)                  
 utils        * 3.4.4   2018-03-16 local                           
 withr          2.1.2   2018-03-15 CRAN (R 3.4.4)                  
 yaml           2.2.0   2018-07-25 CRAN (R 3.4.4)  
amatsuo commented 6 years ago

It seems that you forgot to load the package with library(spacyr).

jwijffels commented 6 years ago

If you just want to get the lemmas in French using udpipe and put them into the quanteda corpus structure, I think it is just this (the example below takes only nouns & proper nouns).

library(udpipe)
library(quanteda)
udmodel <- udpipe_load_model("french-ud-2.0-170801.udpipe")
## assuming that txtc is a quanteda corpus
x <- udpipe_annotate(udmodel, x = texts(txtc), doc_id = docnames(txtc), parser = "none")
x <- as.data.frame(x)
x <- subset(x, upos %in% c('NOUN', 'PROPN'))
txtc$tokens <- split(x$lemma, x$doc_id)

Why do you think such code would have to be put into the udpipe R package?

aourednik commented 6 years ago

@amatsuo Yes, my mistake, I forgot to reload the package; the first error was due to this, sorry. Now I am getting only the second error on my machine (same session info as before):

> class(txtc)
[1] "corpus" "list"  
> txtc
Corpus consisting of 35,701 documents and 5 docvars.
> devtools::install_github("quanteda/spacyr", ref = "tokenize-function",force=TRUE)
Downloading GitHub repo quanteda/spacyr@tokenize-function
from URL https://api.github.com/repos/quanteda/spacyr/zipball/tokenize-function
Installing spacyr
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL  \
  '/tmp/Rtmp3TNiYi/devtoolsc003f381001/quanteda-spacyr-240b6ef'  \
  --library='/home/andre/R/x86_64-pc-linux-gnu-library/3.4' --install-tests 

* installing *source* package ‘spacyr’ ...
** R
** data
*** moving datasets to lazyload DB
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (spacyr)
Reloading installed spacyr
unloadNamespace("spacyr") not successful, probably because another loaded package depends on it.Forcing unload. If you encounter problems, please restart R.

Attaching package: ‘spacyr’

The following object is masked from ‘package:quanteda’:

    spacy_parse

> library("spacyr")
> parsed <- spacy_tokenize(corpus_sample(txtc,10))
Error in UseMethod("spacy_tokenize") : 
  no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"
aourednik commented 6 years ago

@jwijffels Many thanks for the code! It returns a named list of character vectors containing lemmatized tokens, which comes much closer to what I (and most probably other users of both quanteda and udpipe) would need. The best, though, would be a udpipe function that returns a quanteda object of class tokens. A tokens object is normally generated by tokens() or by the new spacy_tokenize() discussed here, and it can easily be turned into a document-feature matrix with dfm(), which allows, for instance, fast dictionary lookup with dfm_lookup(). My concrete use case is lexicon-based sentiment analysis and emotion mining.
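
For illustration, a minimal sketch of that workflow, assuming lemma_list is a named list of lemmatized token vectors (for instance the split() result from the udpipe code above) and my_dict is a quanteda dictionary you have defined:

library(quanteda)
# lemma_list: named list of character vectors of lemmas, one element per document (assumed input)
toks <- as.tokens(lemma_list)
dfmat <- dfm(toks)
# my_dict: e.g. dictionary(list(positive = c("bon", "heureux"), negative = c("mauvais", "triste")))
sentiment <- dfm_lookup(dfmat, dictionary = my_dict)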

jwijffels commented 6 years ago

quanteda's tokens element of a corpus seems to be a list of terms with the class tokenizedTexts. If you want this, just wrap the code that I showed above in as.tokenizedTexts(), which is part of quanteda.

library(udpipe)
library(quanteda)
udmodel <- udpipe_load_model("french-ud-2.0-170801.udpipe")
## assuming that txtc is a quanteda corpus
x <- udpipe_annotate(udmodel, x = texts(txtc), doc_id = docnames(txtc), parser = "none")
x <- as.data.frame(x)
txtc$tokens <- as.tokenizedTexts(split(x$lemma, x$doc_id))

If you want to use udpipe to get a DTM/document-feature matrix of adjectives for sentiment analysis, you can just use the code below and proceed with e.g. dfm_lookup() if you need it.

## For sentiment analysis, with udpipe, just take the adjectives and get a dtm
x <- subset(x, upos %in% c('ADJ'))
dtm <- document_term_frequencies(x, document = "doc_id", term = "lemma")
dtm <- document_term_matrix(dtm)
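
As a possible follow-up to the code above (a sketch only; the as.dfm() conversion and the my_dict dictionary object are assumptions, not part of the snippet above):

library(quanteda)
# convert the sparse document-term matrix from udpipe into a quanteda dfm
dfmat <- as.dfm(dtm)
# my_dict: a quanteda dictionary object defined elsewhere
sentiment <- dfm_lookup(dfmat, dictionary = my_dict)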
ChengYJon commented 6 years ago

This has been super useful! Thank you!

Are there any plans to implement spaCy's neural coreference functions in R?

kasperwelbers commented 5 years ago

@ChengYJon I was also looking to use the neuralcoref pipeline component, so I took a stab at it in this fork.

There is some hassle, though (as explained in the README), because neuralcoref currently doesn't seem to work with spaCy > 2.0.12. Simply downgrading spaCy in turn resulted in other compatibility issues, so for me a clean conda install was required. Until these compatibility issues are resolved, it's quite cumbersome.

ChengYJon commented 5 years ago

@kasperwelbers Thank you so much for this. I kept having to switch between Python and R. I'll try this fork out and let you know if I'm able to recreate the process.

fkrauer commented 5 years ago

If it isn't already incorporated (I haven't found anything), I'd love to have "start" and "end" character positions for each token. Otherwise tokens cannot be uniquely identified in the running text.

amatsuo commented 5 years ago

@fkrauer Thank you for the post.

I am not sure what you mean by "start" and "end".

Could you elaborate a bit more? Or could you show us the desired output?

fkrauer commented 5 years ago

I mean the character position of each token with respect to the original text. For example:

text <- "This is a dummy text."
output <- spacy_parse(text)

> output
token   start   end
This        1     4
is          6     7
a           9     9
dummy      11    15
text       17    20
.          21    21

The count starts at 1 for the first character, and all characters are counted (including whitespace). coreNLP (the R wrapper for Stanford's CoreNLP) has this feature, which is very useful when you have to map the original text back onto the tokens or compare different NLP algorithms.

amatsuo commented 5 years ago

I see. It's not implemented in spacyr, but you could do something like this.

library(spacyr)
library(tidyverse)

txt <- c(doc1 = "spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research in 2015 found spaCy to be the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using.",
         doc2 = "spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python's awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.")
out <- spacy_parse(txt, additional_attributes = c("idx"), entity = FALSE,
                   lemma = FALSE, pos = FALSE)

out %>%
    mutate(start = idx - idx[1] + 1) %>%
    mutate(end = start + nchar(token) - 1) 

What the code does is:

  1. run spacy_parse with the additional attribute idx, which returns the character offset of each token within its document.
  2. calculate start and end from idx and the token length (nchar(token)).

The head of output is:

##    doc_id sentence_id token_id       token idx start end
## 1    doc1           1        1       spaCy   0     1   5
## 2    doc1           1        2      excels   6     7  12
## 3    doc1           1        3          at  13    14  15
## 4    doc1           1        4       large  16    17  21
## 5    doc1           1        5           -  21    22  22
## 6    doc1           1        6       scale  22    23  27
## 7    doc1           1        7 information  28    29  39
## 8    doc1           1        8  extraction  40    41  50
## 9    doc1           1        9       tasks  51    52  56
## 10   doc1           1       10           .  56    57  57
## 11   doc1           2        1          It  58    59  60
## 12   doc1           2        2          's  60    61  62
## 13   doc1           2        3     written  63    64  70
## 14   doc1           2        4        from  71    72  75
## 15   doc1           2        5         the  76    77  79
## 16   doc1           2        6      ground  80    81  86
## 17   doc1           2        7          up  87    88  89
## 18   doc1           2        8          in  90    91  92
## 19   doc1           2        9   carefully  93    94 102
## 20   doc1           2       10      memory 103   104 109

I am not sure whether we should provide this as built-in functionality of spacy_parse() yet, but it could be.

fkrauer commented 5 years ago

I had written a for loop with stringr::str_locate(), but your solution is much quicker, thank you.

mshariful commented 4 years ago

spacy_initialize

Hi, I have saved an updated spaCy NER model in 'c\updated_model'. The folder 'updated_model' contains the 'tagger', 'parser', 'ner', and 'vocab' folders together with two files, 'meta.json' and 'tokenizer'. I can easily load and use this updated model in Python simply by calling spacy.load('c\updated_model'). How do I load it in spacyr? I tried spacy_initialize(model='c\updated_model'). I did not get any error, but it seems spacyr uses the default 'de' model. How do I make sure spacyr uses my updated model?

TIA Sharif