amatsuo opened this issue 6 years ago
Having a "just tokenization" option with lemmatization would be great. Currently I am trying to use:
parsed <- spacy_parse(my_corpus, pos=FALSE, entity=FALSE, dependency=FALSE)
parsed$token <- parsed$lemma
my_tokens <- as.tokens(parsed)
The first line yields a memory overload on a large my_corpus, while tokens(my_corpus) is fast, with no memory problems. I don't know to what extent this is due to the inherent memory use of spaCy, though.
Could spacyr somehow be included as an option in the tokens() function? Something like this?
my_tokens <- tokens(txtc,
                    what = "word",
                    remove_numbers = TRUE,
                    remove_punct = TRUE,
                    remove_separators = TRUE,
                    remove_symbols = TRUE,
                    include_docvars = TRUE,
                    lemmatize = "spacy_parse")
Not a bad idea. @amatsuo maybe add:
spacy_tokenize(x, what = c("word", "sentence"),
               remove_numbers = FALSE, remove_punct = FALSE,
               remove_symbols = FALSE, remove_separators = TRUE,
               remove_twitter = FALSE, remove_hyphens = FALSE,
               remove_url = FALSE, value = c("list", "data.frame"))
where the last one returns one of the two TIF formats for tokens? This is as close to quanteda::tokens() as possible, and spacy_tokenize(x, value = "list") %>% as.tokens() provides the option of going straight to a quanteda tokens class using the spaCy tokeniser.
We could also add to spacy_parse() a new sentence option (default TRUE); setting it to FALSE would drop the sentence_id return field and number tokens consecutively within each document. So if all options are FALSE, it's the same as spacy_tokenize(x, what = "word", value = "data.frame") -- indeed, that function could call this version of spacy_parse().
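Something like this sketch, assuming the proposed signature above (txt stands for any character vector or corpus):
library(spacyr)
library(quanteda)
spacy_initialize()
# tokenise with spaCy, then go straight to a quanteda tokens object
toks <- as.tokens(spacy_tokenize(txt, what = "word", remove_punct = TRUE,
                                 value = "list"))
That would keep the whole pipeline inside R while using spaCy's tokeniser.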
Definitely would be interested in noun phrase extraction.
Hi @dmklotz
I opened an issue for noun-phrase extraction (#117). Please provide your thoughts there.
@aourednik and @kbenoit I have implemented spacy_tokenize in the tokenize-function branch. Please try it out and give me some feedback.
Some options are left out: remove_symbols, remove_hyphens, and remove_twitter. In my opinion, these options are about text preprocessing before handing texts to the spaCy NLP pipeline. At the moment, spacyr does not import stringi, and I don't see much reason to use gsub() in 2018 for potentially large-scale text processing.
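For example, a user who wants that behaviour could clean the texts with stringi before handing them to spaCy; a rough sketch (the patterns are only illustrative):
library(stringi)
library(spacyr)
# strip URLs and stray symbols before tokenising with spaCy
txt_clean <- stri_replace_all_regex(txt, "https?://\\S+", "")
txt_clean <- stri_replace_all_regex(txt_clean, "[#*^~]+", "")
toks <- spacy_tokenize(txt_clean)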
Is it possible to train a new model with spacyr at the moment?
@cecilialee No, to train a new language model you would need to do that in Python, following the spaCy instructions. We are unlikely to add this facility to spacyr in the foreseeable future.
@kbenoit Sure. Then if I've trained a model with python, how can I use (initialize) that model with spacyr?
@cecilialee The model argument of spacy_initialize() is handed to the model name argument of spacy.load('**'), so you should be able to use the name of the model you saved in Python when you call spacy_initialize().
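For example (a sketch; the path is hypothetical and assumes the model was saved to disk from Python):
library(spacyr)
# the model argument is passed straight through to spacy.load()
spacy_initialize(model = "/path/to/my_custom_model")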
@amatsuo Is there a simple way to install the full tokenize-function branch version of spacyr in R?
@aourednik that would be
devtools::install_github("quanteda/spacyr", ref = "tokenize-function")
Great, thanks for these developments! By the way, this has more to do with quanteda in general than with spacyr, but since we are speaking of lemmatization, I was wondering if it would be feasible to implement a udpipe lemmatizer in the tokens() function? Or something like udpipe_tokenize(), taking a quanteda corpus as argument and returning lemmatized tokens? UDPipe is reported to perform better, though slower, lemmatization for French, Italian, and Spanish than spaCy. For now, I can get a list of lists of tokens like this (below), but having a quanteda tokens object would allow me to remain within the quanteda framework.
library("udpipe")
# dl <- udpipe_download_model(language = "french") # necessary only when not yet downloaded
udmodel_french <- udpipe_load_model(file = "french-ud-2.0-170801.udpipe")
#txtc is my quanteda corpus
txtudpipetokens <- lapply(head(texts(txtc)), function(x) {
udp <- udpipe_annotate(udmodel_french, x)
return(as.data.table(udp)$lemma)
}
)
cf. https://github.com/bnosac/udpipe @amatsuo @jwijffels
Glad it's working for you! We should be finished with the integration of the tokenize-function branch next week. When that's completed, it will be very easy to use spacyr for tokenisation or lemmatisation.
On integration with udpipe, that's probably better done in that package. @jwijffels we'd be happy to assist with this.
@amatsuo @kbenoit I have tried out:
devtools::install_github("quanteda/spacyr", ref = "tokenize-function")
parsed <- spacy_tokenize(corpus_sample(txtc,10))
#Error in spacy_tokenize(corpus_sample(txtc, 10)) :
# could not find function "spacy_tokenize"
source("https://raw.githubusercontent.com/quanteda/spacyr/tokenize-function/R/spacy_tokenize.R")
parsed <- spacy_tokenize(corpus_sample(txtc,10))
#Error in UseMethod("spacy_tokenize") :
# no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"
Session info -------------------------------------------------------------------------------------------------
setting value
version R version 3.4.4 (2018-03-15)
system x86_64, linux-gnu
ui RStudio (1.1.423)
language en_US
collate en_US.UTF-8
tz Europe/Zurich
date 2018-08-30
Packages -----------------------------------------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.4.1)
backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
base * 3.4.4 2018-03-16 local
base64enc 0.1-3 2015-07-28 CRAN (R 3.4.2)
bindr 0.1.1 2018-03-13 CRAN (R 3.4.3)
bindrcpp 0.2.2 2018-03-29 CRAN (R 3.4.4)
checkmate 1.8.5 2017-10-24 CRAN (R 3.4.3)
codetools 0.2-15 2016-10-05 CRAN (R 3.3.1)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
compiler 3.4.4 2018-03-16 local
crayon 1.3.4 2017-09-16 CRAN (R 3.4.3)
curl 3.2 2018-03-28 CRAN (R 3.4.4)
data.table * 1.11.4 2018-05-27 CRAN (R 3.4.4)
datasets * 3.4.4 2018-03-16 local
devtools 1.13.6 2018-06-27 CRAN (R 3.4.4)
digest 0.6.16 2018-08-22 CRAN (R 3.4.4)
doMC * 1.3.5 2017-12-12 CRAN (R 3.4.3)
dplyr 0.7.6 2018-06-29 CRAN (R 3.4.4)
evaluate 0.11 2018-07-17 CRAN (R 3.4.4)
fastmatch 1.1-1 2017-11-21 local
forcats * 0.3.0 2018-02-19 CRAN (R 3.4.4)
foreach * 1.4.4 2017-12-12 CRAN (R 3.4.3)
ggplot2 * 3.0.0 2018-07-03 CRAN (R 3.4.4)
git2r 0.23.0 2018-07-17 CRAN (R 3.4.4)
glue 1.3.0 2018-07-17 CRAN (R 3.4.4)
graphics * 3.4.4 2018-03-16 local
grDevices * 3.4.4 2018-03-16 local
grid 3.4.4 2018-03-16 local
gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
htmlTable * 1.12 2018-05-26 CRAN (R 3.4.4)
htmltools 0.3.6 2017-04-28 CRAN (R 3.4.2)
htmlwidgets 1.2 2018-04-19 CRAN (R 3.4.4)
httr 1.3.1 2017-08-20 CRAN (R 3.4.2)
igraph * 1.1.2 2017-07-21 CRAN (R 3.4.2)
iterators * 1.0.10 2018-07-13 CRAN (R 3.4.4)
jsonlite 1.5 2017-06-01 CRAN (R 3.4.2)
knitr 1.20 2018-02-20 CRAN (R 3.4.3)
labeling 0.3 2014-08-23 CRAN (R 3.4.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.3.3)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
lubridate 1.7.4 2018-04-11 CRAN (R 3.4.4)
magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
Matrix 1.2-14 2018-04-09 CRAN (R 3.4.4)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.3)
methods * 3.4.4 2018-03-16 local
munsell 0.5.0 2018-06-12 CRAN (R 3.4.4)
parallel * 3.4.4 2018-03-16 local
pillar 1.3.0 2018-07-14 CRAN (R 3.4.4)
pkgconfig 2.0.2 2018-08-16 CRAN (R 3.4.4)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
purrr 0.2.5 2018-05-29 CRAN (R 3.4.4)
qdapRegex 0.7.2 2017-04-09 CRAN (R 3.4.2)
quanteda * 1.3.4 2018-07-15 CRAN (R 3.4.4)
R2HTML * 2.3.2 2016-06-23 CRAN (R 3.4.3)
R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.1)
Rcpp 0.12.18 2018-07-23 CRAN (R 3.4.4)
RcppParallel 4.4.1 2018-07-19 CRAN (R 3.4.4)
readtext * 0.71 2018-05-10 CRAN (R 3.4.4)
rlang 0.2.2 2018-08-16 CRAN (R 3.4.4)
rlist * 0.4.6.1 2016-04-04 CRAN (R 3.4.4)
rmarkdown 1.10 2018-06-11 CRAN (R 3.4.4)
rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
rstudioapi 0.7 2017-09-07 CRAN (R 3.4.3)
scales * 1.0.0 2018-08-09 CRAN (R 3.4.4)
spacyr 0.9.91 2018-08-30 Github (quanteda/spacyr@240b6ef)
stats * 3.4.4 2018-03-16 local
stopwords 0.9.0 2017-12-14 CRAN (R 3.4.3)
stringi 1.2.4 2018-07-20 CRAN (R 3.4.4)
stringr * 1.3.1 2018-05-10 CRAN (R 3.4.4)
textclean * 0.9.3 2018-07-23 CRAN (R 3.4.4)
tibble 1.4.2 2018-01-22 CRAN (R 3.4.3)
tidyselect 0.2.4 2018-02-26 CRAN (R 3.4.3)
tools 3.4.4 2018-03-16 local
udpipe * 0.6.1 2018-07-30 CRAN (R 3.4.4)
utils * 3.4.4 2018-03-16 local
withr 2.1.2 2018-03-15 CRAN (R 3.4.4)
yaml 2.2.0 2018-07-25 CRAN (R 3.4.4)
It seems that you forgot to load the package with library(spacyr).
If you just want to get the lemmas in French using udpipe and put them into the quanteda corpus structure, I think it is just this (the example below takes only nouns and proper nouns).
library(udpipe)
library(quanteda)
udmodel <- udpipe_load_model("french-ud-2.0-170801.udpipe")
## assuming that txtc is a quanteda corpus
x <- udpipe_annotate(udmodel, x = texts(txtc), doc_id = docnames(txtc), parser = "none")
x <- as.data.frame(x)
x <- subset(x, upos %in% c('NOUN', 'PROPN'))
txtc$tokens <- split(x$lemma, x$doc_id)
Why do you think such code would have to be put into the udpipe R package?
@amatsuo Yes, my mistake, I forgot to reload the package; the first error was due to this, sorry. Now I am getting only the second error on my machine (same session info as before):
> class(txtc)
[1] "corpus" "list"
> txtc
Corpus consisting of 35,701 documents and 5 docvars.
> devtools::install_github("quanteda/spacyr", ref = "tokenize-function",force=TRUE)
Downloading GitHub repo quanteda/spacyr@tokenize-function
from URL https://api.github.com/repos/quanteda/spacyr/zipball/tokenize-function
Installing spacyr
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL \
'/tmp/Rtmp3TNiYi/devtoolsc003f381001/quanteda-spacyr-240b6ef' \
--library='/home/andre/R/x86_64-pc-linux-gnu-library/3.4' --install-tests
* installing *source* package ‘spacyr’ ...
** R
** data
*** moving datasets to lazyload DB
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (spacyr)
Reloading installed spacyr
unloadNamespace("spacyr") not successful, probably because another loaded package depends on it.Forcing unload. If you encounter problems, please restart R.
Attaching package: ‘spacyr’
The following object is masked from ‘package:quanteda’:
spacy_parse
> library("spacyr")
> parsed <- spacy_tokenize(corpus_sample(txtc,10))
Error in UseMethod("spacy_tokenize") :
no applicable method for 'spacy_tokenize' applied to an object of class "c('corpus', 'list')"
@jwijffels Many thanks for the code! It currently returns a named list of character vectors containing lemmatized tokens, which comes much closer to what I (and most probably other users of both quanteda and udpipe) would need. The best, though, would be having a udpipe function return a quanteda object of class tokens. A tokens object is normally generated by tokens() or by the new spacy_tokenize() discussed here, and it can easily be turned into a document-feature matrix with dfm(), which allows, for instance, fast dictionary lookup with dfm_lookup().
My concrete use-case is lexicon-based sentiment analysis and emotion mining.
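For reference, the downstream step I have in mind is roughly this (a sketch; the dictionary entries are placeholders and toks stands for the lemmatized tokens object):
library(quanteda)
sent_dict <- dictionary(list(positive = c("bon", "excellent", "heureux"),
                             negative = c("mauvais", "terrible", "triste")))
dfmat <- dfm(toks)
dfm_lookup(dfmat, dictionary = sent_dict)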
quanteda's tokens element of the corpus seems to be a list of terms with the class tokenizedTexts. If you want this, just wrap the code that I showed above in as.tokenizedTexts(), which is part of quanteda.
library(udpipe)
library(quanteda)
udmodel <- udpipe_load_model("french-ud-2.0-170801.udpipe")
## assuming that txtc is a quanteda corpus
x <- udpipe_annotate(udmodel, x = texts(txtc), doc_id = docnames(txtc), parser = "none")
x <- as.data.frame(x)
txtc$tokens <- as.tokenizedTexts(split(x$lemma, x$doc_id))
If you want to use udpipe to get a DTM / document-feature matrix of adjectives for sentiment analysis, you can just use the code below and proceed with e.g. dfm_lookup() if you need it.
## For sentiment analysis, with udpipe, just take the adjectives and get a dtm
x <- subset(x, upos %in% c('ADJ'))
dtm <- document_term_frequencies(x, document = "doc_id", term = "lemma")
dtm <- document_term_matrix(dtm)
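And if you then want to get back into quanteda, converting that sparse matrix should work too (a sketch; quanteda's as.dfm() accepts sparse matrices):
library(quanteda)
# turn udpipe's document-term matrix into a quanteda dfm,
# after which dfm_lookup() and the other dfm_* functions apply
dfmat <- as.dfm(dtm)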
This has been super useful! Thank you!
Are there any plans to make spaCy's neural coreference functions available in R?
@ChengYJon I was also looking to use the neuralcoref pipeline component, so I took a stab at it in this fork.
There is some hassle though (as explained in the README), because neuralcoref currently doesn't seem to work with spacy > 2.0.12. Simply downgrading spacy in turn resulted in other compatibility issues, so for me a clean conda install was required. Until these compatibility issues are resolved it's quite cumbersome.
@kasperwelbers Thank you so much for this. I kept having to switch between Python and R. I'll try this fork out and let you know if I'm able to recreate the process.
If it isn't already incorporated (I haven't found anything), I'd love to have "start" and "end" character positions for each token. Otherwise tokens cannot be uniquely identified in the running text.
@fkrauer Thank you for the post. I am not sure what you mean by start and end. Could you elaborate a bit more, or show us the desired output?
I mean the character position of each token with respect to the original text. For example:
text <- "This is a dummy text."
output <- spacy_parse(text)
> output
token start end
This 1 4
is 6 7
a 9 9
dummy 11 15
text 17 20
. 21 21
The count starts at 1 for the first character, and all characters are counted (including whitespace). coreNLP (the R wrapper for Stanford's CoreNLP) has this feature, which is very useful when you have to map the original text back onto the tokens or compare different NLP algorithms.
I see. It's not implemented in spacyr, but you could do something like this.
library(spacyr)
library(tidyverse)

txt <- c(doc1 = "spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research in 2015 found spaCy to be the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using.",
         doc2 = "spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python's awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.")

out <- spacy_parse(txt, additional_attributes = c("idx"), entity = FALSE,
                   lemma = FALSE, pos = FALSE)
out %>%
  mutate(start = idx - idx[1] + 1) %>%
  mutate(end = start + nchar(token) - 1)
What the code does is run spacy_parse() with the additional attribute idx, which returns the character offset of each token within its document, and then compute start and end from those offsets. The head of the output is:
## doc_id sentence_id token_id token idx start end
## 1 doc1 1 1 spaCy 0 1 5
## 2 doc1 1 2 excels 6 7 12
## 3 doc1 1 3 at 13 14 15
## 4 doc1 1 4 large 16 17 21
## 5 doc1 1 5 - 21 22 22
## 6 doc1 1 6 scale 22 23 27
## 7 doc1 1 7 information 28 29 39
## 8 doc1 1 8 extraction 40 41 50
## 9 doc1 1 9 tasks 51 52 56
## 10 doc1 1 10 . 56 57 57
## 11 doc1 2 1 It 58 59 60
## 12 doc1 2 2 's 60 61 62
## 13 doc1 2 3 written 63 64 70
## 14 doc1 2 4 from 71 72 75
## 15 doc1 2 5 the 76 77 79
## 16 doc1 2 6 ground 80 81 86
## 17 doc1 2 7 up 87 88 89
## 18 doc1 2 8 in 90 91 92
## 19 doc1 2 9 carefully 93 94 102
## 20 doc1 2 10 memory 103 104 109
I am not sure whether we should provide this as a functionality of spacy_parse() yet, but it could be.
I have written a for loop with stringr::str_locate(), but your solution is much quicker, thank you.
Hi,
I have saved an updated spaCy NER model in 'c\updated_model'. The folder 'updated_model' contains the 'tagger', 'parser', 'ner', and 'vocab' folders, together with two files, 'meta.json' and 'tokenizer'. I can easily load and use this updated model in Python simply with
spacy.load('c\updated_model')
How do I load it in spacyr? I tried
spacy_initialize(model = 'c\updated_model')
I did not get any error, but it seems spacyr uses the default 'de' model. How do I make sure spacyr uses my updated model?
TIA, Sharif
Hello @kbenoit and other users of spacyr,
This is a comprehensive wishlist of spacyr updates, inspired by our discussion with @honnibal and @ines. We will implement some of them in the future, but is there anything you are particularly interested in? The items are labelled as:
Something likely to be implemented
Something nice to have, but not sure how many users need it
Just a wish