Closed jwijffels closed 6 years ago
Yes! That would be a great idea actually. I was not aware of the udpipe package, but had been hoping someone would implement something that does not depend on Python or Java. More updates soon.
Great! If you need any help or input from me on this, let me know.
I just pushed cleanNLP version 2.0.0 to GitHub. It is a major rewrite of the annotation tasks, so it still needs some testing, but it now supports the udpipe backend. Install from GitHub:
devtools::install_github("statsmaths/cleanNLP")
And then a minimal working example is given with this input data:
text <- c("It is better to be looked over than overlooked.",
"Real stupidity beats artificial intelligence every time.",
"The secret of getting ahead is getting started.")
input <- data.frame(doc_id = c("West", "Pratchett", "Twain"),
text = text,
stringsAsFactors = FALSE)
Parsed with:
library(cleanNLP)
cnlp_init_udpipe()
output <- cnlp_get_tif(cnlp_annotate_tif(input))
print.data.frame(head(output))
## doc_id sid tid word lemma upos pos cid pid case definite degree
## 1 West 1 1 It it PRON PRP 0 1 Nom <NA> <NA>
## 2 West 1 2 is be AUX VBZ 3 1 <NA> <NA> <NA>
## 3 West 1 3 better better ADJ JJR 6 1 <NA> <NA> Cmp
## 4 West 1 4 to to PART TO 13 1 <NA> <NA> <NA>
## 5 West 1 5 be be AUX VB 16 1 <NA> <NA> <NA>
## 6 West 1 6 looked look VERB VBN 19 1 <NA> <NA> <NA>
## gender mood number person pron_type tense verb_form voice source
## 1 Neut <NA> Sing 3 Prs <NA> <NA> <NA> 3
## 2 <NA> Ind Sing 3 <NA> Pres Fin <NA> 3
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 0
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 6
## 5 <NA> <NA> <NA> <NA> <NA> <NA> Inf <NA> 6
## 6 <NA> <NA> <NA> <NA> <NA> Past Part Pass 3
## relation word_source lemma_source spaces
## 1 expl better better 1
## 2 cop better better 1
## 3 root ROOT ROOT 1
## 4 mark looked look 1
## 5 aux:pass looked look 1
## 6 csubj better better 1
Please let me know if you have any thoughts or suggestions on how it has been incorporated. Thanks again for letting me know about the package!
Wow. Great effort!
I've had a look at the code. Here are some remarks:
cnlp_get_features
but I'm not sure on that.
location
or something similar.
library(udpipe)
m <- udpipe_download_model("german")
m <- udpipe_load_model(m$file_model)
x <- udpipe_annotate(m, x = "Wir gehen zum kino")
as.data.frame(x)
doc_id paragraph_id sentence_id sentence token_id token lemma upos xpos feats head_token_id
1 doc1 1 1 Wir gehen zum kino 1 Wir wir PRON PPER Case=Nom|Number=Plur|Person=1|PronType=Prs 2
2 doc1 1 1 Wir gehen zum kino 2 gehen gehen VERB VVFIN Number=Plur|Person=1|VerbForm=Fin 0
3 doc1 1 1 Wir gehen zum kino 3-4 zum <NA> <NA> <NA> <NA> <NA>
4 doc1 1 1 Wir gehen zum kino 3 zu zu ADP APPR <NA> 5
5 doc1 1 1 Wir gehen zum kino 4 dem der DET ART Case=Dat|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Art 5
6 doc1 1 1 Wir gehen zum kino 5 kino kino PROPN NN Case=Dat|Gender=Masc,Neut|Number=Sing 2
library(cleanNLP)
cnlp_init_udpipe(model_name = "german")
anno <- cnlp_annotate("Wir gehen zum kino", as_strings = TRUE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '3-4'
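The `3-4` that trips `scan()` is the token_id of the multi-word token row for "zum" shown in the udpipe output above. As a rough sketch (not the fix that cleanNLP actually applied), one way to cope is to drop such range rows before coercing token_id to integer. The data frame below is a hand-built stand-in for the udpipe output, and the names `tokens`, `is_range` and `words` are only illustrative:

```r
# Hand-built stand-in for as.data.frame(udpipe_annotate(...)) on "Wir gehen zum kino";
# only the columns needed for the illustration are kept.
tokens <- data.frame(
  token_id = c("1", "2", "3-4", "3", "4", "5"),
  token    = c("Wir", "gehen", "zum", "zu", "dem", "kino"),
  lemma    = c("wir", "gehen", NA, "zu", "der", "kino"),
  stringsAsFactors = FALSE
)

# Multi-word token rows carry a range such as "3-4" instead of a single id
is_range <- grepl("^[0-9]+-[0-9]+$", tokens$token_id)

# Keep only the syntactic words, whose ids can be read as integers
words <- tokens[!is_range, ]
words$token_id <- as.integer(words$token_id)
```

Whether dropping the surface form "zum" is acceptable depends on the use case; keeping it in a separate column would be another option.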
Example on locally trained model.
cnlp_init_udpipe(model_path = "C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag.udpipe")
anno <- cnlp_annotate("Ik ga op reis naar de Caraiben.", as_strings = TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Browse[2]>
debug: anno <- udpipe::udpipe_annotate(volatiles$udpipe$model_obj, input_txt)
Browse[2]>
debug: out <- from_udpipe_CoNLL(anno$conllu)
Browse[2]> anno
$x
[1] "Ik ga op reis naar de Caraiben."
$conllu
[1] ""
$errors
[1] "No parser defined for the UDPipe model!"
attr(,"class")
[1] "udpipe_connlu"
## This works as follows:
m <- udpipe_load_model("C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag.udpipe")
x <- udpipe_annotate(m, x = "Ik ga op reis naar de Caraiben.", parser = "none")
as.data.frame(x)
doc_id paragraph_id sentence_id sentence token_id token lemma upos xpos feats head_token_id dep_rel deps
1 doc1 1 1 Ik ga op reis naar de Caraiben. 1 Ik ik PRON VNW|pers|pron|nomin|vol|1|ev Case=Nom|Person=1|PronType=Prs <NA> <NA> <NA>
2 doc1 1 1 Ik ga op reis naar de Caraiben. 2 ga ga NOUN N|soort|mv|basis Number=Plur <NA> <NA> <NA>
3 doc1 1 1 Ik ga op reis naar de Caraiben. 3 op op ADP VZ|init <NA> <NA> <NA> <NA>
4 doc1 1 1 Ik ga op reis naar de Caraiben. 4 reis reis NOUN N|soort|ev|basis|zijd|stan Gender=Com|Number=Sing <NA> <NA> <NA>
5 doc1 1 1 Ik ga op reis naar de Caraiben. 5 naar naar ADP VZ|init <NA> <NA> <NA> <NA>
6 doc1 1 1 Ik ga op reis naar de Caraiben. 6 de de DET LID|bep|stan|rest Definite=Def <NA> <NA> <NA>
7 doc1 1 1 Ik ga op reis naar de Caraiben. 7 Caraiben Caraiben NOUN N|soort|ev|basis|zijd|stan Gender=Com|Number=Sing <NA> <NA> <NA>
8 doc1 1 1 Ik ga op reis naar de Caraiben. 8 . . PUNCT LET <NA> <NA> <NA> <NA>
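The debugger output above shows that udpipe returned an empty `conllu` string together with a message in `errors`, and the `read.table` failure only appeared afterwards. A minimal sketch of a guard that surfaces the udpipe message directly, assuming the list layout shown above; `check_annotation` is a hypothetical helper, not part of cleanNLP or udpipe:

```r
# Hand-built stand-in for the object returned by udpipe_annotate() in the failing call
anno <- list(
  x = "Ik ga op reis naar de Caraiben.",
  conllu = "",
  errors = "No parser defined for the UDPipe model!"
)

# Hypothetical helper: stop with udpipe's own message when nothing was annotated
check_annotation <- function(anno) {
  if (!nzchar(anno$conllu)) {
    stop("udpipe returned no annotation: ", anno$errors, call. = FALSE)
  }
  invisible(anno)
}

# Capture the error message instead of aborting, for illustration
msg <- tryCatch(check_annotation(anno), error = function(e) conditionMessage(e))
```

With a check like this, the user would see the missing-parser message rather than "no lines available in input".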
Maybe for these models it makes sense to keep the supplied model_name instead of reporting the language as custom:
cnlp_init_udpipe(model_path = "C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag-parse.udpipe", model_name = "nl")
anno <- cnlp_annotate("Ik ga op reis naar de Caraiben.", as_strings = TRUE)
cnlp_get_document(anno)
doc_id time version language uri
doc1 2018-01-03 10:30:29 0.3 custom <NA>
Glad you like the package! It saves me at least a whole lot of installation problems when giving courses on text mining.
Thanks for the quick and thorough feedback. Yes, I agree that this will be a great benefit to teaching and casual users. I just uploaded version 2.0.1 to GitHub with the following changes:
Here is an example with the features and parser turned off:
library(cleanNLP)
cnlp_init_udpipe(feature_flag = FALSE, parser = "none")
output <- cnlp_get_tif(cnlp_annotate("It is better to be looked over than overlooked."))
print.data.frame(head(output))
doc_id sid tid word lemma upos pos cid pid spaces
1 doc1 1 1 It it PRON PRP 0 1 1
2 doc1 1 2 is be AUX VBZ 3 1 1
3 doc1 1 3 better better ADJ JJR 6 1 1
4 doc1 1 4 to to PART TO 13 1 1
5 doc1 1 5 be be AUX VB 16 1 1
6 doc1 1 6 looked look VERB VBN 19 1 1
And one parsing the German text you referenced:
library(cleanNLP)
cnlp_init_udpipe(model_name = "german")
anno <- cnlp_annotate("Wir gehen zum kino", as_strings = TRUE)
print.data.frame(head(cnlp_get_tif(anno)))
doc_id sid tid word lemma upos pos cid pid case definite gender number
1 doc1 1 1 Wir wir PRON PPER 0 1 Nom <NA> <NA> Plur
2 doc1 1 2 gehen gehen VERB VVFIN 4 1 <NA> <NA> <NA> Plur
3 doc1 1 3 zu zu ADP APPR 10 1 <NA> <NA> <NA> <NA>
4 doc1 1 4 dem der DET ART 13 1 Dat Def Masc,Neut Sing
5 doc1 1 5 kino kino PROPN NN 17 1 Dat <NA> Masc,Neut Sing
person pron_type verb_form source relation word_source lemma_source spaces
1 1 Prs <NA> 2 nsubj gehen gehen 1
2 1 <NA> Fin 0 root ROOT ROOT 1
3 <NA> <NA> <NA> 5 case kino kino 1
4 <NA> Art <NA> 5 det kino kino 1
5 <NA> <NA> <NA> 2 obl gehen gehen 0
If you have any other suggestions I would greatly appreciate them!
Thank you for the changes. This looks good to me. Thank you for all the effort!
@statsmaths FYI: udpipe_download_model now has an extra argument called udpipe_model_repo, indicating which GitHub repository to download the model from.
You can either indicate in that argument
The default is 'jwijffels/udpipe.models.ud.2.0'
Would it make sense to also add the udpipe R package as a backend? That package also has no external dependencies and provides tokenisation, lemmatisation, POS tagging, feature tagging and dependency parsing. The package is available at https://cran.r-project.org/web/packages/udpipe/index.html