Udpipe as backend - GitHub Issues

Would it make sense to also add the udpipe R package as a backend? That package also has no external dependencies and provides tokenisation, lemmatisation, POS tagging, feature tagging and dependency parsing. The package is available at https://cran.r-project.org/web/packages/udpipe/index.html

Yes! That would be a great idea actually. I was not aware of the udpipe package, but had been hoping someone would implement something that does not depend on Python or Java. More updates soon.

Great! If you need any help or input from me on this, let me know.

I just pushed cleanNLP version 2.0.0 to GitHub. It is a major rewrite of the annotation tasks, so it still needs some testing, but it now supports the udpipe backend. Install it from GitHub:

devtools::install_github("statsmaths/cleanNLP")

And then a minimal working example is given with this input data:

text <- c("It is better to be looked over than overlooked.",
         "Real stupidity beats artificial intelligence every time.",
         "The secret of getting ahead is getting started.")
input <- data.frame(doc_id = c("West", "Pratchett", "Twain"),
                        text = text,
                        stringsAsFactors = FALSE)

Parsed with:

library(cleanNLP)
cnlp_init_udpipe()
output <- cnlp_get_tif(cnlp_annotate_tif(input))
print.data.frame(head(output))

##   doc_id sid tid   word  lemma upos pos cid pid case definite degree
## 1   West   1   1     It     it PRON PRP   0   1  Nom     <NA>   <NA>
## 2   West   1   2     is     be  AUX VBZ   3   1 <NA>     <NA>   <NA>
## 3   West   1   3 better better  ADJ JJR   6   1 <NA>     <NA>    Cmp
## 4   West   1   4     to     to PART  TO  13   1 <NA>     <NA>   <NA>
## 5   West   1   5     be     be  AUX  VB  16   1 <NA>     <NA>   <NA>
## 6   West   1   6 looked   look VERB VBN  19   1 <NA>     <NA>   <NA>
##   gender mood number person pron_type tense verb_form voice source
## 1   Neut <NA>   Sing      3       Prs  <NA>      <NA>  <NA>      3
## 2   <NA>  Ind   Sing      3      <NA>  Pres       Fin  <NA>      3
## 3   <NA> <NA>   <NA>   <NA>      <NA>  <NA>      <NA>  <NA>      0
## 4   <NA> <NA>   <NA>   <NA>      <NA>  <NA>      <NA>  <NA>      6
## 5   <NA> <NA>   <NA>   <NA>      <NA>  <NA>       Inf  <NA>      6
## 6   <NA> <NA>   <NA>   <NA>      <NA>  Past      Part  Pass      3
##   relation word_source lemma_source spaces
## 1     expl      better       better      1
## 2      cop      better       better      1
## 3     root        ROOT         ROOT      1
## 4     mark      looked         look      1
## 5 aux:pass      looked         look      1
## 6    csubj      better       better      1

Please let me know if you have any thoughts or suggestions on how it has been incorporated. Thanks again for letting me know about the package!

Wow. Great effort!

I've had a look at the code. Here are some remarks:

I see you started working from the CoNLL-U output. One important note is that the token identifiers are not always integers. E.g. in German the word zum is split into 2 tokens (zu and dem), and the multi-word token itself gets the range identifier 3-4, as shown below. This will cause issues when the identifiers are read as integers.
There are quite a few possible features (they are defined at http://universaldependencies.org/u/feat/index.html), namely 21, and maybe more language-specific ones (as defined in http://universaldependencies.org/guidelines.html). Maybe it makes sense to have another grouping like cnlp_get_features, but I'm not sure about that.
In the documentation of ?cnlp_init_udpipe you probably want to leave out the spacy / corenlp parts, and maybe also add it to the DESCRIPTION. Currently the models are downloaded with udpipe_download_model from https://github.com/jwijffels/udpipe.models.ud.2.0. I might also add another repository with models built using the udpipe R package on newer CoNLL-U files. I'll release these as CC-BY-SA if the CoNLL-U files are also made available under that license. So at a later stage I might add an argument to udpipe_download_model called location or something similar.
You can train your models to do only e.g. POS tagging or only dependency parsing. The models downloaded with udpipe_download_model do everything. But I also have some local models which e.g. only do tokenisation and POS tagging. These give errors in cleanNLP, basically because no dependency parser was trained. Example below. I was planning to write some blog posts soon on how to train a model, but the vignette already shows the basics of how to do this.

library(udpipe)
m <- udpipe_download_model("german")
m <- udpipe_load_model(m$file_model)
x <- udpipe_annotate(m, x = "Wir gehen zum kino")
as.data.frame(x)
  doc_id paragraph_id sentence_id           sentence token_id token lemma  upos  xpos                                                           feats head_token_id
1   doc1            1           1 Wir gehen zum kino        1   Wir   wir  PRON  PPER                      Case=Nom|Number=Plur|Person=1|PronType=Prs             2
2   doc1            1           1 Wir gehen zum kino        2 gehen gehen  VERB VVFIN                               Number=Plur|Person=1|VerbForm=Fin             0
3   doc1            1           1 Wir gehen zum kino      3-4   zum  <NA>  <NA>  <NA>                                                            <NA>          <NA>
4   doc1            1           1 Wir gehen zum kino        3    zu    zu   ADP  APPR                                                            <NA>             5
5   doc1            1           1 Wir gehen zum kino        4   dem   der   DET   ART Case=Dat|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Art             5
6   doc1            1           1 Wir gehen zum kino        5  kino  kino PROPN    NN                           Case=Dat|Gender=Masc,Neut|Number=Sing             2

library(cleanNLP)
cnlp_init_udpipe(model_name = "german")
anno <- cnlp_annotate("Wir gehen zum kino", as_strings = TRUE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'an integer', got '3-4'

Example on a locally trained model:

cnlp_init_udpipe(model_path = "C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag.udpipe")
anno <- cnlp_annotate("Ik ga op reis naar de Caraiben.", as_strings = TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  no lines available in input

Browse[2]> 
debug: anno <- udpipe::udpipe_annotate(volatiles$udpipe$model_obj, input_txt)
Browse[2]> 
debug: out <- from_udpipe_CoNLL(anno$conllu)
Browse[2]> anno
$x
[1] "Ik ga op reis naar de Caraiben."

$conllu
[1] ""

$errors
[1] "No parser defined for the UDPipe model!"

attr(,"class")
[1] "udpipe_connlu"

## This works as follows:
m <- udpipe_load_model("C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag.udpipe")
x <- udpipe_annotate(m, x = "Ik ga op reis naar de Caraiben.", parser = "none")
as.data.frame(x)
  doc_id paragraph_id sentence_id                        sentence token_id    token    lemma  upos                         xpos                          feats head_token_id dep_rel deps
1   doc1            1           1 Ik ga op reis naar de Caraiben.        1       Ik       ik  PRON VNW|pers|pron|nomin|vol|1|ev Case=Nom|Person=1|PronType=Prs          <NA>    <NA> <NA>
2   doc1            1           1 Ik ga op reis naar de Caraiben.        2       ga       ga  NOUN             N|soort|mv|basis                    Number=Plur          <NA>    <NA> <NA>
3   doc1            1           1 Ik ga op reis naar de Caraiben.        3       op       op   ADP                      VZ|init                           <NA>          <NA>    <NA> <NA>
4   doc1            1           1 Ik ga op reis naar de Caraiben.        4     reis     reis  NOUN   N|soort|ev|basis|zijd|stan         Gender=Com|Number=Sing          <NA>    <NA> <NA>
5   doc1            1           1 Ik ga op reis naar de Caraiben.        5     naar     naar   ADP                      VZ|init                           <NA>          <NA>    <NA> <NA>
6   doc1            1           1 Ik ga op reis naar de Caraiben.        6       de       de   DET            LID|bep|stan|rest                   Definite=Def          <NA>    <NA> <NA>
7   doc1            1           1 Ik ga op reis naar de Caraiben.        7 Caraiben Caraiben  NOUN   N|soort|ev|basis|zijd|stan         Gender=Com|Number=Sing          <NA>    <NA> <NA>
8   doc1            1           1 Ik ga op reis naar de Caraiben.        8        .        . PUNCT                          LET                           <NA>          <NA>    <NA> <NA>
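The read.table failure in the debug session above happens because the empty conllu string is handed straight to the CoNLL-U reader while the real cause sits in the errors field. A minimal sketch of a guard that surfaces that message first is given below. The helper name check_udpipe_result is hypothetical (it is not part of cleanNLP or udpipe), the object is a hand-built mock with the same fields as the printed result, and it assumes that errors is an empty string when annotation succeeds.

```r
# Hypothetical guard (not part of cleanNLP or udpipe): fail early with the
# message udpipe reported instead of passing an empty string to a reader.
# Assumption: `errors` is "" (or empty) when the annotation succeeded.
check_udpipe_result <- function(anno) {
  has_error <- length(anno$errors) > 0 && any(nzchar(anno$errors))
  is_empty  <- length(anno$conllu) == 0 || !any(nzchar(anno$conllu))
  if (has_error || is_empty) {
    msg <- if (has_error) {
      paste(anno$errors, collapse = "; ")
    } else {
      "empty CoNLL-U output"
    }
    stop("udpipe annotation failed: ", msg, call. = FALSE)
  }
  invisible(anno)
}

# Mock object mirroring the fields shown in the debug session.
failed <- list(x = "Ik ga op reis naar de Caraiben.",
               conllu = "",
               errors = "No parser defined for the UDPipe model!")

# Capture the error message rather than aborting the script.
result <- tryCatch(check_udpipe_result(failed),
                   error = function(e) conditionMessage(e))
print(result)
```

With a check like this, a tokenise-and-tag-only model would produce a message pointing at the missing parser, which hints at the parser = "none" call shown above.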

Maybe for these models it makes sense to keep the model_name that was passed in, instead of reporting the language as custom:

cnlp_init_udpipe(model_path = "C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag-parse.udpipe", model_name = "nl")
anno <- cnlp_annotate("Ik ga op reis naar de Caraiben.", as_strings = TRUE)
cnlp_get_document(anno)
 doc_id                time version language  uri
  doc1 2018-01-03 10:30:29     0.3   custom <NA>

Glad you like the package! It saves me at least a whole lot of installation problems when giving courses on text mining.

Thanks for the quick and thorough feedback. Yes, I agree that this will be a great benefit for teaching and for casual users. I just uploaded version 2.0.1 to GitHub with the following changes:

cnlp_init_udpipe: now supports the option 'feature_flag', which lets users decide if they want to include the features in the output.
cnlp_init_udpipe: now also supports the option 'parser', to let users select whether they want the default, none, or a custom parser.
cnlp_init_udpipe: I fixed the documentation there that still referred to the spacy backend
cnlp_download_udpipe: I added an explicit function to download models and save in the correct place; it is still run automatically by the init function, but I included this to support allowing a custom location parameter as you mentioned.
language name: the language name now gives the full model file name regardless of whether it is a default or custom model; this gives much more useful information in both cases
multi-tokens: this is the hardest thing to fix because the design of cleanNLP assumes that token ids are unique integers. For now, I have changed the code to simply remove the rows corresponding to the multi-word tokens. At least for the German example this makes sense (users still get the "zu" and "dem"). It is on my mental to-do list to figure out a better solution.
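
The multi-token workaround in the last item can be illustrated in base R. This is only a sketch under stated assumptions, not the code that went into cleanNLP: the data frame is a hand-built stand-in for the udpipe output on "Wir gehen zum kino", and drop_multiword_rows is a hypothetical helper name.

```r
# Hand-built stand-in for the udpipe output on "Wir gehen zum kino";
# only the columns needed for the illustration are kept.
tokens <- data.frame(
  token_id = c("1", "2", "3-4", "3", "4", "5"),
  token    = c("Wir", "gehen", "zum", "zu", "dem", "kino"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of cleanNLP or udpipe): keep only rows whose
# id is a plain integer, which drops range ids such as "3-4", then convert
# the remaining ids to integers.
drop_multiword_rows <- function(x) {
  keep <- grepl("^[0-9]+$", x$token_id)
  out <- x[keep, , drop = FALSE]
  out$token_id <- as.integer(out$token_id)
  rownames(out) <- NULL
  out
}

clean <- drop_multiword_rows(tokens)
print(clean)
```

The trade-off is that the surface form "zum" is lost; one possible refinement would be to carry it along in an extra column on the "zu" row, but that is a design decision for the package.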

Here is an example with the features and the parser turned off:

library(cleanNLP)
cnlp_init_udpipe(feature_flag = FALSE, parser = "none")
output <- cnlp_get_tif(cnlp_annotate("It is better to be looked over than overlooked."))
print.data.frame(head(output))

  doc_id sid tid   word  lemma upos pos cid pid spaces
1   doc1   1   1     It     it PRON PRP   0   1      1
2   doc1   1   2     is     be  AUX VBZ   3   1      1
3   doc1   1   3 better better  ADJ JJR   6   1      1
4   doc1   1   4     to     to PART  TO  13   1      1
5   doc1   1   5     be     be  AUX  VB  16   1      1
6   doc1   1   6 looked   look VERB VBN  19   1      1

And one parsing the German text you referenced:

library(cleanNLP)
cnlp_init_udpipe(model_name = "german")
anno <- cnlp_annotate("Wir gehen zum kino", as_strings = TRUE)
print.data.frame(head(cnlp_get_tif(anno)))

  doc_id sid tid  word lemma  upos   pos cid pid case definite    gender number
1   doc1   1   1   Wir   wir  PRON  PPER   0   1  Nom     <NA>      <NA>   Plur
2   doc1   1   2 gehen gehen  VERB VVFIN   4   1 <NA>     <NA>      <NA>   Plur
3   doc1   1   3    zu    zu   ADP  APPR  10   1 <NA>     <NA>      <NA>   <NA>
4   doc1   1   4   dem   der   DET   ART  13   1  Dat      Def Masc,Neut   Sing
5   doc1   1   5  kino  kino PROPN    NN  17   1  Dat     <NA> Masc,Neut   Sing
  person pron_type verb_form source relation word_source lemma_source spaces
1      1       Prs      <NA>      2    nsubj       gehen        gehen      1
2      1      <NA>       Fin      0     root        ROOT         ROOT      1
3   <NA>      <NA>      <NA>      5     case        kino         kino      1
4   <NA>       Art      <NA>      5      det        kino         kino      1
5   <NA>      <NA>      <NA>      2      obl       gehen        gehen      0
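
The per-feature columns in output like this correspond to the Key=Value|Key=Value strings in the feats column of the udpipe output quoted earlier in the thread. A minimal base R sketch of that kind of spreading follows. The helper split_feats is hypothetical and is not how cleanNLP does it internally; it also keeps the Universal Dependencies key names as they are, whereas the cleanNLP columns above use lower-case names such as pron_type.

```r
# Feature strings copied from the German udpipe output earlier in the thread.
feats <- c("Case=Nom|Number=Plur|Person=1|PronType=Prs",
           "Number=Plur|Person=1|VerbForm=Fin",
           NA,
           "Case=Dat|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Art")

# Hypothetical helper: split each "Key=Value|Key=Value" string into a named
# character vector, then spread the keys into one column per feature.
split_feats <- function(feats) {
  parsed <- lapply(feats, function(f) {
    if (is.na(f)) return(character(0))
    pairs <- strsplit(strsplit(f, "|", fixed = TRUE)[[1]], "=", fixed = TRUE)
    stats::setNames(vapply(pairs, `[`, character(1), 2),
                    vapply(pairs, `[`, character(1), 1))
  })
  keys <- sort(unique(unlist(lapply(parsed, names))))
  out <- lapply(keys, function(k) {
    vapply(parsed,
           function(p) if (k %in% names(p)) p[[k]] else NA_character_,
           character(1))
  })
  as.data.frame(stats::setNames(out, keys), stringsAsFactors = FALSE)
}

wide <- split_feats(feats)
print(wide)
```

Because the set of columns depends on which keys occur in the input, the width of the result varies by language and model, which is one reason an option like feature_flag is convenient.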

If you have any other suggestions I would greatly appreciate them!

Thank you for the changes. This looks good to me. Thank you for all the effort!

@statsmaths FYI: udpipe_download_model has now gained an extra argument called udpipe_model_repo, indicating which GitHub repository to download the model from. That argument accepts either of:

"jwijffels/udpipe.models.ud.2.0": models provided by the UDPipe community
"bnosac/udpipe.models.ud": models trained using the udpipe R package directly

The default is "jwijffels/udpipe.models.ud.2.0".

statsmaths / cleanNLP

Udpipe as backend #27