statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

Udpipe as backend #27

Closed jwijffels closed 6 years ago

jwijffels commented 6 years ago

Would it make sense to also add the udpipe R package as a backend? That package also has no external dependencies and provides tokenisation, lemmatisation, POS tagging, feature tagging and dependency parsing. The package is available at https://cran.r-project.org/web/packages/udpipe/index.html

statsmaths commented 6 years ago

Yes! That would be a great idea actually. I was not aware of the udpipe package, but had been hoping someone would implement something that does not depend on Python or Java. More updates soon.

jwijffels commented 6 years ago

Great! If you need any help or input from me on this, let me know.

statsmaths commented 6 years ago

I just pushed cleanNLP version 2.0.0 to GitHub. It is a major rewrite of the annotation tasks, so it still needs some testing, but it now supports the udpipe backend. Install from GitHub:

devtools::install_github("statsmaths/cleanNLP")

And then a minimal working example is given with this input data:

text <- c("It is better to be looked over than overlooked.",
         "Real stupidity beats artificial intelligence every time.",
         "The secret of getting ahead is getting started.")
input <- data.frame(doc_id = c("West", "Pratchett", "Twain"),
                        text = text,
                        stringsAsFactors = FALSE)

Parsed with:

library(cleanNLP)
cnlp_init_udpipe()
output <- cnlp_get_tif(cnlp_annotate_tif(input))
print.data.frame(head(output))
##   doc_id sid tid   word  lemma upos pos cid pid case definite degree
## 1   West   1   1     It     it PRON PRP   0   1  Nom     <NA>   <NA>
## 2   West   1   2     is     be  AUX VBZ   3   1 <NA>     <NA>   <NA>
## 3   West   1   3 better better  ADJ JJR   6   1 <NA>     <NA>    Cmp
## 4   West   1   4     to     to PART  TO  13   1 <NA>     <NA>   <NA>
## 5   West   1   5     be     be  AUX  VB  16   1 <NA>     <NA>   <NA>
## 6   West   1   6 looked   look VERB VBN  19   1 <NA>     <NA>   <NA>
##   gender mood number person pron_type tense verb_form voice source
## 1   Neut <NA>   Sing      3       Prs  <NA>      <NA>  <NA>      3
## 2   <NA>  Ind   Sing      3      <NA>  Pres       Fin  <NA>      3
## 3   <NA> <NA>   <NA>   <NA>      <NA>  <NA>      <NA>  <NA>      0
## 4   <NA> <NA>   <NA>   <NA>      <NA>  <NA>      <NA>  <NA>      6
## 5   <NA> <NA>   <NA>   <NA>      <NA>  <NA>       Inf  <NA>      6
## 6   <NA> <NA>   <NA>   <NA>      <NA>  Past      Part  Pass      3
##   relation word_source lemma_source spaces
## 1     expl      better       better      1
## 2      cop      better       better      1
## 3     root        ROOT         ROOT      1
## 4     mark      looked         look      1
## 5 aux:pass      looked         look      1
## 6    csubj      better       better      1

Please let me know if you have any thoughts or suggestions on how it has been incorporated. Thanks again for letting me know about the package!

jwijffels commented 6 years ago

Wow. Great effort!

I've had a look at the code. Here are some remarks:

library(udpipe)
m <- udpipe_download_model("german")
m <- udpipe_load_model(m$file_model)
x <- udpipe_annotate(m, x = "Wir gehen zum kino")
as.data.frame(x)
  doc_id paragraph_id sentence_id           sentence token_id token lemma  upos  xpos                                                           feats head_token_id
1   doc1            1           1 Wir gehen zum kino        1   Wir   wir  PRON  PPER                      Case=Nom|Number=Plur|Person=1|PronType=Prs             2
2   doc1            1           1 Wir gehen zum kino        2 gehen gehen  VERB VVFIN                               Number=Plur|Person=1|VerbForm=Fin             0
3   doc1            1           1 Wir gehen zum kino      3-4   zum  <NA>  <NA>  <NA>                                                            <NA>          <NA>
4   doc1            1           1 Wir gehen zum kino        3    zu    zu   ADP  APPR                                                            <NA>             5
5   doc1            1           1 Wir gehen zum kino        4   dem   der   DET   ART Case=Dat|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Art             5
6   doc1            1           1 Wir gehen zum kino        5  kino  kino PROPN    NN                           Case=Dat|Gender=Masc,Neut|Number=Sing             2

library(cleanNLP)
cnlp_init_udpipe(model_name = "german")
anno <- cnlp_annotate("Wir gehen zum kino", as_strings = TRUE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'an integer', got '3-4'
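The `3-4` row is a CoNLL-U multiword token range ("zum" spans the words "zu" and "dem"), so `token_id` is not always an integer. One possible way to handle it is sketched below in base R. The data frame is hand-built to mirror the udpipe output above, and `drop_multiword_ranges` is a hypothetical helper, not a function from udpipe or cleanNLP.

```r
# Hand-built stand-in for as.data.frame(udpipe_annotate(...)); values are
# copied from the German example above, so no udpipe model is needed here.
tokens <- data.frame(
  token_id = c("1", "2", "3-4", "3", "4", "5"),
  token    = c("Wir", "gehen", "zum", "zu", "dem", "kino"),
  lemma    = c("wir", "gehen", NA, "zu", "der", "kino"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the range rows (such as "3-4") and keep the
# syntactic words, so that token_id can be coerced to integer safely.
drop_multiword_ranges <- function(x) {
  is_range <- grepl("-", x$token_id, fixed = TRUE)
  out <- x[!is_range, , drop = FALSE]
  out$token_id <- as.integer(out$token_id)
  rownames(out) <- NULL
  out
}

words <- drop_multiword_ranges(tokens)
print(words)
```

Dropping the range row loses the surface form "zum"; keeping it in a separate column would be an alternative if the original spelling matters.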

Example on locally trained model.

cnlp_init_udpipe(model_path = "C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag.udpipe")
anno <- cnlp_annotate("Ik ga op reis naar de Caraiben.", as_strings = TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  no lines available in input

Browse[2]> 
debug: anno <- udpipe::udpipe_annotate(volatiles$udpipe$model_obj, input_txt)
Browse[2]> 
debug: out <- from_udpipe_CoNLL(anno$conllu)
Browse[2]> anno
$x
[1] "Ik ga op reis naar de Caraiben."

$conllu
[1] ""

$errors
[1] "No parser defined for the UDPipe model!"

attr(,"class")
[1] "udpipe_connlu"

## This works as follows:
m <- udpipe_load_model("C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag.udpipe")
x <- udpipe_annotate(m, x = "Ik ga op reis naar de Caraiben.", parser = "none")
as.data.frame(x)
  doc_id paragraph_id sentence_id                        sentence token_id    token    lemma  upos                         xpos                          feats head_token_id dep_rel deps
1   doc1            1           1 Ik ga op reis naar de Caraiben.        1       Ik       ik  PRON VNW|pers|pron|nomin|vol|1|ev Case=Nom|Person=1|PronType=Prs          <NA>    <NA> <NA>
2   doc1            1           1 Ik ga op reis naar de Caraiben.        2       ga       ga  NOUN             N|soort|mv|basis                    Number=Plur          <NA>    <NA> <NA>
3   doc1            1           1 Ik ga op reis naar de Caraiben.        3       op       op   ADP                      VZ|init                           <NA>          <NA>    <NA> <NA>
4   doc1            1           1 Ik ga op reis naar de Caraiben.        4     reis     reis  NOUN   N|soort|ev|basis|zijd|stan         Gender=Com|Number=Sing          <NA>    <NA> <NA>
5   doc1            1           1 Ik ga op reis naar de Caraiben.        5     naar     naar   ADP                      VZ|init                           <NA>          <NA>    <NA> <NA>
6   doc1            1           1 Ik ga op reis naar de Caraiben.        6       de       de   DET            LID|bep|stan|rest                   Definite=Def          <NA>    <NA> <NA>
7   doc1            1           1 Ik ga op reis naar de Caraiben.        7 Caraiben Caraiben  NOUN   N|soort|ev|basis|zijd|stan         Gender=Com|Number=Sing          <NA>    <NA> <NA>
8   doc1            1           1 Ik ga op reis naar de Caraiben.        8        .        . PUNCT                          LET                           <NA>          <NA>    <NA> <NA>
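The debug output above shows that a failed annotation comes back with an empty `conllu` string and a message in `errors`. A guard along the following lines could surface that message before the CoNLL-U text is parsed. This is only a sketch: the object is a hand-built copy of the one printed above, and `check_annotation` is a hypothetical helper, not part of udpipe or cleanNLP.

```r
# Hand-built copy of the failed result printed in the debug session above.
failed <- structure(
  list(
    x      = "Ik ga op reis naar de Caraiben.",
    conllu = "",
    errors = "No parser defined for the UDPipe model!"
  ),
  class = "udpipe_connlu"
)

# Hypothetical guard: stop with udpipe's own message when no CoNLL-U text
# was produced, instead of failing later inside read.table().
check_annotation <- function(anno) {
  if (!any(nzchar(anno$conllu))) {
    stop("udpipe returned no annotation: ",
         paste(anno$errors, collapse = "; "), call. = FALSE)
  }
  invisible(anno)
}

# Capture the error message so the example runs to completion.
msg <- tryCatch({
  check_annotation(failed)
  "ok"
}, error = function(e) conditionMessage(e))
print(msg)
```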

Maybe for these models it makes sense to keep the supplied model_name as the language instead of reporting it as custom:

cnlp_init_udpipe(model_path = "C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dev/nl-lassysmall-token-tag-parse.udpipe", model_name = "nl")
anno <- cnlp_annotate("Ik ga op reis naar de Caraiben.", as_strings = TRUE)
cnlp_get_document(anno)
 doc_id                time version language  uri
  doc1 2018-01-03 10:30:29     0.3   custom <NA>

Glad you like the package! It saves me at least a whole lot of installation problems when giving courses on text mining.

statsmaths commented 6 years ago

Thanks for the quick and thorough feedback. Yes, I agree that this will be a great benefit to teaching and casual users. I just uploaded version 2.0.1 to GitHub with the following changes:

Here is an example of the features and parsers turned off:

library(cleanNLP)
cnlp_init_udpipe(feature_flag = FALSE, parser = "none")
output <- cnlp_get_tif(cnlp_annotate("It is better to be looked over than overlooked."))
print.data.frame(head(output))
  doc_id sid tid   word  lemma upos pos cid pid spaces
1   doc1   1   1     It     it PRON PRP   0   1      1
2   doc1   1   2     is     be  AUX VBZ   3   1      1
3   doc1   1   3 better better  ADJ JJR   6   1      1
4   doc1   1   4     to     to PART  TO  13   1      1
5   doc1   1   5     be     be  AUX  VB  16   1      1
6   doc1   1   6 looked   look VERB VBN  19   1      1

And one parsing the German text you referenced:

library(cleanNLP)
cnlp_init_udpipe(model_name = "german")
anno <- cnlp_annotate("Wir gehen zum kino", as_strings = TRUE)
print.data.frame(head(cnlp_get_tif(anno)))
  doc_id sid tid  word lemma  upos   pos cid pid case definite    gender number
1   doc1   1   1   Wir   wir  PRON  PPER   0   1  Nom     <NA>      <NA>   Plur
2   doc1   1   2 gehen gehen  VERB VVFIN   4   1 <NA>     <NA>      <NA>   Plur
3   doc1   1   3    zu    zu   ADP  APPR  10   1 <NA>     <NA>      <NA>   <NA>
4   doc1   1   4   dem   der   DET   ART  13   1  Dat      Def Masc,Neut   Sing
5   doc1   1   5  kino  kino PROPN    NN  17   1  Dat     <NA> Masc,Neut   Sing
  person pron_type verb_form source relation word_source lemma_source spaces
1      1       Prs      <NA>      2    nsubj       gehen        gehen      1
2      1      <NA>       Fin      0     root        ROOT         ROOT      1
3   <NA>      <NA>      <NA>      5     case        kino         kino      1
4   <NA>       Art      <NA>      5      det        kino         kino      1
5   <NA>      <NA>      <NA>      2      obl       gehen        gehen      0

If you have any other suggestions I would greatly appreciate them!

jwijffels commented 6 years ago

Thank you for the changes. This looks good to me. Thank you for all the effort!

jwijffels commented 6 years ago

@statsmaths FYI: udpipe_download_model has gained an extra argument, udpipe_model_repo, which indicates the GitHub repository to download the model from. You can either indicate in that argument

The default is 'jwijffels/udpipe.models.ud.2.0'
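As a rough illustration, the argument could be passed as follows. Only the default repository named above is used; the download is guarded so that it runs only in an interactive session with udpipe installed, which is a choice made for this sketch and not something udpipe requires.

```r
# Default repository named in this thread; no other repository is assumed.
repo <- "jwijffels/udpipe.models.ud.2.0"

# Guarded so the sketch does not hit the network in non-interactive runs.
if (requireNamespace("udpipe", quietly = TRUE) && interactive()) {
  m <- udpipe::udpipe_download_model(language = "german",
                                     udpipe_model_repo = repo)
  model <- udpipe::udpipe_load_model(m$file_model)
}
```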