statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

Dependencies? #66

Closed mjockers closed 4 years ago

mjockers commented 4 years ago

Hi Taylor, Love the clean[NLP]ness of the new version! But I am missing some things (or don't see how to get them). In the previous version, using cnlp_init_spacy(), the annotation object included $dependencies. Now it has just "token", "entity", and "document". Are dependencies no longer calculated as part of cnlp_annotate()? Am I missing an argument that needs to be set? Thanks, Matt

statsmaths commented 4 years ago

Hi Matt, That's a great question (and one I should probably explain in the README). The dependencies still exist, but are now attached directly to the tokens table. There's a one-to-one correspondence between tokens and dependency relations (each token has exactly one incoming relation), and a fairly canonical way of mapping one to the other, so I just keep them together in the tokens table from the start. My first step in almost any analysis is to join the two together anyway.
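To make the one-to-one layout concrete, here is a minimal sketch on a hand-built stand-in for the token table. The rows are copied from the first document of the udpipe output later in this thread. The tid_source values, including "0" for root tokens, are assumptions inferred from that output rather than real annotator results. If a standalone dependency table is still wanted, a column subset may be all that is needed:

```r
library(dplyr)

# Hand-built stand-in for annotation$token (first document only).
# Assumption: root tokens carry tid_source "0".
token <- tibble(
  doc_id     = 1L,
  sid        = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  tid        = c("1", "2", "3", "4", "5", "6", "1", "2", "3", "4"),
  token      = c("Here", "is", "the", "first", "text", ".",
                 "It", "is", "short", "."),
  tid_source = c("0", "1", "5", "5", "1", "1", "3", "3", "0", "3"),
  relation   = c("root", "cop", "det", "amod", "nsubj", "punct",
                 "nsubj", "cop", "root", "punct")
)

# One row per token means one row per incoming dependency relation,
# so a separate dependency table is just a subset of the columns.
dependencies <- select(token, doc_id, sid, tid, tid_source, relation)

# Sanity check: each (doc_id, sid, tid) key appears exactly once.
stopifnot(nrow(distinct(dependencies, doc_id, sid, tid)) == nrow(token))

print(dependencies)
```

With a real annotation, the same select() call would presumably be applied to annotation$token instead of the toy table.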

If you want the source token of each dependency relation, you only need a self-join on the tokens table. Suppose you annotate something as in the README:

library(dplyr)
library(cleanNLP)
cnlp_init_udpipe()

annotation <- cnlp_annotate(input = c(
        "Here is the first text. It is short.",
        "Here's the second. It is short too!",
        "The third text is the shortest."
))

Then you can get the source dependencies with:

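# Self-join the token table: match each token's tid_source to the tid of
# the token it depends on, within the same document and sentence.
with_source <- annotation$token %>%
    left_join(select(annotation$token, doc_id, sid, tid, token, lemma),
              by = c("doc_id" = "doc_id", "sid" = "sid", "tid_source" = "tid"),
              suffix = c("", "_source"))
# Root tokens find no match in the join, so their token_source is NA.
select(with_source, doc_id, sid, tid, token, token_source, relation)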
# A tibble: 27 x 6
   doc_id   sid tid   token token_source relation
    <int> <int> <chr> <chr> <chr>        <chr>
 1      1     1 1     Here  NA           root
 2      1     1 2     is    Here         cop
 3      1     1 3     the   text         det
 4      1     1 4     first text         amod
 5      1     1 5     text  Here         nsubj
 6      1     1 6     .     Here         punct
 7      1     2 1     It    short        nsubj
 8      1     2 2     is    short        cop
 9      1     2 3     short NA           root
10      1     2 4     .     short        punct
# … with 17 more rows

I am working on a helper function that automates the above task, while also allowing you to move "up the dependency tree". I'm still trying to figure out the right semantics for it, but it will hopefully be available in the next release.
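Until such a helper exists, one possible way to move up the tree is to repeat the self-join. The sketch below is only an illustration under stated assumptions, not the planned cleanNLP function: ancestor_token() and its steps argument are hypothetical names, and the toy table is hand-built from the output above, with tid_source "0" assumed for root tokens.

```r
library(dplyr)

# Hand-built stand-in for annotation$token (first document only).
# Assumption: root tokens carry tid_source "0".
token <- tibble(
  doc_id     = 1L,
  sid        = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  tid        = c("1", "2", "3", "4", "5", "6", "1", "2", "3", "4"),
  token      = c("Here", "is", "the", "first", "text", ".",
                 "It", "is", "short", "."),
  tid_source = c("0", "1", "5", "5", "1", "1", "3", "3", "0", "3"),
  relation   = c("root", "cop", "det", "amod", "nsubj", "punct",
                 "nsubj", "cop", "root", "punct")
)

# Hypothetical helper (not part of cleanNLP): find the token that sits
# `steps` levels above each token in the dependency tree.
ancestor_token <- function(tokens, steps = 1L) {
  # Lookup table from a token id to the id of the token it depends on.
  head_of <- select(tokens, doc_id, sid, tid_cur = tid, tid_next = tid_source)

  # Start at the token itself and follow tid_source once per step.
  out <- mutate(tokens, tid_cur = tid)
  for (i in seq_len(steps)) {
    out <- out %>%
      left_join(head_of, by = c("doc_id", "sid", "tid_cur")) %>%
      mutate(tid_cur = tid_next) %>%
      select(-tid_next)
  }

  # Attach the surface form of whichever token was reached; walking
  # past the root leaves NA.
  label <- select(tokens, doc_id, sid, tid_cur = tid, token_ancestor = token)
  out %>%
    left_join(label, by = c("doc_id", "sid", "tid_cur")) %>%
    rename(tid_ancestor = tid_cur)
}

up1 <- ancestor_token(token, steps = 1L)  # same as the single self-join
up2 <- ancestor_token(token, steps = 2L)  # two levels up
print(select(up2, doc_id, sid, tid, token, token_ancestor, relation))
```

With steps = 1 this reproduces the token_source column from the self-join. With steps = 2, "the" and "first" resolve to "Here" via "text", and every other token in the toy table runs past the root and gets NA. Whether NA is the right result past the root is exactly the kind of semantic choice a real helper would have to settle.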

Glad to hear that you find the package useful. Please let me know if you have any other questions or suggestions!

mjockers commented 4 years ago

Thank you! Very slick.