Closed mjockers closed 4 years ago
Hi Matt, that's a great question (and one I should probably explain in the README). The dependencies still exist, but they are now attached directly to the tokens table. There's a one-to-one correspondence between tokens and dependencies, and a fairly canonical way of mapping one to the other, so I now keep them together in the tokens table from the start. My first step in almost any analysis is to join the two together anyway.
If you want the source of each dependency relation, you only need a self-join on the tokens table. If you annotate some text as in the README:
library(dplyr)
library(cleanNLP)
cnlp_init_udpipe()
annotation <- cnlp_annotate(input = c(
  "Here is the first text. It is short.",
  "Here's the second. It is short too!",
  "The third text is the shortest."
))
Then you can get the source dependencies with:
with_source <- annotation$token %>%
  left_join(
    select(annotation$token, doc_id, sid, tid, token, lemma),
    by = c("doc_id" = "doc_id", "sid" = "sid", "tid_source" = "tid"),
    suffix = c("", "_source")
  )
select(with_source, doc_id, sid, tid, token, token_source, relation)
# A tibble: 27 x 6
   doc_id   sid tid   token token_source relation
    <int> <int> <chr> <chr> <chr>        <chr>
 1      1     1 1     Here  NA           root
 2      1     1 2     is    Here         cop
 3      1     1 3     the   text         det
 4      1     1 4     first text         amod
 5      1     1 5     text  Here         nsubj
 6      1     1 6     .     Here         punct
 7      1     2 1     It    short        nsubj
 8      1     2 2     is    short        cop
 9      1     2 3     short NA           root
10      1     2 4     .     short        punct
# … with 17 more rows
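To see why the root rows come back with NA, here is a minimal sketch of the same self-join on a hand-built table, so it runs without downloading a udpipe model. The `toy` tibble is made up for illustration and only mimics the columns used above. It assumes the root token carries a `tid_source` of "0", which matches no `tid`, so the left join has nothing to attach.

```r
library(dplyr)

# Hand-built stand-in for annotation$token (first sentence only).
toy <- tibble(
  doc_id     = 1L,
  sid        = 1L,
  tid        = c("1", "2", "3", "4", "5", "6"),
  token      = c("Here", "is", "the", "first", "text", "."),
  tid_source = c("0", "1", "5", "5", "1", "1"),  # assumed: "0" marks the root
  relation   = c("root", "cop", "det", "amod", "nsubj", "punct")
)

# Self-join: match each token's tid_source to the tid of its head token.
toy_with_source <- toy %>%
  left_join(
    select(toy, doc_id, sid, tid, token),
    by = c("doc_id" = "doc_id", "sid" = "sid", "tid_source" = "tid"),
    suffix = c("", "_source")
  )

select(toy_with_source, tid, token, token_source, relation)
```

Because this is a left join, every token is kept even when no head is found, which is what leaves the root with a missing `token_source`.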
I am working on a helper function that automates the task above and also lets you move "up the dependency tree". I'm still trying to figure out the right semantics for it, but hopefully it will be available in the next release.
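As a rough idea of what moving up the tree could look like in the meantime, the sketch below repeats the self-join once per level. The function name `add_ancestor()`, its `levels` argument, and the `token_up` column are all hypothetical and not part of cleanNLP. The toy table is made up and again assumes "0" marks the root.

```r
library(dplyr)

# Hypothetical helper (not a cleanNLP function): attach the token found
# `levels` steps up the dependency tree as a new column `token_up`.
add_ancestor <- function(tokens, levels = 1L) {
  # Lookup table: for every token, its text and the tid of its own head.
  heads <- select(tokens, doc_id, sid, tid,
                  token_up = token, tid_next = tid_source)
  # tid_ptr tracks which token we are currently pointing at.
  out <- mutate(tokens, tid_ptr = tid_source)
  for (i in seq_len(levels)) {
    out <- out %>%
      select(-any_of("token_up")) %>%
      left_join(heads,
                by = c("doc_id" = "doc_id", "sid" = "sid", "tid_ptr" = "tid")) %>%
      mutate(tid_ptr = tid_next) %>%   # step one level further up
      select(-tid_next)
  }
  select(out, -tid_ptr)
}

# Made-up stand-in for annotation$token (one sentence).
toy <- tibble(
  doc_id     = 1L,
  sid        = 1L,
  tid        = c("1", "2", "3", "4", "5", "6"),
  token      = c("Here", "is", "the", "first", "text", "."),
  tid_source = c("0", "1", "5", "5", "1", "1"),  # assumed: "0" marks the root
  relation   = c("root", "cop", "det", "amod", "nsubj", "punct")
)

parents      <- add_ancestor(toy, levels = 1)
grandparents <- add_ancestor(toy, levels = 2)
select(grandparents, tid, token, token_up, relation)
```

With `levels = 1` this gives the same heads as the plain self-join. With `levels = 2`, tokens whose head is already the root get NA, since there is nothing further up.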
Glad to hear that you find the package useful. Please let me know if you have any other questions or suggestions!
Thank you! Very slick.
Hi Taylor, I love the clean[NLP]ness of the new version! But I am missing some things (or don't see how to get them). In the previous version, using cnlp_init_spacy(), the annotation object included $dependencies. Now it has just "token", "entity", and "document". Are dependencies no longer calculated as part of cnlp_annotate()? Am I missing an argument that needs to be set? Thanks, Matt