unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

How can I extract proper nouns? #23

Closed: thomasrenkert closed this issue 4 years ago

thomasrenkert commented 4 years ago

I know how to extract proper nouns from a corpus in quanteda with spacyr. But for another corpus I need to use TreeTagger. I was able to lemmatize the corpus with koRpus and TreeTagger, but I don't know how to further analyze word forms and parts of speech. For instance, I would like to get a list of all proper nouns in the corpus. How can I do that in koRpus?

unDocUMeantIt commented 4 years ago

hi thomas, try ?query. for example, assuming your tagged object is called your_text:

# filter by word class
query(your_text, var="wclass", query="name")
# or by POS tag
query(your_text, var="tag", query="NP")
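
to illustrate what the two filters amount to, here is a rough base-R sketch on a made-up token table. the `tokens` data frame is hypothetical and only mimics three columns of a tagged text; it is not real koRpus output. one thing to keep in mind: "NP" is the proper noun tag of the English TreeTagger tag set, while the German tag set (STTS) uses "NE", so filtering by word class is the more portable choice.

```r
# Hypothetical stand-in for the token table of a tagged text.
# A real koRpus object holds more columns; only three are mimicked here.
tokens <- data.frame(
  token  = c("Angela", "Merkel", "besucht", "Paris", "."),
  tag    = c("NE", "NE", "VVFIN", "NE", "$."),
  wclass = c("name", "name", "verb", "name", "fullstop"),
  stringsAsFactors = FALSE
)

# filter by word class, independent of the tag set in use
by_wclass <- tokens[tokens$wclass == "name", ]

# filter by POS tag; the tag has to match the tag set of the language
by_tag <- tokens[tokens$tag == "NE", ]

# the English tag "NP" finds nothing in German-tagged data
by_wrong_tag <- tokens[tokens$tag == "NP", ]

print(by_wclass$token)
print(nrow(by_wrong_tag))
```

both filters select the same rows in this toy data, as long as the tag fits the language.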

thomasrenkert commented 4 years ago

Hi, thanks for your quick reply!

I get the error

Invalid var for class kRp.tagged: tag

unDocUMeantIt commented 4 years ago

which version of koRpus are you using? there were bugs in query() fixed in 0.12-1.
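
a quick way to check this from within R is sketched below. `is_older_than()` is a hypothetical helper written for this example, not part of koRpus; it only relies on base R's `package_version()`, which treats "0.12-1" and "0.12.1" as the same version.

```r
# TRUE if the installed version is older than the required one.
# is_older_than() is a made-up helper for illustration.
is_older_than <- function(installed, required = "0.12-1") {
  package_version(as.character(installed)) < package_version(required)
}

print(is_older_than("0.11-5"))  # an older release
print(is_older_than("0.12-1"))  # the release with the fixes

# with koRpus installed, compare the real version
if (nzchar(system.file(package = "koRpus"))) {
  print(utils::packageVersion("koRpus"))
  print(is_older_than(utils::packageVersion("koRpus")))
}
```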

thomasrenkert commented 4 years ago

I've tested it with the latest CRAN version and also with the development version from github. The error persists.

unDocUMeantIt commented 4 years ago

that's odd, i can't reproduce the issue. could you please

  1. give some environmental data on your setup (e.g., operating system, versions of R & koRpus)

  2. post the relevant code blocks you are running (i guess it is not related to the particular text you are tagging)
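
the environmental data from point 1 can be collected with a few lines of base R, roughly like this. `setup_report()` is a hypothetical helper made up for this sketch; the package names are the two discussed in this thread.

```r
# Collect R, platform and package versions as plain text lines.
# setup_report() is a made-up helper for illustration.
setup_report <- function(pkgs = c("koRpus", "sylly")) {
  pkg_lines <- vapply(pkgs, function(pkg) {
    # system.file() returns "" if the package is not installed
    if (nzchar(system.file(package = pkg))) {
      paste(pkg, as.character(utils::packageVersion(pkg)))
    } else {
      paste(pkg, "(not installed)")
    }
  }, character(1), USE.NAMES = FALSE)
  c(R.version.string, paste("platform:", R.version$platform), pkg_lines)
}

writeLines(setup_report())
```

alternatively, the output of `sessionInfo()` after loading koRpus covers the same ground in more detail.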

thomasrenkert commented 4 years ago

It works now, but only with the development versions from github and only when installing sylly separately.

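# devtools provides install_github()
library(devtools)
# install the develop branch of sylly first, then the develop branch of koRpus
install_github("unDocUMeantIt/sylly", ref="develop")
install_github("unDocUMeantIt/koRpus", ref="develop")
library(koRpus)
# install the language support packages and load the German one
install.koRpus.lang(lang=c("en", "de"))
library(koRpus.lang.de)
# tag the corpus with the German TreeTagger script
tagged_corpus <- treetag(
  "corpus.txt",
  treetagger="/opt/treetagger/cmd/tree-tagger-german",
  lang="de"
)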
names_corpus <- query(tagged_corpus, var="wclass", query="name")
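
Once the proper nouns are filtered, a frequency list is a natural next step. The sketch below uses only base R on a made-up data frame: `count_names()` and `toy` are hypothetical and written for this example. It assumes a token table with "token" and "wclass" columns; with a real tagged object, something like `taggedText(tagged_corpus)` should provide such a table, but I have not verified this for every koRpus version.

```r
# Count how often each proper noun occurs, most frequent first.
# `tokens` is assumed to be a data frame with "token" and "wclass" columns.
count_names <- function(tokens) {
  names_only <- tokens$token[tokens$wclass == "name"]
  sort(table(names_only), decreasing = TRUE)
}

# made-up example data, not actual TreeTagger output
toy <- data.frame(
  token  = c("Berlin", "liegt", "an", "der", "Spree", ".", "Berlin", "ist", "gross", "."),
  wclass = c("name", "verb", "preposition", "article", "name", "fullstop",
             "name", "verb", "adjective", "fullstop"),
  stringsAsFactors = FALSE
)

name_counts <- count_names(toy)
print(name_counts)
```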

unDocUMeantIt commented 4 years ago

yes, the development version is the forthcoming 0.13 release, which has drastic changes under the hood compared to 0.12, which in turn was already a huge step from 0.11-5 (CRAN). the object classes are totally redesigned, and the package depends on minor changes made to sylly; that's why you must use its develop branch as well. usage didn't change much, it's just the internals.

0.12 was more of an interim release, which is why i didn't push it to CRAN and am waiting for 0.13 to be ready instead. if you encounter any issues, let me know. i think it is already rather stable and safe to use.

unDocUMeantIt commented 4 years ago

btw, i'd recommend trying the presets, e.g.

set.kRp.env(
    TT.cmd="manual",
    TT.options=list(
        path="/opt/treetagger",
        preset="de"
    ),
    lang="de"
)
tagged_corpus <- treetag("corpus.txt")
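
since the preset works from the TreeTagger root folder given as path, a quick sanity check of that folder can save some debugging. the sketch below is base R only; `check_treetagger_dir()` is a hypothetical helper, and it assumes the usual TreeTagger layout with bin, cmd and lib sub-folders below the root, which may differ on your system.

```r
# Check that a TreeTagger root folder has the usual sub-folders.
# check_treetagger_dir() is a made-up helper; the bin/cmd/lib layout is
# an assumption about a standard TreeTagger installation.
check_treetagger_dir <- function(path) {
  subdirs <- c("bin", "cmd", "lib")
  found <- dir.exists(file.path(path, subdirs))
  names(found) <- subdirs
  found
}

# returns a named logical vector, one entry per sub-folder
print(check_treetagger_dir("/opt/treetagger"))
```

if any entry is FALSE, the path in TT.options probably points to the wrong folder.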