POS tagging of a corpus object

unDocUMeantIt / koRpus

An R Package for Text Analysis

GNU General Public License v3.0

POS tagging of a corpus object #8

Closed stefan-mueller closed 7 years ago

stefan-mueller commented 7 years ago

First of all, thanks for developing this package!

I am currently working on POS tagging of text corpora in several languages. For German and English I use a combination of spacyr and quanteda.

For additional languages, I would like to use koRpus and TreeTagger. Is there a way to perform POS tagging directly on the text field in a corpus object? Or do you have a script that extracts the text field of a corpus for each document and applies POS tagging in a loop?

Thanks a lot for your help. Stefan

unDocUMeantIt commented 7 years ago

hi stefan,

funny, i'll meet ken later this week and guess we'll be talking a lot about compatibility between packages ;-)

that said, there's currently no clearly defined path to combining our packages. but since you're about to be using the treetag() function, you can use the option format="obj" to directly tag text in a character vector. as long as you can get that out of a given object, you can tag it. if you need the result in a data.frame format, call taggedText() on the result (the develop branch also supports subsetting and replacements using [ and [[).

the tm.plugin.koRpus package extends koRpus' possibilities to analyze a full corpus instead of single texts.
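
The approach described above (treetag() with format="obj", then taggedText() on the result) could be sketched roughly as follows. This is a minimal sketch, not the package author's script. It assumes a local TreeTagger installation. The helper name `tag_texts`, the `tt_path` argument and the `source_id` column are hypothetical names introduced for illustration, and the need to load koRpus.lang.en separately depends on the installed koRpus version.

```r
library(koRpus)
library(koRpus.lang.en)  # recent koRpus releases ship language support separately

# Tag each element of a character vector and label its rows with the
# element's name (or its position, if the vector is unnamed).
# `tt_path` is an assumption: the root directory of a local TreeTagger install.
tag_texts <- function(texts, tt_path, lang = "en") {
  ids <- names(texts)
  if (is.null(ids)) ids <- as.character(seq_along(texts))
  tagged <- lapply(seq_along(texts), function(i) {
    obj <- treetag(
      texts[[i]],
      treetagger = "manual",
      lang = lang,
      TT.options = list(path = tt_path, preset = lang),
      format = "obj"            # tag the character vector directly, not a file
    )
    df <- taggedText(obj)       # data.frame with one row per token
    df$source_id <- ids[i]      # hypothetical identifier column
    df
  })
  do.call(rbind, tagged)
}

# Example call (requires TreeTagger; the path is a placeholder):
# texts <- c(s1 = "This is a sentence.", s2 = "Here is another one.")
# tagged_df <- tag_texts(texts, tt_path = "~/treetagger")
```

Tagging one text per call keeps the identifier bookkeeping trivial, at the cost of starting TreeTagger once per text. Any corpus class works as input as long as its texts can be extracted as a character vector.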

stefan-mueller commented 7 years ago

Great, thanks a lot for your reply. I heard about the get-together, and hope you come up with ideas for how to combine the strengths of each package.

I will follow your guidelines above and try to come up with a MWE that tags text either from a tm or quanteda corpus. We might add this to your vignette afterwards – I'm probably not the only one who faces this problem. What I need is a unique identifier for each tagged document (in my case: sentence) in the resulting data frame because I need to count the occurrences of certain POS tags for each document. I got this working with spacyr already, and I will try to come up with a solution using koRpus, and get back to you afterwards.

unDocUMeantIt commented 7 years ago

try summary() on a tagged object.
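
For per-document counts specifically, one option alongside summary() is to cross-tabulate the tag column of a tagged data frame against a document identifier. The sketch below uses a small hand-built data frame as a stand-in for tagger output, so it runs without TreeTagger. The column names `source_id`, `token` and `tag` and the toy tags are assumptions for illustration, not actual koRpus output.

```r
# Toy stand-in for a tagged data frame; column names and tags are assumptions.
tagged_df <- data.frame(
  source_id = c("s1", "s1", "s1", "s2", "s2"),
  token     = c("This", "is", "fine", "Cats", "sleep"),
  tag       = c("DT", "VBZ", "JJ", "NNS", "VBP"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: one row per document, one column per POS tag.
tag_counts <- table(source_id = tagged_df$source_id, tag = tagged_df$tag)

# Count of a single tag in a single document.
tag_counts["s1", "DT"]

# Long format (one row per document/tag pair), handy for merging or plotting.
counts_long <- as.data.frame(tag_counts, responseName = "n")
```

Tags that never occur in a document show up with a count of zero, so every document gets the same set of columns.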

unDocUMeantIt commented 7 years ago

looks like this is resolved for now.