unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

Missing tags for Danish #19

Closed epetrovski closed 4 years ago

epetrovski commented 5 years ago

Hi,

I'm trying to apply POS tagging to a sample of Danish text with TreeTagger and the "official" Danish parameter file. However, the tags it produces are not recognized by koRpus.

I'm wondering if this is simply due to the fact that there is no Danish language pack?

library("koRpus")
library("koRpus.lang.en")

treetag("~/bin/treetagger/test.txt",
        treetagger = "manual",
        lang = "en",
        TT.options = list(path = "~/treetagger",
                          tokenizer = "tree-tagger-danish",
                          tagger = "tree-tagger",
                          params = "danish.par",
                          abbrev = "danish-abbreviations")
        )
#> Warning: Invalid tag(s) found: PM:s-un:--:----, VF:----:sa:----, PI:s-uc:--:----, AD:----:--:p---, AC:siu§:--:p---, NC:siuc:--:----, T-:----:--:----, NC:siun:--:----
#>   This is probably due to a missing tag in kRp.POS.tags() and
#>   needs to be fixed. It would be nice if you could forward the
#>   above warning dump as a bug report to the package maintaner!
#>   doc_id token             tag lemma lttr   wclass desc stop stem idx sntc
#> 1   <NA> Dette PM:s-un:--:---- denne    5  unknown   NA   NA   NA   1    1
#> 2   <NA>    er VF:----:sa:----  være    2  unknown   NA   NA   NA   2    1
#> 3   <NA>    en PI:s-uc:--:----    en    2  unknown   NA   NA   NA   3    1
#> 4   <NA> meget AD:----:--:p--- megen    5  unknown   NA   NA   NA   4    1
#> 5   <NA>  kort AC:siu§:--:p---  kort    4  unknown   NA   NA   NA   5    1
#> 6   <NA> tekst NC:siuc:--:---- tekst    5  unknown   NA   NA   NA   6    1
#> 7   <NA>    på T-:----:--:----    på    2  unknown   NA   NA   NA   7    1
#> 8   <NA> dansk NC:siun:--:---- dansk    5  unknown   NA   NA   NA   8    1
#> 9   <NA>     .            SENT     .    1 fullstop   NA   NA   NA   9    1

unDocUMeantIt commented 5 years ago

> I'm wondering if this is simply due to the fact that there is no Danish language pack?

yes, the language packages add the tagset definitions to koRpus. it doesn't have to be a package, though: a call to the function set.lang.support() that defines all tags is enough. in fact, the language packages simply include these calls and nothing else.

set.lang.support() has three "modes": adding the language-specific parts of the TreeTagger script, adding all possible POS tags, and adding the hyphenation patterns. i can easily do the first and last, but adding the POS tags requires someone who actually understands the language. if you want, i can explain what you'd need to do, and we could have a language package for danish in no time ;)

unDocUMeantIt commented 5 years ago

as a shortcut, have a look at the dutch language support file. the critical part is the portion between lines 114 and 173. the first block (lines 56 to 95) i can take from TreeTagger's batch files.
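
For illustration, the POS tag part of such a language support file might look roughly like the sketch below. It only builds the tag table in base R. The three-column layout (tag, word class, description), the "da" language code and the commented-out set.lang.support() call are assumptions modelled on how the existing language packages appear to be structured, and the word classes and descriptions are placeholders, not values from the Danish tagset documentation.

```r
# tags copied from the warning in the original report
danish_tags <- c(
  "PM:s-un:--:----", "VF:----:sa:----", "PI:s-uc:--:----", "AD:----:--:p---",
  "AC:siu§:--:p---", "NC:siuc:--:----", "T-:----:--:----", "NC:siun:--:----"
)

# one row per tag; wclass and desc are PLACEHOLDERS that someone who
# knows the Danish tagset would have to replace with real values
tag_defs <- cbind(
  tag    = danish_tags,
  wclass = "placeholder",
  desc   = "placeholder description"
)

# every tag must be defined exactly once
stopifnot(!anyDuplicated(tag_defs[, "tag"]))

# assumed shape of the registration call (not verified here);
# punctuation and sentence-ending tags would go into their own tables:
# koRpus::set.lang.support(
#   target = "kRp.POS.tags",
#   value  = list("da" = list(tag.class.def.words = tag_defs))
# )

print(tag_defs)
```

Since the Danish tags combine a word class prefix with several feature fields, the real table would probably need a row for every combination the parameter file can emit, not just the eight seen in this short sample.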

epetrovski commented 5 years ago

Okay, I'll take a swing at this. Is it okay if I just clone the Dutch language support repo and try and fill in what I can?

I'm basing the tags on the documentation for the Danish parameter file, which is excellent: https://korpus.dsl.dk/clarin/corpus-doc/pos-design.pdf

unDocUMeantIt commented 5 years ago

sure, that should work. i can copy&paste what i need afterwards and set up a new package repo.

if you can, you should then test the new package on a larger amount of text to check if all tags are defined. it has happened before that parameter files return tags that are not part of the official documentation.

thanks for doing this!
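
A coverage check like the one suggested above could be sketched in base R as follows. The data frame is a small stand-in for the tagged output shown in the original report, and undefined_tags() is a hypothetical helper, not part of koRpus. With a real tagged object, the data frame would presumably come from something like taggedText(); treat that as an assumption.

```r
# minimal stand-in for the tagged output shown in the report
tagged <- data.frame(
  token  = c("Dette", "er", "tekst", "dansk", "."),
  tag    = c("PM:s-un:--:----", "VF:----:sa:----", "NC:siuc:--:----",
             "NC:siun:--:----", "SENT"),
  wclass = c("unknown", "unknown", "unknown", "unknown", "fullstop"),
  stringsAsFactors = FALSE
)

# hypothetical helper: count how often each tag was left without a
# word class, i.e. was not covered by the tag definitions
undefined_tags <- function(df) {
  unknown <- df$tag[df$wclass == "unknown"]
  sort(table(unknown), decreasing = TRUE)
}

print(undefined_tags(tagged))
```

Running this over a larger corpus and sorting by frequency would show which missing tags matter most, including any that the parameter file emits but the documentation does not list.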

unDocUMeantIt commented 4 years ago

are you still working on this? if not i'll close this issue.

epetrovski commented 4 years ago

Sorry. Never found the time and moved on to another project.