Closed epetrovski closed 4 years ago
I'm wondering if this is simply due to the fact that there is no Danish language pack?
yes, the language packages add the tagset definitions to koRpus. it doesn't have to be a package though; a call of the function set.lang.support() with all tags explained is enough. in fact, the language packages simply include these calls and nothing else.
set.lang.support() has three "modes" -- adding the language-specific parts of the TreeTagger script, adding all possible POS tags, and adding the hyphenation patterns. i can easily do the first and last, but adding the POS tags requires someone who actually understands the language. if you want i can explain what you'd need to do, and we could have a language package for danish in no time ;)
as a shortcut, have a look at the dutch language support file. the critical part is the portion between lines 114 and 173. the first block (lines 56 to 95) i can take from TreeTagger's batch files.
Okay, I'll take a swing at this. Is it okay if I just clone the Dutch language support repo and try and fill in what I can?
I'm basing the tags off of the documentation for the Danish parameter file which is excellent: https://korpus.dsl.dk/clarin/corpus-doc/pos-design.pdf
sure, that should work. i can copy&paste what i need afterwards and set up a new package repo.
if you can, you should then test the new package on a larger amount of text to check if all tags are defined. it happened before that the parameter files return tags that are not part of the official documentation.
thanks for doing this!
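a rough way to run the tag check suggested above, outside of koRpus: the sketch below is plain Python and assumes TreeTagger's usual tab-separated output (token, tag, lemma on each line). the function name `undefined_tags` and the tag names in the sample are made up for illustration; they are not the Danish tagset and not part of koRpus or TreeTagger.

```python
from collections import Counter


def undefined_tags(tagged_lines, known_tags):
    """Count tags in TreeTagger-style output that are missing from known_tags.

    Assumes each line looks like "token<TAB>tag<TAB>lemma".
    """
    missing = Counter()
    for line in tagged_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        tag = fields[1]
        if tag not in known_tags:
            missing[tag] += 1
    return missing


# hypothetical tag names, for illustration only
known = {"N", "V", "ADJ"}
sample = ["hus\tN\thus", "løber\tV\tløbe", "hurtigt\tXYZ\thurtig"]
print(undefined_tags(sample, known))
```

any tag that shows up in the result would still need a definition before the language support is complete. running this over a large tagged corpus gives a frequency count, so the most common gaps can be handled first.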
are you still working on this? if not i'll close this issue.
Sorry. Never found the time and moved on to another project.
Hi,
I'm trying to apply POS-tagging to a sample of Danish text with TreeTagger and the "official" Danish parameter file. However, it seems that the tags produced are not recognized by koRpus.