stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Parser for Multilingual support - Danish #683

Closed Sudhir1000 closed 6 years ago

Sudhir1000 commented 6 years ago

Hi Team,

Thanks for the support! I am trying to use Danish for my NLP project but I can't find a parser for it. I can see multilingual support for French, German, etc., but not for Danish. Is there a way to do Danish parsing, or do I need to add a library for it?

Please advise; it would be really appreciated. Thanks, Sudhir (CBS, Denmark)

J38 commented 6 years ago

Do you want a constituency parser or a dependency parser? We don't have anything for constituency parsing. If you want a Danish dependency parser, I believe there is data for that, and I could potentially train one for you.

Sudhir1000 commented 6 years ago

Thanks for the reply. I need a Danish dependency parser, and it would be helpful if you could provide a trained one. Cheers.

J38 commented 6 years ago

So I won't really have time to train a Danish model for you, but I can show you the command to train one yourself and point you to the location of Danish resources:

Here is a command I used to train a French model (NOTE: I'm not sure of the right amount of memory to use, so I just set an absurdly large number):

java -Xmx70g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile fr-ud-train.conllu -devFile fr-ud-dev.conllu -model french-UD-parser.txt.gz -embedFile wiki.fr.vec -embeddingSize 300 -tlp edu.stanford.nlp.trees.international.french.FrenchTreebankLanguagePack -cPOS
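
Adapting that for Danish, the command might look roughly like the following. This is only a sketch: the file names da-ud-train.conllu, da-ud-dev.conllu, wiki.da.vec, and danish-UD-parser.txt.gz are assumptions based on the usual Universal Dependencies and fastText naming (they may differ depending on the release you download), and I've dropped -cPOS and -tlp for the reasons in the comments below (there is no Danish TreebankLanguagePack in CoreNLP):

java -Xmx70g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile da-ud-train.conllu -devFile da-ud-dev.conllu -model danish-UD-parser.txt.gz -embedFile wiki.da.vec -embeddingSize 300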

Some comments:

  1. The -cPOS flag indicates that you'd like the model to use part-of-speech tags. You will also have to train a Danish part-of-speech tagger if you want to use that. For your first attempt, you might want to leave this option out.

  2. You can find the Danish equivalent of the word embeddings I used here: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

  3. You should probably just use the English treebank language pack, or otherwise study how the FrenchTreebankLanguagePack is structured and try to replicate it for Danish. It is important that you tokenize your text the same way it is tokenized in the dependency parsing data. I am not sure what issues arise when handling Danish.

  4. The Danish dependency parse training data is located here: http://universaldependencies.org/
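
Once you have a trained model, loading it in Java works the same as for any other nndep model. Here is a minimal sketch: the model and tagger file names are placeholders for whatever your training runs produce, the tagger is only needed if you also trained a Danish POS tagger (see comment 1), and note the tokenization caveat from comment 3, since DocumentPreprocessor defaults to English tokenization:

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;

public class DanishDependencyParserDemo {
  public static void main(String[] args) {
    // File names are placeholders: use whatever your training runs produced
    DependencyParser parser = DependencyParser.loadFromModelFile("danish-UD-parser.txt.gz");
    MaxentTagger tagger = new MaxentTagger("danish.tagger");  // a Danish tagger you trained yourself

    String text = "Jeg bor i København.";
    // NOTE: DocumentPreprocessor tokenizes with the English tokenizer by default;
    // make sure this matches how the UD Danish data is tokenized (see comment 3)
    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text));
    for (List<HasWord> sentence : dp) {
      List<TaggedWord> tagged = tagger.tagSentence(sentence);
      GrammaticalStructure gs = parser.predict(tagged);
      System.out.println(gs.typedDependencies());
    }
  }
}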

Also, at some point this year, we'd like to release a next-generation dependency parser in Python, and I think it would handle A LOT of languages, presumably including Danish.