huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
MIT License

question: German language support? #232

Open Rainer-Kempkes opened 4 years ago

Rainer-Kempkes commented 4 years ago

Hi, will there be support for the German language, and if so, when? Best, Rainer

svlandeg commented 4 years ago

In principle the training algorithm is language-agnostic, and you can train a model on any language you have annotated data for, cf. https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/train/training.md#train-on-a-new-language

With respect to having a pretrained German model available, it'll depend on whether we can identify a public dataset we can use for this purpose. I haven't looked into that yet.
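
For reference, once a pretrained German model exists, the end-user API should look the same as for English. Below is a minimal sketch, assuming spaCy's de_core_news_sm pipeline; note that neuralcoref.add_to_pipe() attaches the English-trained weights by default, so a real German setup would need retrained weights in their place.

```python
# Sketch only: assumes a set of retrained German neuralcoref weights exists.
import spacy
import neuralcoref

nlp = spacy.load("de_core_news_sm")   # German spaCy pipeline
neuralcoref.add_to_pipe(nlp)          # attach the coreference component

doc = nlp("Angela Merkel besuchte Paris. Sie traf dort den Präsidenten.")
if doc._.has_coref:
    for cluster in doc._.coref_clusters:
        # Print the most representative mention and all mentions in the cluster.
        print(cluster.main, "<-", cluster.mentions)
```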

chieter commented 4 years ago

There is this corpus: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2614. While it was originally created to aid machine translation and the translation of coreferences between English and German, it could be used to train neuralcoref. I have already extracted the German part and reformatted it into the CoNLL format used by neuralcoref. The training is giving me some trouble, though.

There are several problems. For one, I lack the hardware to train a big ML model like neuralcoref (I tried training it on a Google Colab instance, but I keep running into the 12-hour timeout). I am also unsure about my choice of embeddings: my German word2vec embedding files are about 2 GB (300 dimensions), which is roughly 50 times the size of the embeddings in the neuralcoref repository, and I don't know whether that is too big. Furthermore, I am not really sure what to do with the static/tuned split of the embeddings. I reckon the tuned embeddings will be a byproduct of the trained coref model, but I'm really not sure.

Then there is the problem of corpus size: this corpus contains 2425 annotated coreference chains for German, and I'm not sure that is enough.

Any help resolving these issues so that I can train a model would be greatly appreciated :)
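
For concreteness, here is a minimal sketch of how word2vec vectors could be dumped into the <name>_embeddings.npy / <name>_vocabulary.txt file pairs that the training code appears to load. The naming follows the files in the repository's train/weights directory; whether this is exactly the format the trainer expects is an assumption, so please double-check against the code in neuralcoref/train.

```python
# Sketch only (assumption): export a gensim word2vec model into the
# <name>_embeddings.npy / <name>_vocabulary.txt pairs used by the trainer.
# Requires gensim >= 4 and numpy.
import numpy as np
from gensim.models import KeyedVectors

def export_embeddings(kv: KeyedVectors, prefix: str) -> None:
    # One embedding matrix (n_words x dim) plus one word per line.
    np.save(prefix + "_embeddings.npy", kv.vectors.astype(np.float32))
    with open(prefix + "_vocabulary.txt", "w", encoding="utf-8") as f:
        for word in kv.index_to_key:
            f.write(word + "\n")

# "de_word2vec_300d.bin" is a placeholder for the 300-dim German vectors.
kv = KeyedVectors.load_word2vec_format("de_word2vec_300d.bin", binary=True)

# Static embeddings stay fixed during training; the tuned ones get updated,
# so initializing them as a copy of the static vectors is one option (assumption).
export_embeddings(kv, "static_word")
export_embeddings(kv, "tuned_word")
```

The export step would also be a natural place to prune the vocabulary to words that actually occur in the corpus, which should shrink the 2 GB files considerably.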

EricLe-dev commented 4 years ago

@chieter How did you construct your tuned_word_embeddings for the training? Could you please share them with me?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ChrisDelClea commented 3 years ago

Any updates on this?

mt0rm0 commented 1 year ago

I would also be interested.