Open Rainer-Kempkes opened 4 years ago
In principle the training algorithm is language-agnostic, and you can train a model on any language you have annotated data for, cf. https://github.com/huggingface/neuralcoref/blob/master/neuralcoref/train/training.md#train-on-a-new-language
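For reference, the training pipeline described in that document boils down to two steps: preprocess the CoNLL-formatted data, then run the learner. A rough command sketch (directory paths are placeholders; check training.md for the exact flags your neuralcoref version expects):

```shell
# Preprocess each split of the CoNLL-formatted corpus into numpy arrays
python -m neuralcoref.train.conllparser --path ./data/train/
python -m neuralcoref.train.conllparser --path ./data/dev/

# Train the coreference model on the preprocessed data
python -m neuralcoref.train.learn --train ./data/train/ --eval ./data/dev/
```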
With respect to having a pretrained German model available, it'll depend on whether we can identify a public dataset we can use for this purpose. I haven't looked into that yet.
There is this corpus: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2614. While it was initially created to aid machine translation and the translation of coreferences between English and German, it could be used to train neuralcoref. I have already extracted the German part and reformatted it into the CoNLL format used by neuralcoref. The training gives me some trouble, though.
There are several problems. For one, I lack the hardware to train a large ML model like neuralcoref (I tried training on a Google Colab instance, but I keep running into the 12-hour timeout). I am also unsure about my choice of embeddings: my German word2vec embedding files are about 2 GB (300 dimensions). I don't know whether that is too big, as it is about 50 times the size of the embeddings in the neuralcoref repository. Furthermore, I'm not sure what to do with the static/tuned split of the embeddings. I assume the tuned embeddings are a byproduct of the trained coref model, but I'm really not sure.
Then there is the problem of corpus size: this corpus contains 2,425 annotated coreference chains for German. I'm not sure that is enough.
Any help resolving these issues so that I can train a model would be greatly appreciated :)
How did you construct your tuned_word_embeddings for training? Could you please share them with me?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
any updates on this?
I would also be interested.
Hi, will there be support for the German language, and if so, when? Best, Rainer