Closed Tpt closed 6 years ago
Hi Thomas, yes, this is indeed a gap, which it would be nice to address. Thanks for your contribution that fills it in the meantime. At the end of the day, this reflects that CoreNLP is mainly getting dual purpose value from our research projects, and there is only very limited labor available to directly improve CoreNLP. However, fortunately, with the upcoming CoNLL 2017 shared task on UD parsing, there are likely to be good opportunities for us to build UD POS taggers for various languages....
@Tpt have you gotten reasonable French results with your workaround? I have built a new part-of-speech tagger in TensorFlow; it might be interesting to train a UD French part-of-speech model. I think it'd be cool to fix this French issue for Stanford CoreNLP 3.8.0.
@J38 Yes, French results are fairly good (but not amazing). See this live version: http://corenlp.askplatyp.us/1.7
> I have built a new part-of-speech tagger in TensorFlow, it might be interesting to train a UD French part-of-speech model.
Great!
OK, I have made a first attempt at training a French UD POS tagging model.
It is available in the French models jar at this path: edu/stanford/nlp/models/pos-tagger/french/french-ud.tagger
(You will need to download the latest French models jar from the main GitHub page.) It was trained on some French UD treebank data we have, and it reached 94.4% accuracy on a corresponding French UD treebank test set.
If you have a chance, @Tpt, would you please try it out? Over the next few weeks we will probably work on making better models for French and Spanish to include in the Stanford CoreNLP 3.8.0 release! This model was trained with a CRF and works with the Java code.
I am going to review the landscape and figure out what the canonical French POS tagging and NER datasets and results currently are. If you happen to know of any papers or systems to look at, that would be very helpful. Soon I want to try training my Bi-LSTM model for French and Spanish and release it with close to state-of-the-art results.
@J38 Thank you very much for working on another model! How can I run it? I tried to replace the default model with this new one in the CoreNLP server config [1], but the server still seems to serve the previous tagger. Have I done something wrong?
As for French datasets, to my knowledge there are:
[1] https://github.com/askplatypus/CoreNLP/commit/f48c05e990eab5a2e4ed1abb00d804af4e814879
Hello! I just read this thread and I have questions:
Regards,
Vincent
Hi @Tpt, I'm sorry it's taken so long to get back to you on this! Start your server with this option added: -serverProperties server_properties.prop
and create a file called server_properties.prop
with these properties:
annotators = tokenize, ssplit, pos, depparse
tokenize.language = fr
pos.model = edu/stanford/nlp/models/pos-tagger/french/french-ud.tagger
parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz
# dependency parser
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_French.gz
depparse.language = french
This should start the server with those specific properties and load the new UD French POS tagger.
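For anyone scripting this setup, here is a minimal Python sketch that writes the properties file listed above and builds the launch command. The property values and the -serverProperties option come from this thread; the classpath "*" and the 4g heap size are assumptions that depend on where your CoreNLP and French models jars live. The command is only printed, not executed.

```python
import shlex
from pathlib import Path

# Property values copied from the comment above.
SERVER_PROPERTIES = {
    "annotators": "tokenize, ssplit, pos, depparse",
    "tokenize.language": "fr",
    "pos.model": "edu/stanford/nlp/models/pos-tagger/french/french-ud.tagger",
    "parse.model": "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz",
    "depparse.model": "edu/stanford/nlp/models/parser/nndep/UD_French.gz",
    "depparse.language": "french",
}


def write_properties(path, properties):
    """Write 'key = value' lines in Java .properties style."""
    lines = [f"{key} = {value}" for key, value in properties.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def server_command(properties_path, classpath="*", memory="4g"):
    """Build the argument list for launching the server.

    The classpath and memory defaults are assumptions; adjust them to your setup.
    """
    return [
        "java", f"-Xmx{memory}", "-cp", classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-serverProperties", str(properties_path),
    ]


if __name__ == "__main__":
    write_properties("server_properties.prop", SERVER_PROPERTIES)
    # shlex.join quotes the "*" so the printed line can be pasted into a shell.
    print(shlex.join(server_command("server_properties.prop")))
```

Run it from the directory that holds the CoreNLP jars, then paste the printed command into a shell.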
@up4
2 & 3. I believe currently things have been trained on UD 1.3. Eventually we will retrain things on UD 2.0. There is a new dependency parser that we will be trying to integrate into CoreNLP, and that will probably use the UD 2.0 dependencies.
@J38
1) OK, thanks! I ended up getting them to work and experimented with both on French Canadian news sources…
2-3) Is there something I can read or someone I can talk to in order to learn how to retrain the French CoreNLP models on UD 2.0 myself?
4) A quick search for "french coref data set" got me the link for ANCOR_Centre, but I think it's spoken French only. I would like to create my own UD 2.0 entry from written (and eventually spoken) French Canadian news sources. How do I make sure it's properly "coref-annotated" so it's usable to train a French NER model?
Thanks for your time.
P.S. I found this article from 2010 about the lack of a satisfactory French coref training corpus.
This command should allow for training a new model:
java -Xmx4g edu.stanford.nlp.tagger.maxent.MaxentTagger -props french-ud.tagger.props
You can find the props file used here:
https://github.com/stanfordnlp/CoreNLP/blob/master/scripts/pos-tagger/french-ud.tagger.props
You will of course need to change the path to wherever you download and store the french-ud data. Note also that the data is in .conll format, so it should be one token per line, each line of the form word\tpos_tag, with sentences separated by a blank line.
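If your data is a UD treebank in the standard 10-column CoNLL-U format, it needs to be reduced to the two-column layout described above. The following is a minimal Python sketch of that conversion, not the script used for the released model. It assumes you want the FORM and UPOS columns, and that comment lines, multiword-token ranges (IDs like 3-4), and empty nodes (IDs like 5.1) should be skipped; whether the released model was trained on exactly this tokenization is not stated in this thread.

```python
def conllu_to_tagger_format(conllu_text):
    """Convert CoNLL-U text to 'word<TAB>tag' lines, one blank line between sentences."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # malformed line, ignored in this sketch
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # multiword-token range or empty node (assumption: skip both)
        current.append(f"{form}\t{upos}")
    if current:
        sentences.append(current)
    return "\n\n".join("\n".join(sentence) for sentence in sentences) + "\n"


# A small hand-written sample, not taken from any treebank.
SAMPLE = "\n".join([
    "# text = Il parle du chat.",
    "1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tparle\tparler\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tdu\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tde\tde\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tle\tle\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tchat\tchat\tNOUN\t_\t_\t2\tnmod\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
    "1\tMerci\tmerci\tINTJ\t_\t_\t0\troot\t_\t_",
    "",
])

if __name__ == "__main__":
    print(conllu_to_tagger_format(SAMPLE), end="")
```

Write the result to the file that the props file points at as its training data.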
Hi,
I have a question regarding your French UD model. Here you mentioned that you trained it on “some French UD treebank data” you had. Also, the script refers to ‘french-ud-train.conll’. May I ask which of the following French UD treebanks you used to build your french-ud model? http://universaldependencies.org/fr/index.html
As an example, it would be nice to have some stats about these models, as spaCy provides (I’ve never used it myself, it just looks informative): https://spacy.io/models/fr
The French POS tagger provided by CoreNLP outputs French Treebank POS tags, whereas the French dependency parser has been trained with Universal Dependencies POS tags. So it is not possible to use the CoreNLP POS tagger as input to the CoreNLP dependency parser.
I have written a hack to CoreNLP in order to make the French POS tagger output UD POS tags: https://github.com/askplatypus/CoreNLP/commit/e6215bdc5d4903bc3e2d2fb533da7e3938fa825f
See also: http://stackoverflow.com/questions/36634101/dependency-parsing-for-french-with-corenlp and https://mailman.stanford.edu/pipermail/java-nlp-user/2016-April/007560.html
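To illustrate the kind of tag conversion involved, here is a small Python sketch of a lookup from French Treebank tags to UD POS tags. It is a partial, illustrative table written for this example and is not the mapping used in the linked commit. The names FTB_TO_UD and to_ud are hypothetical, and the fallback tag X for unmapped tags is an assumption.

```python
# Partial, illustrative mapping from French Treebank tags to UD v1 POS tags.
# This is an assumption for demonstration, not the table from the linked commit.
FTB_TO_UD = {
    "NC": "NOUN",
    "NPP": "PROPN",
    "ADJ": "ADJ",
    "ADV": "ADV",
    "DET": "DET",
    "P": "ADP",
    "CC": "CONJ",    # UD v1 name; UD v2 renamed this tag to CCONJ
    "CS": "SCONJ",
    "PRO": "PRON",
    "CLS": "PRON",
    "I": "INTJ",
    "PONCT": "PUNCT",
    "V": "VERB",     # lossy: the source tag does not separate auxiliaries (AUX)
    "VINF": "VERB",
    "VPP": "VERB",
}


def to_ud(tagged_tokens, fallback="X"):
    """Map (word, FTB tag) pairs to (word, UD tag) pairs; unknown tags get the fallback."""
    return [(word, FTB_TO_UD.get(tag, fallback)) for word, tag in tagged_tokens]


if __name__ == "__main__":
    print(to_ud([("Le", "DET"), ("chat", "NC"), ("dort", "V"), (".", "PONCT")]))
```

A one-to-one lookup like this cannot recover distinctions the source tagset does not make, such as VERB versus AUX, which is one reason a tagger trained directly on UD data is preferable.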