stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

French POS tagger outputs French Treebank POS tags and dependency parsing expects Universal Dependencies tags #312

Closed Tpt closed 6 years ago

Tpt commented 7 years ago

The French POS tagger provided by CoreNLP outputs French Treebank POS tags, but the French dependency parser has been trained with Universal Dependencies POS tags. As a result, the output of the CoreNLP POS tagger cannot be used as input to the CoreNLP dependency parser.

I have written a hack to CoreNLP in order to make the French POS tagger output UD POS tags: https://github.com/askplatypus/CoreNLP/commit/e6215bdc5d4903bc3e2d2fb533da7e3938fa825f

See also: http://stackoverflow.com/questions/36634101/dependency-parsing-for-french-with-corenlp and https://mailman.stanford.edu/pipermail/java-nlp-user/2016-April/007560.html

manning commented 7 years ago

Hi Thomas, yes, this is indeed a gap, which it would be nice to address. Thanks for your contribution that fills it in the meantime. At the end of the day, this reflects that CoreNLP is mainly getting dual purpose value from our research projects, and there is only very limited labor available to directly improve CoreNLP. However, fortunately, with the upcoming CoNLL 2017 shared task on UD parsing, there are likely to be good opportunities for us to build UD POS taggers for various languages....

J38 commented 7 years ago

@Tpt have you gotten reasonable French results with your workaround? I have built a new part-of-speech tagger in TensorFlow, it might be interesting to train a UD French part-of-speech model. I think it'd be cool to fix this French issue for Stanford CoreNLP 3.8.0.

Tpt commented 7 years ago

@J38 Yes, French results are fairly good (but not amazing). See this live version: http://corenlp.askplatyp.us/1.7

> I have built a new part-of-speech tagger in TensorFlow, it might be interesting to train a UD French part-of-speech model.

Great!

J38 commented 7 years ago

Ok I have made a first attempt at training a French UD POS tagging model.

It is available at this path: edu/stanford/nlp/models/pos-tagger/french/french-ud.tagger in the French models jar. (You will need to download the latest French models jar from the main GitHub page.) This was trained on some French UD treebank data we have. It had a 94.4% accuracy on a corresponding French UD treebank test set.

If you have a chance, @Tpt, would you please try it out? Over the next few weeks we will probably work on making better models for French and Spanish to include in the Stanford CoreNLP 3.8.0 release! This model was trained with a CRF and works with the Java code.

I am going to try to review the landscape and figure out what the canonical French POS tagging and NER datasets/results are currently. If you happen to know of any papers/systems to look at that would be very helpful. Soon I want to try to train my Bi-LSTM model for French and Spanish and release that with close to state of the art results.

Tpt commented 7 years ago

@J38 Thank you very much for having worked on another model! How can I run it? I tried to replace the default model with this new one in the CoreNLP server config [1], but it seems that the server still serves the previous tagger. Have I done something wrong?

As for French datasets, to my knowledge there are:

[1] https://github.com/askplatypus/CoreNLP/commit/f48c05e990eab5a2e4ed1abb00d804af4e814879

up4 commented 7 years ago

Hello! I just read this thread and I have questions:

  1. Is there more than one French model available for CoreNLP 3.8, besides the official one?
  2. What is the status of the CoreNLP French model(s) as far as Universal Dependencies 2.0 is concerned?
  3. Is it the official policy for all foreign language models to support UD 2.0? If so, what is the target CoreNLP version for this multilingual alignment?
  4. Will multilingual UD 2.0 alignment ease implementation of the extra features not currently supported for all languages, including French (NER, IE, CoRef, etc.)? If so, can I help? If so, where do I start?

Regards,

Vincent

J38 commented 6 years ago

Hi @Tpt I'm sorry it's taken so long to get back to you on this! If you start your server with this option added: -serverProperties server_properties.prop and then make a file called server_properties.prop with these properties:

annotators = tokenize, ssplit, pos, depparse
tokenize.language = fr
pos.model = edu/stanford/nlp/models/pos-tagger/french/french-ud.tagger
parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz

# dependency parser
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_French.gz
depparse.language = french

It should start the server with those specific properties and load the new UD French POS tagger.
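
For anyone scripting this setup, here is a minimal Python sketch of the steps above: it writes the properties to a file and builds the server launch command. The property values are the ones listed above. The helper names (`write_properties`, `server_command`), the `-Xmx4g` heap size, the `"*"` classpath and port 9000 are assumptions for illustration, so adjust them to your installation.

```python
import shlex
from pathlib import Path

# Property values copied from the comment above; nothing here is new.
FRENCH_UD_PROPERTIES = {
    "annotators": "tokenize, ssplit, pos, depparse",
    "tokenize.language": "fr",
    "pos.model": "edu/stanford/nlp/models/pos-tagger/french/french-ud.tagger",
    "parse.model": "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz",
    "depparse.model": "edu/stanford/nlp/models/parser/nndep/UD_French.gz",
    "depparse.language": "french",
}


def write_properties(path, properties):
    """Write 'key = value' lines (Java .properties style) and return the text."""
    text = "".join(f"{key} = {value}\n" for key, value in properties.items())
    Path(path).write_text(text, encoding="utf-8")
    return text


def server_command(properties_path, port=9000):
    """Build the launch command as an argument list.

    Assumption: the CoreNLP jars and the French models jar are in the
    current directory, hence the "*" classpath.
    """
    return [
        "java", "-Xmx4g", "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-serverProperties", str(properties_path),
        "-port", str(port),
    ]


if __name__ == "__main__":
    # Print the command only; call write_properties() first in a real setup.
    print(shlex.join(server_command("server_properties.prop")))
```

Calling `write_properties("server_properties.prop", FRENCH_UD_PROPERTIES)` and then running the printed command from the CoreNLP directory should correspond to the manual steps described above.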

J38 commented 6 years ago

@up4

  1. There is only one French model jar released for 3.8.0. It contains two different versions of the French POS tagger, the original and a new one trained on UD data.

2 & 3. I believe currently things have been trained on UD 1.3. Eventually we will retrain things on UD 2.0. There is a new dependency parser that we will be trying to integrate into CoreNLP, and that will probably use the UD 2.0 dependencies.

  4. I would actually say the major obstacle for French NER and Coref is that I am not aware of a data set we could use to train a model. If you know of any publicly available French NER or Coref data sets, that would help and we could start to look at training some models for those tasks.

up4 commented 6 years ago

@J38 1) OK, thanks! I ended up getting them to work and experimenting with both on French Canadian news sources… 2-3) Is there something I can read, or someone I can talk to, in order to learn how to retrain the French CoreNLP models on UD 2.0 myself? 4) A quick search for "french coref data set" got me the link for ANCOR_Centre, but I think it's spoken French only. I would like to create my own UD 2.0 entry from written (and eventually spoken) French Canadian news sources. How do I make sure it's properly "coref-annotated" so that it's usable to train a French NER model?

Thanks for your time.

PS: I found this article from 2010 about the lack of a satisfactory French coref training corpus.

J38 commented 6 years ago

This command should allow for training a new model:

java -Xmx4g edu.stanford.nlp.tagger.maxent.MaxentTagger -props french-ud.tagger.props

You can find the props file used here:

https://github.com/stanfordnlp/CoreNLP/blob/master/scripts/pos-tagger/french-ud.tagger.props

You will of course need to change the path to wherever you download and store the french-ud data. Note also that the data is in .conll format: one token per line, each line of the form word\tpos_tag, with sentences separated by a blank line.
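
If the treebank you download is in CoNLL-U format (ten tab-separated columns, as distributed by Universal Dependencies), it would first need converting to the two-column layout described above. A minimal Python sketch follows. It assumes that the UPOS column is the tag you want, and that multiword-token range lines such as `3-4` and empty nodes such as `5.1` should be dropped in favour of the syntactic words. Whether the released model was trained this way is not stated in this thread, and the function name is hypothetical.

```python
def conllu_to_word_tag(conllu_text):
    """Convert CoNLL-U text to 'word<TAB>tag' lines.

    Sentences are separated by one blank line in the output.
    Assumption: column 2 (FORM) is the word and column 4 (UPOS) is the tag.
    """
    sentences = []
    current = []
    for raw_line in conllu_text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) carry no tokens.
            continue
        columns = line.split("\t")
        token_id, form, upos = columns[0], columns[1], columns[3]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges ("3-4") and empty nodes ("5.1").
            continue
        current.append(f"{form}\t{upos}")
    if current:
        sentences.append(current)
    return "\n\n".join("\n".join(sentence) for sentence in sentences) + "\n"


if __name__ == "__main__":
    # Tiny hand-made sample, built with explicit tabs for readability.
    sample = "\n".join([
        "# text = Il dort.",
        "\t".join(["1", "Il", "il", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
        "\t".join(["2", "dort", "dormir", "VERB", "_", "_", "0", "root", "_", "_"]),
        "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
        "",
    ])
    print(conllu_to_word_tag(sample), end="")
```

The output could then be saved to the file that the training props point at, after checking that the separator and file format settings in your props file match this layout.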

maziyarpanahi commented 5 years ago

Hi,

I have a question regarding your French UD model. Here you mentioned that you trained it on “some French UD treebank data” you had. Also, the script refers to ‘french-ud-train.conll’. May I ask which of the following French UD treebanks you used to build your french-ud model? http://universaldependencies.org/fr/index.html

As an example, it would be nice to have some stats about these models, like the ones spaCy publishes (I’ve never used it myself, it just looks informative): https://spacy.io/models/fr