stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Training own true case models #336

Open aliabbasjp opened 7 years ago

aliabbasjp commented 7 years ago

Hi, the truecasing model seems to have many inconsistencies, so I need to retrain it on my own data. How can I train it using my own training data? I need examples for the following training settings, i.e., what is the format of noUN.input? https://github.com/jnorthrup/stanford-corenlp/blob/master/src/main/resources/edu/stanford/nlp/models/truecase/truecasing.fast.prop

serializeTo=truecasing.fast.qn.ser.gz
trainFileList=/scr/nlp/data/gale/NIST09/truecaser/crf/noUN.input
testFile=/scr/nlp/data/gale/AE-MT-eval-data/mt06/cased/ref0

AngledLuffa commented 7 years ago

If you help us find errors in the data, we can also use them to train better models ourselves, although there is no time frame for when that would happen.

aliabbasjp commented 7 years ago

@AngledLuffa I don't have the data referenced by trainFileList=/scr/nlp/data/gale/NIST09/truecaser/crf/noUN.input

Where can I find it? In my case, the following input:

3. l loss

is changed to:

3. L LOSS

aliabbasjp commented 7 years ago

@AngledLuffa Also, can you guide me on the training data format, or share a sample of the above training file?

AngledLuffa commented 7 years ago

This example seems kind of short, not much to go on. The truecaser is meant for actual sentences.

The input file can't be shared, unfortunately. The format is properly capitalized sentences tokenized like this:

It should be noted that the Iraqi president , whose father died when he was very young , has four half - brothers from his mother who married his uncle .
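The format shown above (properly cased text, whitespace-separated tokens, one sentence per line) can be approximated from raw sentences with a small script. The following is a minimal sketch in Python, not the tokenizer used to build the original data: the regex and the names `to_training_line` and `write_training_file` are assumptions made for illustration, and CoreNLP's own tokenizer will split some cases (contractions, for example) differently.

```python
import re

# Assumption: a crude tokenizer in which each run of word characters is a
# token and every other non-space character becomes its own token.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def to_training_line(sentence: str) -> str:
    """Return one cased sentence as whitespace-separated tokens."""
    return " ".join(_TOKEN.findall(sentence))


def write_training_file(sentences, path):
    """Write one tokenized sentence per line, skipping empty sentences."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            line = to_training_line(sentence)
            if line:
                out.write(line + "\n")
```

Casing is left untouched on purpose, since the cased text is what the model learns from.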

manning commented 7 years ago

Here's the command we used to train our model. The prop file is included in the jar. As John commented, we can't distribute the training data. But it is just already tokenized (whitespace-separated) text, one sentence per line.

truecasing.fast.caseless.qn.ser.gz: truecasing.fast.caseless.prop
	$(JAVA) -mx110g edu.stanford.nlp.ie.crf.CRFClassifier -prop $^ -serializeTo $@ -multiThreadGrad 8 > $(addsuffix .out, $(basename $^)) 2> $(addsuffix .err, $(basename $^))

I agree the data isn't very good. It's decade-old data from an MT project.
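To adapt this to your own data, one option is to copy the prop file out of the jar, point its path settings at your files, and run the same classifier class. The sketch below is a hedged illustration in Python rather than anything shipped with CoreNLP: `override_props` and `training_command` are hypothetical helper names, the parser assumes simple `key=value` lines like the ones quoted earlier in this thread, and `-cp "*"` assumes the CoreNLP jars are in the working directory. Only flags that appear in this thread are used.

```python
def override_props(prop_text: str, overrides: dict) -> str:
    """Replace the values of selected keys in simple key=value properties text.

    Lines whose key is not in `overrides` are copied unchanged. Keys in
    `overrides` that never appear in the text are appended at the end.
    """
    seen = set()
    out = []
    for line in prop_text.splitlines():
        key, sep, _ = line.partition("=")
        key = key.strip()
        if sep and key in overrides:
            out.append(f"{key}={overrides[key]}")
            seen.add(key)
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in overrides.items() if k not in seen)
    return "\n".join(out) + "\n"


def training_command(prop_path, model_path, heap="110g", threads=8, classpath="*"):
    """Build the argument list for a CRFClassifier training run."""
    return [
        "java", f"-mx{heap}", "-cp", classpath,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-prop", prop_path,
        "-serializeTo", model_path,
        "-multiThreadGrad", str(threads),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`, with stdout and stderr redirected to files as the Makefile rule does.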

aliabbasjp commented 7 years ago

@manning What is the maximum training text size for a machine with 350 GB of RAM and 40 cores? I tried a 900 MB file and it crashed after 12 days:

Exception: Java heap space: failed reallocation of scalar replaced objects

Will -mx300g help?
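One way to probe how memory use scales before committing to a multi-day run is to train on progressively larger subsets of the data. The sketch below, in Python, keeps every k-th sentence of a one-sentence-per-line file. The function name `subsample_lines` and the choice of systematic rather than random sampling are assumptions for illustration, and it makes no claim about what size will fit in a given heap.

```python
def subsample_lines(src_path, dst_path, keep_every=10):
    """Copy every `keep_every`-th non-empty line from src to dst.

    Returns the number of lines written.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    seen = 0
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # blank lines carry no sentence
            if seen % keep_every == 0:
                dst.write(line if line.endswith("\n") else line + "\n")
                written += 1
            seen += 1
    return written
```

Calling it with `keep_every=10`, then `5`, then `2` gives training files of roughly 10%, 20%, and 50% of the original size.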

DmitryGrayscale commented 7 years ago

Tried to use my own model:

> java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props English.properties -file input.txt
Setting bias for class INIT_UPPER to 0.0
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
        at edu.stanford.nlp.ie.crf.CRFBiasedClassifier.setBiasWeight(CRFBiasedClassifier.java:112)
        at edu.stanford.nlp.ie.crf.CRFBiasedClassifier.setBiasWeight(CRFBiasedClassifier.java:106)

What am I doing wrong?