stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

How big is your truecase model? #986

Open erksch opened 4 years ago

erksch commented 4 years ago

Hey there!

I've trained a truecase model for German on a dataset of 1 million sentences. The resulting model is quite big (80 MB), and I am running into memory issues when including it in my annotation pipeline. With the English truecase model there are no issues.

How big is your truecase model (edu/stanford/nlp/models/truecase/truecasing.fast.caseless.qn.ser.gz)? And how does the annotator impact the memory consumption of the pipeline?
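
For context, this is roughly how I wire the model into the pipeline; a minimal sketch assuming the standard `truecase.model` property, with my German-specific settings (tagger model etc.) left out and the model path as a placeholder:

```java
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TruecaseDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    // truecase needs tokenize, ssplit, pos and lemma upstream of it
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,truecase");
    // placeholder path to my custom 80 MB German CRF; the default English model is
    // edu/stanford/nlp/models/truecase/truecasing.fast.caseless.qn.ser.gz
    props.setProperty("truecase.model", "german-truecasing.ser.gz");
    // the serialized CRF is deserialized into the heap here, which is where
    // the memory pressure shows up
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("some lowercased input text");
    pipeline.annotate(doc);
    for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
      System.out.println(token.word() + " -> "
          + token.get(CoreAnnotations.TrueCaseTextAnnotation.class));
    }
  }
}
```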

erksch commented 4 years ago

OK, I unpacked your jar and found the answer: truecasing.fast.caseless.qn.ser.gz is 15.8 MB.

But how is that possible? You are training on 4.5 million sentences and I am training on only 1 million. I have exactly the same configuration, except for:

- `useQN=false` (I have `true`)
- `l1reg=1.0` (I don't have this line)

The reason is that I read somewhere that I can only use the QNMinimizer, and training throws an error when I use that configuration.

Is it maybe because German has more distinct words in general (I don't know if that's true)?

erksch commented 4 years ago
This is my whole training configuration:

```
serializeTo=truecasing.fast.caseless.qn.ser.gz
trainFileList=data.train
testFile=data.test
map=word=0,answer=1
wordFunction = edu.stanford.nlp.process.LowercaseFunction
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useLongSequences=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
useOccurrencePatterns=true
useLastRealWord=true
useNextRealWord=true
useDisjunctive=true
disjunctionWidth=5
wordShape=chris2useLC
usePosition=true
useBeginSent=true
useTitle=true
useObservedSequencesOnly=true
saveFeatureIndexToDisk=true
normalize=true
useQN=true
QNSize=25
maxLeft=1
readerAndWriter=edu.stanford.nlp.sequences.TrueCasingForNISTDocumentReaderAndWriter
featureFactory=edu.stanford.nlp.ie.NERFeatureFactory
featureDiffThresh=0.02
```
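
For completeness, training itself goes through the standard CRFClassifier entry point; a minimal sketch assuming the public API, equivalent to running `java edu.stanford.nlp.ie.crf.CRFClassifier -prop truecase-german.prop` (the properties file name is a placeholder):

```java
import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;

public class TrainTruecaser {
  public static void main(String[] args) throws Exception {
    // load the same properties shown above (file name is a placeholder)
    Properties props = StringUtils.propFileToProperties("truecase-german.prop");
    CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
    crf.train();  // reads trainFileList and readerAndWriter from the properties
    crf.serializeClassifier(props.getProperty("serializeTo"));
  }
}
```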
AngledLuffa commented 4 years ago

What happens if you add back the l1reg? That should force weights to 0, which should reduce the size of the final model.
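
Roughly speaking (standard background rather than anything specific to our code), the L1-regularized training objective looks like:

```latex
\min_{w} \; -\log L(w) \;+\; \lambda \lVert w \rVert_1
```

The L1 term drives many individual feature weights to exactly zero rather than merely shrinking them (as an L2 prior would), which is what should make the serialized model smaller; as far as I understand it, lambda here corresponds to the l1reg value.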

Also, I retrained the model recently on 1.5 million sentences, and the resulting model is significantly bigger, at 48 MB.

erksch commented 4 years ago

When I add l1reg I get the following error:

```
Exception in thread "main" edu.stanford.nlp.util.ReflectionLoading$ReflectionLoadingException: Error creating edu.stanford.nlp.optimization.OWLQNMinimizer
    at edu.stanford.nlp.util.ReflectionLoading.loadByReflection(ReflectionLoading.java:38)
    at edu.stanford.nlp.ie.crf.CRFClassifier.getMinimizer(CRFClassifier.java:2003)
    at edu.stanford.nlp.ie.crf.CRFClassifier.trainWeights(CRFClassifier.java:1902)
    at edu.stanford.nlp.ie.crf.CRFClassifier.train(CRFClassifier.java:1742)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:785)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:756)
    at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3011)
Caused by: edu.stanford.nlp.util.MetaClass$ClassCreationException: java.lang.ClassNotFoundException: edu.stanford.nlp.optimization.OWLQNMinimizer
    at edu.stanford.nlp.util.MetaClass.createFactory(MetaClass.java:364)
    at edu.stanford.nlp.util.MetaClass.createInstance(MetaClass.java:381)
    at edu.stanford.nlp.util.ReflectionLoading.loadByReflection(ReflectionLoading.java:36)
    ... 6 more
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.optimization.OWLQNMinimizer
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:582)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:315)
    at edu.stanford.nlp.util.MetaClass$ClassFactory.construct(MetaClass.java:135)
    at edu.stanford.nlp.util.MetaClass$ClassFactory.<init>(MetaClass.java:202)
    at edu.stanford.nlp.util.MetaClass$ClassFactory.<init>(MetaClass.java:69)
    at edu.stanford.nl
```

From what I've read, this configuration tries to use the OWLQNMinimizer, which is not included in the public CoreNLP release, so the class cannot be found.

> Right, we are not licensed to release that optimizer. [...] add the flag useQN=true
>
> It turns out you also need to turn off l1reg (remove the l1reg=... flag) to use the QN implementation. For all I know, turning off the regularization may make the classifier much worse, unfortunately.

From this thread

AngledLuffa commented 4 years ago

Ah, that's a good point. I'll check with our PI to see if things have changed in terms of what we can publicly release.

erksch commented 4 years ago

Nice, thank you very much! Did you use the normal QNMinimizer for your recent retraining?

AngledLuffa commented 4 years ago

Sorry (again) for the late reply. These parameters should work:

`useQN=true`, `useOWLQN=true`, and `priorLambda=` (some hyperparameter)
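
A minimal sketch of what that might look like on top of the earlier training properties; the file name and the priorLambda value below are placeholders to tune, not recommendations:

```java
import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;

public class RetrainWithOwlqn {
  public static void main(String[] args) throws Exception {
    // start from the existing training properties (placeholder file name)
    Properties props = StringUtils.propFileToProperties("truecase-german.prop");
    // the flags discussed above; priorLambda's value is a placeholder hyperparameter
    props.setProperty("useQN", "true");
    props.setProperty("useOWLQN", "true");
    props.setProperty("priorLambda", "1.0");
    CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
    crf.train();
    crf.serializeClassifier(props.getProperty("serializeTo"));
  }
}
```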

erksch commented 4 years ago

Thank you! I think I'll retrain our truecaser in the next few days and give these parameters a try!