tudarmstadt-lt / GermaNER

GermaNER: Free Open German Named Entity Recognition Tool

Error when using multithreading #3

Open katjahauser opened 8 years ago

katjahauser commented 8 years ago

Hello,

I'm running several threads each executing GermaNER and keep getting occasional

java.lang.IllegalStateException: The number of extracted classified labels is not equivalent with the number of instanzes (0!=632)

exceptions (with varying numbers of instances) caused by

    at org.cleartk.ml.crfsuite.CrfSuiteWrapper.classifyFeatures(CrfSuiteWrapper.java:235)
    at org.cleartk.ml.crfsuite.CrfSuiteWrapper.classifyFeatures(CrfSuiteWrapper.java:304)
    at org.cleartk.ml.crfsuite.CrfSuiteStringOutcomeClassifier.classify(CrfSuiteStringOutcomeClassifier.java:79)
    at org.cleartk.ml.CleartkSequenceAnnotator.classify(CleartkSequenceAnnotator.java:191)
    at de.tu.darmstadt.lt.ner.annotator.NERAnnotator.classify(NERAnnotator.java:188)
    at de.tu.darmstadt.lt.ner.annotator.NERAnnotator.process(NERAnnotator.java:178)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
    ... 6 more

I do not encounter this problem when running GermaNER with only one thread.

Having looked at CrfSuiteWrapper, I suspect this error stems from a race condition on the temporary file that is (occasionally?) used. Can you reproduce the error and confirm my assumption? A simple workaround might be to add a random number to the temp file's name so that concurrent runs do not share a file. Do you have other ideas for avoiding this, or any tips for running multiple instances of GermaNER in parallel?
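To make the proposed workaround concrete, here is a minimal sketch in Python (the language of my driver script). It is not the ClearTK code: the function names and the "crfsuite-" prefix are made up for illustration, and the write-then-read step only stands in for "write features, let the classifier read them". The point is just that `tempfile.mkstemp` hands every caller its own file, so concurrent workers cannot overwrite each other's input. I assume the Java side could get the same effect with `File.createTempFile`, but I have not verified how CrfSuiteWrapper names its file.

```python
import os
import tempfile
import threading


def classify_with_private_tempfile(features, results, index):
    """Write features to a uniquely named temp file, then read them back.

    Hypothetical stand-in for the write-then-classify step; no real
    classifier is called here.
    """
    # mkstemp creates the file atomically under a unique name, so two
    # threads can never be handed the same path.
    fd, path = tempfile.mkstemp(prefix="crfsuite-", suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write("\n".join(features))
        with open(path, encoding="utf-8") as handle:
            results[index] = handle.read().split("\n")
    finally:
        os.remove(path)


def run_parallel(batches):
    """Process every batch in its own thread and return the results in order."""
    results = [None] * len(batches)
    threads = [
        threading.Thread(
            target=classify_with_private_tempfile, args=(batch, results, i)
        )
        for i, batch in enumerate(batches)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With a single fixed file name instead of `mkstemp`, one thread could truncate the file while another is still reading it, which would match the "0 labels for 632 instances" symptom, assuming my reading of the wrapper is right.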

Best regards, Katja

seyyaw commented 7 years ago

@katjahauser Can you share the multi-threaded code you use to run GermaNER? This will help us quickly test the bug and find a solution. Thanks

katjahauser commented 7 years ago

Hey there,

Attached is the Python script I used.

It connects to a MongoDB and converts articles into the format GermaNER needs, so there is a little overhead. The most relevant parts for you are probably the method "callGermaNER" (lines 15ff) and the place where I actually call it (lines 142ff). These pieces of code can be extracted and run without the preceding conversion step; I don't recall any dependencies. Note that while the program converts all articles in the collection, it only runs GermaNER on two files (this can be changed in l. 147). You will want to use a larger number of files (e.g. 100) to reproduce the error.
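For readers without the attachment, the calling pattern boils down to something like the sketch below. This is a simplified reconstruction, not the attached script: `run_one`, `run_many` and `germaner_command` are illustrative names, and the jar file name and the `-t`/`-o` flags in `germaner_command` are assumptions that should be checked against the GermaNER README before use.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(command):
    """Run one external command and return (exit code, stderr text)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stderr


def run_many(commands, workers=4):
    """Run all commands concurrently and return the ones that failed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(run_one, commands))
    return [
        (command, stderr)
        for command, (code, stderr) in zip(commands, outcomes)
        if code != 0
    ]


def germaner_command(input_path, output_path):
    # ASSUMPTION: jar name and flags are placeholders, not verified
    # against the GermaNER distribution.
    return ["java", "-jar", "GermaNER.jar", "-t", input_path, "-o", output_path]


if __name__ == "__main__":
    # Harmless stand-in commands so the sketch runs without GermaNER installed.
    demo = [[sys.executable, "-c", "pass"] for _ in range(8)]
    print(run_many(demo))
```

To hunt for the exception, one would pass a list of `germaner_command(...)` calls for around 100 input files to `run_many` and search the returned stderr texts for `IllegalStateException`.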

If you have any questions feel free to contact me.

Best, Katja Hauser



OfferFuture commented 7 years ago

I encountered this problem when compiling cleartk as well: the CrfSuiteClassiferTest failed with this error.