google-code-export / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

Running LuceneNGramUFE #166

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hello guys, 

Has anyone tried to run this feature extractor (LuceneNGramUFE) and had it 
work for them? For me it keeps throwing two errors: 

"org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator 
processing failed."

and 

"java.lang.IllegalArgumentException: value cannot be null"

Original issue reported on code.google.com by alot...@gmail.com on 17 Jul 2014 at 10:46

GoogleCodeExporter commented 9 years ago
Can you provide a stack trace please?

Original comment by richard.eckart on 17 Jul 2014 at 10:47

GoogleCodeExporter commented 9 years ago
Exception in thread "main" 
de.tudarmstadt.ukp.dkpro.lab.engine.ExecutionException: 
de.tudarmstadt.ukp.dkpro.lab.engine.ExecutionException: 
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator 
processing failed.    
    at de.tudarmstadt.ukp.dkpro.lab.engine.impl.ExecutableTaskEngine.run(ExecutableTaskEngine.java:68)
    at de.tudarmstadt.ukp.dkpro.lab.engine.impl.DefaultTaskExecutionService.run(DefaultTaskExecutionService.java:48)
    at de.tudarmstadt.ukp.dkpro.lab.Lab.run(Lab.java:97)
    at de.tudarmstadt.ukp.experiments.AA.VSDtoTC.main.VSD_Runner2.runTrainTest(VSD_Runner2.java:153)
    at de.tudarmstadt.ukp.experiments.AA.VSDtoTC.main.VSD_Runner2.main(VSD_Runner2.java:84)
Caused by: de.tudarmstadt.ukp.dkpro.lab.engine.ExecutionException: 
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator 
processing failed.    
    at de.tudarmstadt.ukp.dkpro.lab.uima.engine.simple.SimpleExecutionEngine.run(SimpleExecutionEngine.java:178)
    at de.tudarmstadt.ukp.dkpro.lab.task.impl.BatchTask.runNewExecution(BatchTask.java:350)
    at de.tudarmstadt.ukp.dkpro.lab.task.impl.BatchTask.executeConfiguration(BatchTask.java:255)
    at de.tudarmstadt.ukp.dkpro.lab.task.impl.BatchTask.execute(BatchTask.java:185)
    at de.tudarmstadt.ukp.dkpro.tc.weka.task.BatchTaskTrainTest.execute(BatchTaskTrainTest.java:86)
    at de.tudarmstadt.ukp.dkpro.lab.engine.impl.ExecutableTaskEngine.run(ExecutableTaskEngine.java:55)
    ... 4 more
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException: 
Annotator processing failed.    
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:394)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:298)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:568)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:410)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:343)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
    at de.tudarmstadt.ukp.dkpro.lab.uima.engine.simple.SimpleExecutionEngine.run(SimpleExecutionEngine.java:141)
    ... 9 more
Caused by: java.lang.IllegalArgumentException: value cannot be null
    at org.apache.lucene.document.Field.<init>(Field.java:239)
    at org.apache.lucene.document.StringField.<init>(StringField.java:60)
    at de.tudarmstadt.ukp.dkpro.tc.features.ngram.meta.LuceneBasedMetaCollector.initializeDocument(LuceneBasedMetaCollector.java:99)
    at de.tudarmstadt.ukp.dkpro.tc.features.ngram.meta.LuceneBasedMetaCollector.process(LuceneBasedMetaCollector.java:112)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:378)
    ... 16 more

Original comment by alot...@gmail.com on 17 Jul 2014 at 10:49

GoogleCodeExporter commented 9 years ago
Looks like the documentTitle is not set in the DocumentMetaData annotation of 
the CAS:

DocumentMetaData.get(jcas).setDocumentTitle(...);
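For illustration, here is a minimal, self-contained reproduction of the failure mode (plain Java, no Lucene or UIMA involved; the null check mirrors the one in Lucene's Field constructor, which receives the document title):

```java
// Hypothetical stand-in for Lucene's Field constructor, which rejects a
// null value -- the same check that fails when the document title is unset.
public class NullTitleDemo {
    static String makeField(String name, String value) {
        if (value == null) {
            throw new IllegalArgumentException("value cannot be null");
        }
        return name + "=" + value;
    }

    public static void main(String[] args) {
        String documentTitle = null; // title never set in DocumentMetaData
        try {
            makeField("title", documentTitle);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints: value cannot be null
        }
    }
}
```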

Original comment by richard.eckart on 17 Jul 2014 at 11:51

GoogleCodeExporter commented 9 years ago
What kind of Preprocessing did you run?

Original comment by daxenber...@gmail.com on 17 Jul 2014 at 11:51

GoogleCodeExporter commented 9 years ago
Nothing, actually (NoOpAnnotator.class).

Also, I set the document title as in:
DocumentMetaData.get(aJCas).setDocumentTitle(document.getName());//Added

Original comment by alot...@gmail.com on 18 Jul 2014 at 8:45

GoogleCodeExporter commented 9 years ago
To run an NGram feature extractor, you need to have at least sentence and token 
annotations in your CAS. If you did not supply them via the reader, you need to 
add preprocessing components which will do the job.
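As a sketch of what such a preprocessing component could look like (assuming DKPro Core's BreakIteratorSegmenter is on the classpath; the concrete segmenter is an illustrative choice, not the only option):

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

public class Preprocessing {
    // Adds Sentence and Token annotations to the CAS so that n-gram
    // feature extractors (and their meta collectors) have something to read.
    public static AnalysisEngineDescription getPreprocessing() throws Exception {
        return createEngineDescription(BreakIteratorSegmenter.class);
    }
}
```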

Original comment by daxenber...@gmail.com on 18 Jul 2014 at 2:38

GoogleCodeExporter commented 9 years ago
The input data is already structured and has annotations such as POS, Lemma, 
and Constituent. What's been taking long is the creation of files and the 
iteration over those files. It seems that unit classification creates a file for 
every classification unit in the data (WSDitem) and then iterates over them in 
the feature extraction part. That iteration reports:
"MetaInfoTask ... Progress 
de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasReader 513/998 file"

However, with more data it fails with out-of-memory errors. Is there some way 
of preventing unit classification from creating these files and doing the meta 
collection in some other way?

Original comment by alot...@gmail.com on 22 Jul 2014 at 9:30

GoogleCodeExporter commented 9 years ago
The meta-extraction task is run in this way for a reason, as this is the only 
way to ensure that there is no information leak between train/test.

In order to determine why exactly you run into memory problems, it would be 
necessary to better understand what is going on. Please profile the memory 
usage and give some more pointers on where the memory is consumed.

Original comment by torsten....@gmail.com on 22 Jul 2014 at 9:37

GoogleCodeExporter commented 9 years ago
What is the version of uimaj-core on your classpath when you get the memory 
problems?

Original comment by richard.eckart on 22 Jul 2014 at 9:46

GoogleCodeExporter commented 9 years ago
@torsten: a screenshot of the stack trace is attached. It crashes at the meta extraction task.

@richard.eckart: .classpath file is attached.

Original comment by alot...@gmail.com on 23 Jul 2014 at 8:12

Attachments:

GoogleCodeExporter commented 9 years ago
How much memory did you assign to that run? 
In unit classification mode, *each* classification unit will get its own CAS in 
the remaining pipeline. If you want to prevent that, you have to limit the 
classification unit annotations in the reader.

Original comment by daxenber...@gmail.com on 23 Jul 2014 at 9:11

GoogleCodeExporter commented 9 years ago
This looks like an error in UIMA that was fixed some time ago.

Unfortunately, your answer does not provide the information about the 
uimaj-core version that Richard was asking about.

Could you please provide that?

Original comment by torsten....@gmail.com on 23 Jul 2014 at 9:26

GoogleCodeExporter commented 9 years ago
The .classpath file only contains a reference to Maven but does not state what 
dependencies Maven injects. You need to look that up either in the pom.xml file 
(dependency hierarchy) or for a really definitive answer run your project in 
debug mode and look at the classpath set on the running debug instance. Then 
please just tell us the version that you believe is being used.

Please do not send screenshots unless we ask for them. Instead, please 
copy/paste the error message text here - this will also make it easier for 
other people with a similar problem to find the issue via a web search.

Original comment by richard.eckart on 23 Jul 2014 at 10:21

GoogleCodeExporter commented 9 years ago
@daxenber: 

In "eclipse.ini" I have:
-Xms512m
-Xmx2048m
-XX:PermSize=512M
-XX:MaxPermSize=2048M

Also, during reading in getNext(JCas aJCas), I add unit.addToIndexes(); and 
outcome.addToIndexes();. I'll need the existing annotations during feature 
extraction. 
Mostly, no preprocessing is needed. Are you suggesting discarding all 
annotations (except for unit/outcome) during reading?

@torsten & @richard.eckart: 

Effective POM has:
<dependency>
    <groupId>org.apache.uima</groupId>
    <artifactId>uimaj-core</artifactId>
    <version>2.4.2</version>
</dependency>

Original comment by alot...@gmail.com on 23 Jul 2014 at 10:29

GoogleCodeExporter commented 9 years ago
Check what other indirect uimaj dependencies you have and add all of them using 
version 2.6.0 to your POM - that should fix the memory leak with the binary CAS.

For reference, the related bug in UIMA: 
https://issues.apache.org/jira/browse/UIMA-3747
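For example (a sketch only - artifact IDs beyond uimaj-core are assumptions, so check your own dependency tree), pinning the UIMA version in the POM might look like:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Force all transitive uimaj dependencies to 2.6.0 -->
    <dependency>
      <groupId>org.apache.uima</groupId>
      <artifactId>uimaj-core</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.uima</groupId>
      <artifactId>uimaj-document-annotation</artifactId>
      <version>2.6.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```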

The eclipse.ini settings do not affect programs you start within Eclipse - the 
memory settings for those are in their respective "Run configurations" 
accessible via the "Run..." menu in Eclipse.

Original comment by richard.eckart on 23 Jul 2014 at 10:32

GoogleCodeExporter commented 9 years ago
Only the TextClassificationUnit annotations will be used to split existing 
documents into several CASes (the other annotations don't matter here). If you 
need all of them for feature extraction and classification, there's no way 
around that.

Original comment by daxenber...@gmail.com on 23 Jul 2014 at 10:40

GoogleCodeExporter commented 9 years ago
Any updates on this issue?

Original comment by daxenber...@gmail.com on 29 Jul 2014 at 9:52

GoogleCodeExporter commented 9 years ago
Not from my side. Converted to FrequencyDistribution FEs instead. 

Original comment by alot...@gmail.com on 29 Jul 2014 at 12:18

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 29 Jul 2014 at 12:44

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 14 Aug 2014 at 2:41