Open alexanderpanchenko opened 9 years ago
The following error was thrown when I tried to run the ukwac+wacky combined csv file. The error occurred after 25374000 lines were processed.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/exception/ExceptionUtils
at de.tudarmstadt.ukp.dkpro.core.api.resources.ResourceObjectProviderBase.handleResolvingError(ResourceObjectProviderBase.java:853)
at de.tudarmstadt.ukp.dkpro.core.api.resources.ResourceObjectProviderBase.configure(ResourceObjectProviderBase.java:504)
at de.tudarmstadt.ukp.dkpro.core.api.resources.CasConfigurableProviderBase.configure(CasConfigurableProviderBase.java:36)
at de.tudarmstadt.ukp.dkpro.core.api.resources.MappingProvider.configure(MappingProvider.java:53)
at de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger.process(OpenNlpPosTagger.java:172)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
at de.tudarmstadt.lt.jobimify.JoBimificator.process(JoBimificator.java:176)
at de.tudarmstadt.lt.jobimify.JoBimificator.processFile(JoBimificator.java:317)
at de.tudarmstadt.lt.jobimify.JoBimificator.main(JoBimificator.java:395)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang.exception.ExceptionUtils
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more
try to process the rest of the file
use head / tail commands to cut the part of the corpus already processed
sed -i.bak 's/\t\t/\t/g' corpus.txt
cut -f 5 u | tr '\n' ' '
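A minimal sketch of the head / tail step, assuming the goal is to skip the lines that were already processed (25374000 in the report above) and keep only the remainder. The function name skip_processed and the file names are hypothetical, and the demo runs on a tiny throwaway file rather than the real corpus.

```shell
# skip_processed <processed_lines> <input> <output>
# Writes everything after the first <processed_lines> lines of <input> to <output>.
# "tail -n +K" starts printing at line K, hence the +1.
skip_processed() {
  tail -n "+$(( $1 + 1 ))" "$2" > "$3"
}

# Demonstration on a 5-line file (hypothetical names): skip the first 3 lines.
demo_dir="$(mktemp -d)"
printf 'line1\nline2\nline3\nline4\nline5\n' > "$demo_dir/corpus.txt"
skip_processed 3 "$demo_dir/corpus.txt" "$demo_dir/rest.txt"
cat "$demo_dir/rest.txt"   # prints line4 and line5
```

For the real run, the first argument would presumably be 25374000, with the combined csv file as input and an output path in your home directory.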
Please lemmatize this corpus (about 59 GB) as described above: frink:/srv/data/en59g/*.txt
Use this corpus for training a model. Please complete the task by next Friday and consider that lemmatization/training will take a lot of time, so start early.
Note: copying is still in progress. Do not start lemmatization until the directory contains 4 files totaling 59 GB.
Update -- the data are now on the server. You can start lemmatization.
There are four files in the directory en59g. Should I lemmatize them one by one or use *.txt?
It says permission denied when I tried to run it against *.txt
java.io.FileNotFoundException: /srv/data/en59g/en_news-2014.txt.csv (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.
Have I been given the permission to access the files?
This is correct: you shouldn't write anything in this directory, only read. Specify the output in your home directory with the -o key.
I have copied the same set of files to my directory and started the lemmatization process.
Normally you shouldn't copy the input files; rather, specify the output in your home directory.
All the processes have stopped due to a lack of space. I got the following error on all the sessions I was running.
I got the following exception: java.io.IOException : No space left on device.
I had to remove the previous corpora on the disk, including the latest ukwack_wacky_corpus that was being generated, because it took more than 120 GB of space. Should I restart the process from the beginning, or tail the files and append the result to the already generated lemmatized corpus?
The files have been lemmatized. Let me know once you have checked the files for correctness. I'll start training the model. They are in this location: /srv/data/en59g/
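One way to avoid both the permission error and the out-of-space failure is to check the output directory before starting a long run. This is only a sketch: the function name has_room is hypothetical, and the demo uses a temporary directory instead of a real output location.

```shell
# has_room <dir> <needed_kb>
# Succeeds only if <dir> exists, is writable by the current user,
# and its filesystem has at least <needed_kb> kilobytes available.
has_room() {
  dir=$1
  need_kb=$2
  [ -d "$dir" ] && [ -w "$dir" ] || return 1
  # df -P gives one line per filesystem; column 4 is the available space in KB.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge "$need_kb" ]
}

# Demonstration with a temporary directory standing in for the output directory.
out_dir="$(mktemp -d)"
if has_room "$out_dir" 1; then
  echo "ok: $out_dir is writable and has free space"
else
  echo "not enough space or no write permission: $out_dir" >&2
fi
```

Since the output mentioned above grew past 120 GB, a threshold of roughly 125829120 KB (120 * 1024 * 1024) might be a reasonable starting point, though the actual requirement is an assumption.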
lemmatize the ukwac+wacky corpus using the Jobimify tool:
Get the lemmatized text
train the new version of the model
train with several alpha parameters
Train 5 models with the following values: 0.1, 0.2, 0.4, 0.5, 0.75
params of the train.JL
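The alpha sweep from the list above could be scripted roughly as follows. The thread does not show the actual training command or its parameters, so train_model here is a hypothetical stand-in that only prints what it would do; replace its body with the real invocation.

```shell
# Hypothetical placeholder for the real training command (not shown in this thread).
train_model() {
  echo "training with alpha=$1 -> model_alpha_$1"
}

# Run one training job per alpha value listed in the task.
train_all() {
  for alpha in 0.1 0.2 0.4 0.5 0.75; do
    train_model "$alpha"
  done
}

train_all
```

Keeping the alpha in the output name makes it easier to tell the five models apart afterwards.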