suhasbhairav / JuliaMachineLearning

A program that makes use of AdaGram and generates results

train a model on a lemmatized corpus #7

Open alexanderpanchenko opened 9 years ago

alexanderpanchenko commented 9 years ago
  1. Lemmatize the ukwac+wacky corpus using the Jobimify tool:

    frink:/home/panchenko/jobimify 
    panchenko@frink:~/jobimify$ java -jar joint-0.1.jar corpus/wiki-1.txt -wsd false -twsi false -dp false -g false -m true -ner false -l true -t false -tt false
    - Output file:
    - Print input text: false
    - Print input offset: true
    - Morphologial analysis: true
    - Global similar terms: false
    - Lowercase lemmas: true
    - Print similarity score: false
    - Type of JoBimText backend: db
    - Contextualization (WSD): false
    - Contextualization, max number of Bims: 10
    - Double tabs: false
    - Dependency parse: false
    - Named entity recognition: false
    - TWSI: false
    - Input file: corpus/wiki-1.txt
    - Initializing...
    - 0 lines processed
    - Producing resource from jar:file:/srv/home/panchenko/jobimify/joint-0.1.jar!/de/tudarmstadt/ukp/dkpro/core/opennlp/lib/tagger-en-maxent.bin
    - Producing resource took 1168ms
    - Producing resource from jar:file:/srv/home/panchenko/jobimify/joint-0.1.jar!/de/tudarmstadt/ukp/dkpro/core/api/lexmorph/tagset/en-default-pos.map
    - Producing resource took 1ms
    - 1 lines processed
    - Output file: corpus/wiki-1.txt.csv
  2. Get the lemmatized text

    cut -f 3 corpus/wiki-1.txt.csv | tr '\n' ' ' > lemmatized-corpus.txt
  3. Train the new version of the model:

    train.jl lemmatized-corpus.txt
  4. Train with several alpha parameters:

    Train 5 models with the following values: 0.1, 0.2, 0.4, 0.5, 0.75

    Relevant parameters of train.jl:

    "--alpha"
     help = "prior probability of allocating a new prototype"
     arg_type = Float64
     default = 0.1
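
The extraction and multi-alpha training steps above can be sketched as one small shell script. This is a dry run on a tiny demo csv: `echo` prints the train.jl invocations rather than executing them, the `--alpha` flag is taken from the argument table above, and the demo file names are illustrative.

```shell
# Dry-run sketch of steps 2-4 above on a tiny demo .csv (three
# tab-separated columns; the real jobimify output is assumed to keep
# the lemma in field 3, as in step 2).
printf 'tok1\tpos1\tlemma1\ntok2\tpos2\tlemma2\n' > wiki-demo.txt.csv

# Step 2: extract the lemma column and join lines with spaces.
cut -f 3 wiki-demo.txt.csv | tr '\n' ' ' > lemmatized-corpus.txt

# Step 4: one training run per prior value. Replace `echo` with the
# actual interpreter call (e.g. `julia`) to really train.
for alpha in 0.1 0.2 0.4 0.5 0.75; do
    echo train.jl lemmatized-corpus.txt --alpha "$alpha"
done > train-commands.txt
```

Running the five models in parallel (e.g. backgrounding each invocation) is an option if the machine has the memory for it.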
suhasbhairav commented 9 years ago

The following error was thrown when I tried to process the combined ukwac+wacky csv file. The error occurred after 25374000 lines had been processed.

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/exception/ExceptionUtils
        at de.tudarmstadt.ukp.dkpro.core.api.resources.ResourceObjectProviderBase.handleResolvingError(ResourceObjectProviderBase.java:853)
        at de.tudarmstadt.ukp.dkpro.core.api.resources.ResourceObjectProviderBase.configure(ResourceObjectProviderBase.java:504)
        at de.tudarmstadt.ukp.dkpro.core.api.resources.CasConfigurableProviderBase.configure(CasConfigurableProviderBase.java:36)
        at de.tudarmstadt.ukp.dkpro.core.api.resources.MappingProvider.configure(MappingProvider.java:53)
        at de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger.process(OpenNlpPosTagger.java:172)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
        at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
        at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
        at de.tudarmstadt.lt.jobimify.JoBimificator.process(JoBimificator.java:176)
        at de.tudarmstadt.lt.jobimify.JoBimificator.processFile(JoBimificator.java:317)
        at de.tudarmstadt.lt.jobimify.JoBimificator.main(JoBimificator.java:395)
    Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang.exception.ExceptionUtils
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 13 more

alexanderpanchenko commented 9 years ago

Try to process the rest of the file.

Use the head / tail commands to cut off the part of the corpus that has already been processed.
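
The head / tail suggestion above can be sketched like this (shown on a 5-line demo file; for the real corpus, DONE would be the 25374000 lines reported before the crash):

```shell
# Split the corpus at the last fully processed line so only the
# remainder is re-run.
DONE=3
printf 'l1\nl2\nl3\nl4\nl5\n' > corpus-demo.txt
head -n "$DONE" corpus-demo.txt > corpus-done.txt          # already processed
tail -n +"$((DONE + 1))" corpus-demo.txt > corpus-rest.txt # still to lemmatize
```

Note the `+` in `tail -n +N`, which means "start at line N" rather than "last N lines".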

suhasbhairav commented 9 years ago
    sed -i.bak 's/\t\t/\t/g' corpus.txt
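
For reference, a small demo of what that sed call does: it collapses double tabs into single tabs so that `cut` field numbers stay stable (GNU sed syntax; the quoting around the expression is needed so the shell does not eat the backslashes, and `-i.bak` keeps a backup of the original file).

```shell
# Collapse double tabs into single tabs, in place, keeping a .bak copy.
printf 'a\t\tb\tc\n' > sed-demo.txt
sed -i.bak 's/\t\t/\t/g' sed-demo.txt
```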
suhasbhairav commented 9 years ago

cut -f 5 u | tr '\n' ' '

alexanderpanchenko commented 9 years ago

Please lemmatize this corpus (about 59 GB), as described above: frink:/srv/data/en59g/*.txt

Use this corpus for training a model. Please complete the task by next Friday, and keep in mind that lemmatization/training will take a lot of time, so start early.

Note: the copying process is currently in progress; do not start lemmatization until there are 4 files in the directory, 59 GB in total.
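
A readiness check along the lines of the note above could look like this: proceed only once all 4 corpus files are present. Demonstrated on a temp dir with empty stand-in files; point DIR at /srv/data/en59g for the real check (a `du -cs` total-size check against the expected 59 GB could be added similarly).

```shell
# Count .txt files in the target directory; proceed only at 4.
DIR=$(mktemp -d)
for i in 1 2 3 4; do : > "$DIR/part$i.txt"; done   # stand-ins for the corpus files
COUNT=$(find "$DIR" -maxdepth 1 -name '*.txt' | wc -l)
if [ "$COUNT" -eq 4 ]; then
    echo "ready"
else
    echo "still copying ($COUNT/4)"
fi
```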

alexanderpanchenko commented 9 years ago

Update: the data are now on the server. You can start lemmatization.

suhasbhairav commented 9 years ago

There are four files in the directory en59g. Should I lemmatize them one by one or use *.txt?

suhasbhairav commented 9 years ago

It said "permission denied" when I tried to run it against *.txt:

    java.io.FileNotFoundException: /srv/data/en59g/en_news-2014.txt.csv (Permission denied)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:142)
        at java.io.FileWriter.<init>(FileWriter.java:78)
        at de.tudarmstadt.lt.jobimify.JoBimificator.processFile(JoBimificator.java:329)
        at de.tudarmstadt.lt.jobimify.JoBimificator.main(JoBimificator.java:432)

Have I been given the permission to access the files?

alexanderpanchenko commented 9 years ago

This is correct: you shouldn't write anything in this directory, only read from it. Specify the output in your home directory with the -o key.
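
That would look roughly like the following: keep reading from the shared directory and write the .csv into your home via -o. The exact -o semantics of the jobimify jar are an assumption here, so this is a dry run that only prints the command to be checked before running.

```shell
# Dry run: build the jobimify command with output redirected to $HOME.
OUT_DIR="$HOME/lemmatized"
mkdir -p "$OUT_DIR"
echo java -jar joint-0.1.jar /srv/data/en59g/en_news-2014.txt \
    -m true -l true -o "$OUT_DIR/en_news-2014.txt.csv" > jobimify-cmd.txt
```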

suhasbhairav commented 9 years ago

I have copied the same set of files to my directory and started the lemmatization process.

alexanderpanchenko commented 9 years ago

Normally you shouldn't copy the input files; instead, specify the output location in your home directory.

suhasbhairav commented 9 years ago

All the processes have stopped due to a lack of disk space. In every session I was running, I got the following exception:

    java.io.IOException: No space left on device

I had to remove the previous corpora on the disk, including the latest ukwack_wacky_corpus that was being generated, because it took more than 120 GB of space. Should I begin the process from the beginning, or tail the files and append the result to the already generated lemmatized corpus?
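
Before restarting, it is worth checking that the working filesystem actually has room for the ~120 GB of intermediate output; a quick check could look like this:

```shell
# Report free space on the current working directory's filesystem.
# -P gives the stable POSIX column layout; column 4 is the available
# space in 1 KB blocks.
AVAIL_KB=$(df -Pk . | awk 'NR==2 {print $4}')
echo "available: ${AVAIL_KB} KB"
```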

suhasbhairav commented 9 years ago

The files have been lemmatized; they are in /srv/data/en59g/. Let me know once you have checked the files for correctness, and I'll start training the model.