suhasbhairav / JuliaMachineLearning

A program that makes use of AdaGram and generates results

train a model on a lemmatized corpus #7

Open alexanderpanchenko opened 9 years ago

alexanderpanchenko commented 9 years ago
  1. Lemmatize the ukwac+wacky corpus using the Jobimify tool:

    frink:/home/panchenko/jobimify 
    panchenko@frink:~/jobimify$ java -jar joint-0.1.jar corpus/wiki-1.txt -wsd false -twsi false -dp false -g false -m true -ner false -l true -t false -tt false
    - Output file:
    - Print input text: false
    - Print input offset: true
    - Morphologial analysis: true
    - Global similar terms: false
    - Lowercase lemmas: true
    - Print similarity score: false
    - Type of JoBimText backend: db
    - Contextualization (WSD): false
    - Contextualization, max number of Bims: 10
    - Double tabs: false
    - Dependency parse: false
    - Named entity recognition: false
    - TWSI: false
    - Input file: corpus/wiki-1.txt
    - Initializing...
    - 0 lines processed
    - Producing resource from jar:file:/srv/home/panchenko/jobimify/joint-0.1.jar!/de/tudarmstadt/ukp/dkpro/core/opennlp/lib/tagger-en-maxent.bin
    - Producing resource took 1168ms
    - Producing resource from jar:file:/srv/home/panchenko/jobimify/joint-0.1.jar!/de/tudarmstadt/ukp/dkpro/core/api/lexmorph/tagset/en-default-pos.map
    - Producing resource took 1ms
    - 1 lines processed
    - Output file: corpus/wiki-1.txt.csv
  2. Get the lemmatized text

    cut -f 3 corpus/wiki-1.txt.csv | tr '\n' ' ' > lemmatized-corpus.txt
  3. Train the new version of the model:

    train.jl lemmatized-corpus.txt
  4. Train with several alpha parameters:

    Train 5 models with the following values: 0.1, 0.2, 0.4, 0.5, 0.75

    Relevant parameters of train.jl:

    "--alpha"
     help = "prior probability of allocating a new prototype"
     arg_type = Float64
     default = 0.1
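
The extraction and multi-alpha training steps above can be sketched as one small shell script. This is a dry run on a tiny demo csv: `echo` prints the train.jl invocations rather than executing them, the `--alpha` flag is taken from the argument table above, and the demo file names are illustrative.

```shell
# Dry-run sketch of steps 2-4 above on a tiny demo .csv (three
# tab-separated columns; the real jobimify output is assumed to keep
# the lemma in field 3, as in step 2).
printf 'tok1\tpos1\tlemma1\ntok2\tpos2\tlemma2\n' > wiki-demo.txt.csv

# Step 2: extract the lemma column and join lines with spaces.
cut -f 3 wiki-demo.txt.csv | tr '\n' ' ' > lemmatized-corpus.txt

# Step 4: one training run per prior value. Replace `echo` with the
# actual interpreter call (e.g. `julia`) to really train.
for alpha in 0.1 0.2 0.4 0.5 0.75; do
    echo train.jl lemmatized-corpus.txt --alpha "$alpha"
done > train-commands.txt
```

Running the five models in parallel (e.g. backgrounding each invocation) is an option if the machine has the memory for it.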
suhasbhairav commented 9 years ago

The following error was thrown when I tried to process the combined ukwac+wacky csv file. The error occurred after 25374000 lines had been processed.

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/exception/ExceptionUtils
        at de.tudarmstadt.ukp.dkpro.core.api.resources.ResourceObjectProviderBase.handleResolvingError(ResourceObjectProviderBase.java:853)
        at de.tudarmstadt.ukp.dkpro.core.api.resources.ResourceObjectProviderBase.configure(ResourceObjectProviderBase.java:504)
        at de.tudarmstadt.ukp.dkpro.core.api.resources.CasConfigurableProviderBase.configure(CasConfigurableProviderBase.java:36)
        at de.tudarmstadt.ukp.dkpro.core.api.resources.MappingProvider.configure(MappingProvider.java:53)
        at de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger.process(OpenNlpPosTagger.java:172)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
        at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
        at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
        at de.tudarmstadt.lt.jobimify.JoBimificator.process(JoBimificator.java:176)
        at de.tudarmstadt.lt.jobimify.JoBimificator.processFile(JoBimificator.java:317)
        at de.tudarmstadt.lt.jobimify.JoBimificator.main(JoBimificator.java:395)
    Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang.exception.ExceptionUtils
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 13 more

alexanderpanchenko commented 9 years ago

Try to process the rest of the file.

Use the head / tail commands to cut off the part of the corpus that has already been processed.
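
The head / tail suggestion above can be sketched like this (shown on a 5-line demo file; for the real corpus, DONE would be the 25374000 lines reported before the crash):

```shell
# Split the corpus at the last fully processed line so only the
# remainder is re-run.
DONE=3
printf 'l1\nl2\nl3\nl4\nl5\n' > corpus-demo.txt
head -n "$DONE" corpus-demo.txt > corpus-done.txt          # already processed
tail -n +"$((DONE + 1))" corpus-demo.txt > corpus-rest.txt # still to lemmatize
```

Note the `+` in `tail -n +N`, which means "start at line N" rather than "last N lines".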

suhasbhairav commented 9 years ago
    sed -i.bak 's/\t\t/\t/g' corpus.txt
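
For reference, a small demo of what that sed call does: it collapses double tabs into single tabs so that `cut` field numbers stay stable (GNU sed syntax; the quoting around the expression is needed so the shell does not eat the backslashes, and `-i.bak` keeps a backup of the original file).

```shell
# Collapse double tabs into single tabs, in place, keeping a .bak copy.
printf 'a\t\tb\tc\n' > sed-demo.txt
sed -i.bak 's/\t\t/\t/g' sed-demo.txt
```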
suhasbhairav commented 9 years ago

cut -f 5 u | tr '\n' ' '

alexanderpanchenko commented 9 years ago

Please lemmatize this corpus (about 59 GB), as described above: frink:/srv/data/en59g/*.txt

Use this corpus for training a model. Please complete the task by next Friday, and keep in mind that lemmatization/training will take a lot of time, so start early.

Note: the copying process is currently in progress; do not start lemmatization until there are 4 files in the directory, 59 GB in total.
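
A readiness check along the lines of the note above could look like this: proceed only once all 4 corpus files are present. Demonstrated on a temp dir with empty stand-in files; point DIR at /srv/data/en59g for the real check (a `du -cs` total-size check against the expected 59 GB could be added similarly).

```shell
# Count .txt files in the target directory; proceed only at 4.
DIR=$(mktemp -d)
for i in 1 2 3 4; do : > "$DIR/part$i.txt"; done   # stand-ins for the corpus files
COUNT=$(find "$DIR" -maxdepth 1 -name '*.txt' | wc -l)
if [ "$COUNT" -eq 4 ]; then
    echo "ready"
else
    echo "still copying ($COUNT/4)"
fi
```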

alexanderpanchenko commented 9 years ago

Update: the data are now on the server. You can start lemmatization.

suhasbhairav commented 9 years ago

There are four files in the directory en59g. Should I lemmatize them one by one or use *.txt?

suhasbhairav commented 9 years ago

It said "permission denied" when I tried to run it against *.txt:

    java.io.FileNotFoundException: /srv/data/en59g/en_news-2014.txt.csv (Permission denied)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:142)
        at java.io.FileWriter.<init>(FileWriter.java:78)
        at de.tudarmstadt.lt.jobimify.JoBimificator.processFile(JoBimificator.java:329)
        at de.tudarmstadt.lt.jobimify.JoBimificator.main(JoBimificator.java:432)

Have I been given the permission to access the files?

alexanderpanchenko commented 9 years ago

This is correct: you shouldn't write anything in this directory, only read from it. Specify the output in your home directory with the -o key.
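
That would look roughly like the following: keep reading from the shared directory and write the .csv into your home via -o. The exact -o semantics of the jobimify jar are an assumption here, so this is a dry run that only prints the command to be checked before running.

```shell
# Dry run: build the jobimify command with output redirected to $HOME.
OUT_DIR="$HOME/lemmatized"
mkdir -p "$OUT_DIR"
echo java -jar joint-0.1.jar /srv/data/en59g/en_news-2014.txt \
    -m true -l true -o "$OUT_DIR/en_news-2014.txt.csv" > jobimify-cmd.txt
```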

suhasbhairav commented 9 years ago

I have copied the same set of files to my directory and started the lemmatization process.

alexanderpanchenko commented 9 years ago

Normally you shouldn't copy the input files; instead, specify the output location in your home directory.

suhasbhairav commented 9 years ago

All the processes have stopped due to a lack of disk space. In every session I was running, I got the following exception:

    java.io.IOException: No space left on device

I had to remove the previous corpora on the disk, including the latest ukwack_wacky_corpus that was being generated, because it took more than 120 GB of space. Should I begin the process from the beginning, or tail the files and append the result to the already generated lemmatized corpus?
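
Before restarting, it is worth checking that the working filesystem actually has room for the ~120 GB of intermediate output; a quick check could look like this:

```shell
# Report free space on the current working directory's filesystem.
# -P gives the stable POSIX column layout; column 4 is the available
# space in 1 KB blocks.
AVAIL_KB=$(df -Pk . | awk 'NR==2 {print $4}')
echo "available: ${AVAIL_KB} KB"
```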

suhasbhairav commented 9 years ago

The files have been lemmatized; they are in /srv/data/en59g/. Let me know once you have checked the files for correctness, and I'll start training the model.