uhh-lt / josimtext

A system for word sense induction and disambiguation based on JoBimText approach
http://jobimtext.org/wsd

Comparing original and new trigram JoBimText (JBT) implementations #3

Closed alexanderpanchenko closed 8 years ago

alexanderpanchenko commented 9 years ago

General motivation

Computational lexical semantics is a subfield of Natural Language Processing that studies computational models of lexical items, such as words, noun phrases and multiword expressions. Modelling semantic relations between words (e.g. synonyms) and word senses (e.g. “python” as a programming language vs. “python” as a snake) is of practical interest in the context of various language processing and information retrieval applications.

During the last 20 years, several accurate computational models of lexical semantics have emerged, such as distributional semantics (Biemann, 2013; Baroni, 2011) and word embeddings (Mikolov, 2013). In this thesis, you will deal with one of the state-of-the-art approaches to lexical semantics, developed at TU Darmstadt, called JoBimText: http://jobimtext.org. According to multiple evaluations, the JoBimText approach yields cutting-edge accuracy on tasks such as semantic relatedness (Biemann, 2013). Besides, it enables features missing in other frameworks, such as automatic sense discovery.

The current implementation of JoBimText lets us process text corpora of up to 50 GB on a mid-sized Hadoop cluster with 400 cores and 50 TB of HDFS. Your goal will be to re-engineer the system so that it can process text corpora of up to 5 TB (100 times bigger) on the same cluster. This goal will be achieved by using the modern Apache Spark framework for distributed computation, which allows keeping intermediate data in memory instead of dumping temporary files to disk and thus lets us implement incremental algorithms more efficiently.

The ultimate goal of the project will be to develop a system that is able to compute a distributional thesaurus from the Common Crawl corpus (a 541 TB dataset on Amazon AWS).

This is supposed to be the biggest experiment in distributional semantics conducted so far. This will be in line with this initiative: http://www.webdatacommons.org/. Read this thesis for reference on a similar project: thesis.pdf

Motivation of the initial experiment

The initial experiment is needed as a proof of concept and to show the feasibility of the results. In this experiment, you will work with the trigram holing JoBimText (JBT) approach to the construction of a distributional thesaurus (DT). The goals of the experiment are to:

  1. Ensure by extensive testing that the new (Spark) implementation provides the same outputs as the original (MapReduce) implementation.
  2. Measure and compare performance of the original and the new implementations.

Implementation of the initial experiment

  1. Download the corpus: the Wikipedia corpus at http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/wacky-surface.csv. For testing purposes, make subcorpora of 50 MB and 500 MB (see the sketch after this list). First do all the experiments on these smaller chunks, then proceed with the entire 5 GB dataset. All experiments are conducted locally on your machine.
  2. Compute a trigram DT with the original JBT implementation:

    python generateHadoopScript.py -q shortrunning -hl trigram -nb corpora/en/wikipedia_eugen -f 5 -w 5 -wf 2
  3. Get the DT from the outputs of the original pipeline. Here is a description of the output formats: http://panchenko.me/jbt/
  4. Compute the same DT with the new pipeline (also using the trigram holing). Make sure to use exactly the same parameters! Follow the instructions here: https://github.com/tudarmstadt-lt/noun-sense-induction-scala. Use this script to get the parameters of the trigram holing without lemmatization: https://github.com/tudarmstadt-lt/noun-sense-induction-scala/blob/master/scripts/run-nsi-trigram-nolemma.sh
  5. Create a table in Google Docs comparing the original and the new DT outputs. Rows are runs; columns are the following measurements:
    • size of the input corpus, MB
    • number of words in DT: cat dt.csv | cut -f 1 | sort | uniq | wc -l
    • number of relations in DT
    • overlap of relations, percent
    • size of DT in MB
    • DT computation time in seconds on one core (measured with time)
    • output size of all files in MB
    • memory consumed in MB
  6. Put the results of the experiments with both pipelines online, e.g. on Google Drive.
  7. Write a report including the table above.
  8. Write an outline of the thesis. Add references, e.g. the Spark books and master theses listed below.
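
Below is a minimal sketch of how the test subcorpora and some of the table measurements could be obtained locally. The corpus/output file names and the run_pipeline.sh wrapper are placeholders, not part of the instructions above.

    # Build 50 MB and 500 MB test subcorpora from the downloaded corpus.
    head -c 50M  wacky-surface.csv > wacky-50mb.csv
    head -c 500M wacky-surface.csv > wacky-500mb.csv

    # Time one run and record peak memory with GNU time
    # (-v prints "Maximum resident set size"); run_pipeline.sh is a placeholder.
    /usr/bin/time -v bash run_pipeline.sh wacky-50mb.csv output-50mb 2> stats-50mb.txt

    # Basic DT statistics for the comparison table.
    cat dt.csv | cut -f 1 | sort | uniq | wc -l   # number of words in the DT
    wc -l < dt.csv                                # number of relations in the DT
    du -sm dt.csv                                 # size of the DT in MB
    du -sm output-50mb/                           # output size of all files in MB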

References

mmbspk commented 8 years ago

I tried to run both the default and the Wiki test corpus on the VM provided, but neither is running. Please find attached the screenshots of the log file: https://www.dropbox.com/sh/qy6s9cp1e68j7vb/AABsrN55Pgaw4wQvr5sSo5lUa?dl=0 Can you please provide a link for setting up the original system on my own OS rather than using the VM?

alexanderpanchenko commented 8 years ago

Please copy all the logs in textual format.

Yes, you can install the system without the VM: just install Hadoop in local or pseudo-distributed mode on your local machine and run the JoBimText jars from there. You can still copy the files from the VM for a test.
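
For reference, a rough sketch of a single-machine, pseudo-distributed Hadoop setup; the jar and class names in the last command are placeholders, since the real ones come from the script generated by generateHadoopScript.py.

    # Assumes Hadoop 2.x configured for a single-node cluster
    # (core-site.xml and hdfs-site.xml as in the official single-node guide).
    hdfs namenode -format      # one-time formatting of the local HDFS
    start-dfs.sh               # start NameNode and DataNode
    start-yarn.sh              # start ResourceManager and NodeManager

    # Put a test corpus into HDFS.
    hdfs dfs -mkdir -p /corpus
    hdfs dfs -put wacky-50mb.csv /corpus/

    # Run a JoBimText job (placeholder jar/class; take the real ones
    # from the generated Hadoop script).
    hadoop jar jobimtext.jar SomeHolingJobClass /corpus /corpus_trigram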


mmbspk commented 8 years ago

Please find the log files here: https://www.dropbox.com/sh/qy6s9cp1e68j7vb/AABsrN55Pgaw4wQvr5sSo5lUa?dl=0

From what I understand, it is because of running it on a local/single system rather than a cluster, as the grunt-related code is specific to the cluster.

I installed the system on my local machine, but it is still not running. I am trying to complete the required things to run locally, also trying both license releases provided with different holing operations.

alexanderpanchenko commented 8 years ago

Hi,

The logs make no sense to me. How do you try to run the system? It is strange that the VM doesn't work. Do you follow the instructions closely? Please double-check this. If the original JBT doesn't work, first try the new version. It should work normally locally. If it still doesn't work, let us meet and look into the problem.

mmbspk commented 8 years ago

The VM works fine now with 1 MB and 5 MB for the trigram, but when I move to 50 MB it crashes with the following logs, and after that, on fresh tries, it does not even generate logs. You can see the image named 'VisibleErrorwith50MB.jpg' with the last screen error.

https://www.dropbox.com/sh/p0c14ihl2igz9sm/AABC9Nx_pUNhA1RQPmjvxyDTa?dl=0

It ran, created the corpus_trigram folder in HDFS and filled it with around 161 MB of data, but for the further tasks it crashed without generating any logs.

The current version compiles fine, but when it runs the class it throws the IllegalArgumentException below. I tried almost all combinations, but it doesn't seem to work. Last time I ran it from Eclipse.

$ bash mvn-hadoop de.tudarmstadt.lt.wsi.JoBimExtractAndCount -Dmapreduce.map.memory.mb=4096 -Dmapreduce.task.io.sort.mb=1028 -Dmapreduce.local.map.tasks.maximum=4 -Dholing.dependencies=true -Dholing.coocs=true -Dmapred.max.split.size=1000000 "/corpus" "/output"

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building noun-sense-induction 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ noun-sense-induction ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ noun-sense-induction ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ noun-sense-induction ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ noun-sense-induction ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:2.3:testResources (default-testResources) @ noun-sense-induction ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ noun-sense-induction ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.14:test (default-test) @ noun-sense-induction ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ noun-sense-induction ---
[INFO]
[INFO] --- maven-dependency-plugin:2.8:build-classpath (default-cli) @ noun-sense-induction ---
[INFO] Skipped writing classpath file '/home/mmbspk/noun-sense-induction-master/.dependency-jars'. No changes found.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.282 s
[INFO] Finished at: 2015-12-09T13:19:52+01:00
[INFO] Final Memory: 20M/318M
[INFO] ------------------------------------------------------------------------

Exception in thread "main" java.lang.IllegalArgumentException: File name can't be empty string
    at org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:390)
    at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:299)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:487)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
    at de.tudarmstadt.lt.wsi.JoBimExtractAndCount.main(JoBimExtractAndCount.java:73)

alexanderpanchenko commented 8 years ago

Regarding the VM: suspend this for now and focus on the newer version.

Switch from Eclipse to IntelliJ IDEA for both the Java and Scala projects. Import the Maven/sbt projects. For me it just runs fine like this. To get the Ultimate version of IDEA, register with your tu-darmstadt email.

mmbspk commented 8 years ago

After setting up all the paths and everything, I ran the script and I am getting a YARN exception related to an unknown queue. The console messages are below. From what I have looked up online, I think I have to put the queues into the configuration file.


/usr/bin/env bash /home/mmbspk/noun-sense-induction-scala-master/scripts/run-nsi-trigram-nolemma.sh corpus WSI_OUT/wordsim true true shortrunning

Corpus: corpus
Output: WSI_OUT/wordsim
Features: WSI_OUT/wordsim/Holing-trigram_Lemmatize-false_Coocs-false_MaxLen-110_NounsOnly-false_NounNounOnly-false_Semantify-true
Similarities: WSI_OUT/wordsim/Holing-trigram_Lemmatize-false_Coocs-false_MaxLen-110_NounsOnly-false_NounNounOnly-false_Semantify-true__Significance-LMI_WordsPerFeature-1000_FeaturesPerWord-1000_MinWordFreq-5_MinFeatureFreq-5_MinWordFeatureFreq-2_MinFeatureSignif-0.0_SimPrecision-5_NearestNeighboursNum-200
Calculate features: true
Features exist: false
Calculate similarities: true
Similarity exist: false
To start press any key, to stop press Ctrl+C

args: [corpus, WSI_OUT/wordsim/Holing-trigram_Lemmatize-false_Coocs-false_MaxLen-110_NounsOnly-false_NounNounOnly-false_Semantify-true]
15/12/12 19:11:00 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
15/12/12 19:11:00 INFO Configuration.deprecation: mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
15/12/12 19:11:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/12 19:11:00 INFO input.FileInputFormat: Total input paths to process : 1
15/12/12 19:11:01 INFO mapreduce.JobSubmitter: number of splits:52
15/12/12 19:11:01 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/12/12 19:11:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1449924685125_0009
15/12/12 19:11:01 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/mmbspk/.staging/job_1449924685125_0009
Exception in thread "main" java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1449924685125_0009 to YARN : Application application_1449924685125_0009 submitted by user mmbspk to unknown queue: shortrunning
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:306)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:243)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at de.tudarmstadt.lt.wsi.JoBimExtractAndCount.runJob(JoBimExtractAndCount.java:56)
    at de.tudarmstadt.lt.wsi.JoBimExtractAndCount.run(JoBimExtractAndCount.java:67)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at de.tudarmstadt.lt.wsi.JoBimExtractAndCount.main(JoBimExtractAndCount.java:73)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1449924685125_0009 to YARN : Application application_1449924685125_0009 submitted by user mmbspk to unknown queue: shortrunning
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:270)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:290)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:290)
    ... 12 more
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/12 19:11:04 INFO Client: Requesting a new application from cluster with 1 NodeManagers
15/12/12 19:11:04 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/12/12 19:11:04 INFO Client: Will allocate AM container, with 7884 MB memory including 716 MB overhead
15/12/12 19:11:04 INFO Client: Setting up container launch context for our AM
15/12/12 19:11:04 INFO Client: Setting up the launch environment for our AM container
15/12/12 19:11:04 INFO Client: Preparing resources for our AM container
15/12/12 19:11:04 INFO Client: Source and destination file systems are the same. Not copying file:/usr/local/spark/lib/spark-assembly-1.5.2-hadoop2.6.0.jar
15/12/12 19:11:04 INFO Client: Source and destination file systems are the same. Not copying file:/home/mmbspk/bin-spark/nsi_2.10-0.0.1.jar
15/12/12 19:11:04 INFO Client: Source and destination file systems are the same. Not copying file:/tmp/spark-384037c8-4a52-4d11-9f32-ecf577b24ae6/__spark_conf__3954583964329542407.zip
15/12/12 19:11:05 INFO SecurityManager: Changing view acls to: mmbspk
15/12/12 19:11:05 INFO SecurityManager: Changing modify acls to: mmbspk
15/12/12 19:11:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mmbspk); users with modify permissions: Set(mmbspk)
15/12/12 19:11:05 INFO Client: Submitting application 10 to ResourceManager
15/12/12 19:11:05 INFO YarnClientImpl: Submitted application application_1449924685125_0010
15/12/12 19:11:06 INFO Client: Application report for application_1449924685125_0010 (state: FAILED)
15/12/12 19:11:06 INFO Client:
    client token: N/A
    diagnostics: Application application_1449924685125_0010 submitted by user mmbspk to unknown queue: shortrunning
    ApplicationMaster host: N/A
    ApplicationMaster RPC port: -1
    queue: shortrunning
    start time: 1449943865318
    final status: FAILED
    tracking URL: http://mmbspkU:8088/proxy/application_1449924685125_0010/
    user: mmbspk
Exception in thread "main" org.apache.spark.SparkException: Application application_1449924685125_0010 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:925)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:971)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/12/12 19:11:06 INFO ShutdownHookManager: Shutdown hook called
15/12/12 19:11:06 INFO ShutdownHookManager: Deleting directory /tmp/spark-384037c8-4a52-4d11-9f32-ecf577b24ae6

Process finished with exit code 1

alexanderpanchenko commented 8 years ago

Let us meet next week and fix this: next Friday at 17:00 in my office (sorry, I am very busy next week). Briefly: do not use run-nsi-trigram-nolemma.sh, as it has parameters specific to the cluster. You rather need to run the Java/Scala classes (see the bash script) directly from the CLI or IntelliJ.

e.g. like this for spark: -Dspark.master=local[4] -Xms3G -Xmx6G -Dspark.executor.memory=6G
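
For the Spark pipeline, a local run could look roughly like the following spark-submit invocation; the main class name is a placeholder (take the real one from run-nsi-trigram-nolemma.sh), while the jar name mirrors the one visible in the logs above.

    # Hypothetical local run; the --class value is a placeholder.
    spark-submit \
      --master local[4] \
      --driver-memory 6G \
      --executor-memory 6G \
      --class de.tudarmstadt.lt.SomeSimilarityJob \
      nsi_2.10-0.0.1.jar \
      corpus output_dir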

alexanderpanchenko commented 8 years ago

Exception in thread "main" java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1449924685125_0009 to YARN : Application application_1449924685125_0009 submitted by user mmbspk to unknown queue: shortrunning

This is the problem: the notion of queues only exists on our cluster. Remove this stuff.
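
If the queue setting cannot simply be removed from the scripts, one possible workaround (a sketch, not taken from the original scripts) is to override it with YARN's default queue:

    # MapReduce side: submit to the default queue instead of "shortrunning",
    # e.g. by adding -Dmapreduce.job.queuename=default to the existing command:
    bash mvn-hadoop de.tudarmstadt.lt.wsi.JoBimExtractAndCount \
      -Dmapreduce.job.queuename=default \
      -Dholing.dependencies=true -Dholing.coocs=true \
      "/corpus" "/output"

    # Spark on YARN side: the analogous spark-submit option is --queue, e.g.
    #   spark-submit --queue default ...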

mmbspk commented 8 years ago

Following is the link to the document. I have run both phases and put the results in the document; a few calculations are still missing. I will also try to run the original JoBimText version on my local machine. We will meet this Friday; if possible, can you make it a bit earlier?

https://docs.google.com/spreadsheets/d/1bSzEFxRSddkooQ2O0kj7CV28yHZP0m76MlzMy5HKHF0/edit?usp=sharing

alexanderpanchenko commented 8 years ago

we can make it 10:00

alexanderpanchenko commented 8 years ago

thank you

alexanderpanchenko commented 8 years ago

  1. Fill the table (a sketch for the relation overlap follows below):

    cat SimPruned/p* | cut -f 1 | sort | uniq | wc -l
    cat SimPruned/p* | wc -l
    cat SimPruned/p* | cut -f 1,2 | less

  2. Make some plots out of the table.
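
The relation overlap from the comparison table is not covered by the commands above; here is a sketch of one way to compute it, where dt1/dt2 are placeholder directories holding the two pruned similarity outputs:

    # Extract sorted (word, related word) pairs from both DTs and intersect them.
    cat dt1/SimPruned/p* | cut -f 1,2 | sort -u > rels1.txt
    cat dt2/SimPruned/p* | cut -f 1,2 | sort -u > rels2.txt
    comm -12 rels1.txt rels2.txt | wc -l     # number of shared relations
    # Overlap in percent, relative to the first DT:
    awk -v shared=$(comm -12 rels1.txt rels2.txt | wc -l) \
        -v total=$(wc -l < rels1.txt) \
        'BEGIN { printf "%.2f%%\n", 100 * shared / total }'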
alexanderpanchenko commented 8 years ago

Apache Flink

MPI

CUDA

mmbspk commented 8 years ago

I ran and got results on the cluster for the original JoBimText. I am a bit confused: since I just ran the code I had without changing anything, how would I know how many cores and how much RAM it used? Also, is there somewhere in the code where I can change the resource settings, so that I would be running both the original and the new system on the same amount of resources? If I do the same for the new system, is Spark installed on the cluster?

alexanderpanchenko commented 8 years ago

Hello, sorry for the late response. Check the Hadoop YARN docs to get information about the resources consumed by a job. Also, please try to access the web interface, as it provides much extra information. You will need to do some port forwarding to open it, though.
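
For example, one common way to reach the YARN web UI and per-application resource numbers from your own machine (host and user names are placeholders):

    # Forward the ResourceManager web UI (default port 8088) over SSH,
    # then open http://localhost:8088 in a local browser.
    ssh -L 8088:localhost:8088 your_user@cluster-gateway

    # Resource usage of a finished job can also be queried on the cluster itself,
    # e.g. for the application id seen in the logs above:
    yarn application -status application_1449924685125_0010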