idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Prepare.sh problem #5

Closed munichong closed 9 years ago

munichong commented 9 years ago

Hi, I am trying to run prepare.sh. I am using a Mac.

This is how I run it:

```
sudo sh prepare.sh en_US data/
```

Downloading the wiki dump and installing the packages (e.g. Hadoop and Spark) all went fine, and compiling wiki2vec also printed a lot of "SUCCESSFUL". However, I started to receive exceptions when the program tried to create the readable wiki:

```
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: data/enwiki-latest.lines (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
	at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
	at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
	at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file://data//enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 00:37:07 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 00:37:07 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 00:37:07 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 00:37:08 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 00:37:08 INFO Remoting:74 - Starting remoting
2015-06-13 00:37:09 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkDriver@res1cwang-m1.home:53495]
2015-06-13 00:37:09 INFO Utils:59 - Successfully started service 'sparkDriver' on port 53495.
2015-06-13 00:37:09 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 00:37:09 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 00:37:09 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613003709-d48b
2015-06-13 00:37:10 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 00:37:12 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 00:37:13 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-b6dd5609-bb7d-4b8c-974c-272f3c32fd76
2015-06-13 00:37:13 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 00:37:13 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 00:37:13 INFO AbstractConnector:338 - Started SocketConnector@0.0.0.0:53496
2015-06-13 00:37:13 INFO Utils:59 - Successfully started service 'HTTP file server' on port 53496.
2015-06-13 00:37:13 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 00:37:13 INFO AbstractConnector:338 - Started SelectChannelConnector@0.0.0.0:4040
2015-06-13 00:37:13 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 00:37:13 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040 2015-06-13 00:37:15 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:53496/jars/wiki2vec-assembly-1.0.jar with timestamp 1434170235402 2015-06-13 00:37:16 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@res1cwang-m1.home:53495/user/HeartbeatReceiver 2015-06-13 00:37:16 INFO NettyBlockTransferService:59 - Server created on 53497 2015-06-13 00:37:16 INFO BlockManagerMaster:59 - Trying to register BlockManager 2015-06-13 00:37:16 INFO BlockManagerMasterActor:59 - Registering block manager localhost:53497 with 265.1 MB RAM, BlockManagerId(, localhost, 53497) 2015-06-13 00:37:16 INFO BlockManagerMaster:59 - Registered BlockManager java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.(FileInputStream.java:138) at scala.io.Source$.fromFile(Source.scala:90) at scala.io.Source$.fromFile(Source.scala:75) at scala.io.Source$.fromFile(Source.scala:53) at scala.io.Source$.fromFile(Source.scala:59) at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58) at org.idio.wikipedia.redirects.MapRedirectStore.(RedirectStore.scala:34) at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172) at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) using empty redirect store.. 
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440 2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB) 2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440 2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB) 2015-06-13 00:37:17 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:53497 (size: 22.2 KB, free: 265.1 MB) 2015-06-13 00:37:17 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0 2015-06-13 00:37:17 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25 2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440 2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB) 2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440 2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB) 2015-06-13 00:37:17 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:53497 (size: 86.0 B, free: 265.1 MB) 2015-06-13 00:37:17 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0 2015-06-13 00:37:17 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30 2015-06-13 00:37:18 INFO deprecation:1009 - mapred.tip.id is deprecated. Instead, use mapreduce.task.id 2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 2015-06-13 00:37:18 INFO deprecation:1009 - mapred.job.id is deprecated. 
Instead, use mapreduce.job.id Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://data/enwiki-latest.lines, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79) at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397) at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57) at org.apache.hadoop.fs.Globber.glob(Globber.java:248) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1074) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
	at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
	at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
	at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
	at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 89: data//enwiki.corpus: No such file or directory
^___^ corpus : data//enwiki.corpus
```

I guessed the reason was that I passed a relative path, not an absolute path. So I commented out some lines in prepare.sh, since the wiki dump and packages were already downloaded, and re-ran it like this:

```
sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data
```

Below is what I got:

```
Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: /Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
	at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
	at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
	at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 10:14:03 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 10:14:03 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 10:14:03 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 10:14:03 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 10:14:03 INFO Remoting:74 - Starting remoting
2015-06-13 10:14:03 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkDriver@res1cwang-m1.home:58893]
2015-06-13 10:14:03 INFO Utils:59 - Successfully started service 'sparkDriver' on port 58893.
2015-06-13 10:14:03 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 10:14:03 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 10:14:03 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613101403-e43b
2015-06-13 10:14:03 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 10:14:04 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 10:14:04 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-846c5c00-63bc-4070-b1ef-903b6fcd3567
2015-06-13 10:14:04 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 10:14:04 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:14:04 INFO AbstractConnector:338 - Started SocketConnector@0.0.0.0:58894
2015-06-13 10:14:04 INFO Utils:59 - Successfully started service 'HTTP file server' on port 58894.
2015-06-13 10:14:04 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:14:04 INFO AbstractConnector:338 - Started SelectChannelConnector@0.0.0.0:4040
2015-06-13 10:14:04 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 10:14:04 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040 2015-06-13 10:14:05 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:58894/jars/wiki2vec-assembly-1.0.jar with timestamp 1434204845068 2015-06-13 10:14:05 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@res1cwang-m1.home:58893/user/HeartbeatReceiver 2015-06-13 10:14:05 INFO NettyBlockTransferService:59 - Server created on 58895 2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Trying to register BlockManager 2015-06-13 10:14:05 INFO BlockManagerMasterActor:59 - Registering block manager localhost:58895 with 265.1 MB RAM, BlockManagerId(, localhost, 58895) 2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Registered BlockManager java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.(FileInputStream.java:138) at scala.io.Source$.fromFile(Source.scala:90) at scala.io.Source$.fromFile(Source.scala:75) at scala.io.Source$.fromFile(Source.scala:53) at scala.io.Source$.fromFile(Source.scala:59) at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58) at org.idio.wikipedia.redirects.MapRedirectStore.(RedirectStore.scala:34) at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172) at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) using empty redirect store.. 
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440 2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB) 2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440 2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB) 2015-06-13 10:14:05 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:58895 (size: 22.2 KB, free: 265.1 MB) 2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0 2015-06-13 10:14:05 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25 2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440 2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB) 2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440 2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB) 2015-06-13 10:14:05 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:58895 (size: 86.0 B, free: 265.1 MB) 2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Updated info of block broadcast_1piece0 2015-06-13 10:14:05 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30 Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/cwang/Downloads/wiki2vec-master/working/enwiki already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1041) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164) at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139) at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185) at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Joining corpus.. prepare.sh: line 89: /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus: No such file or directory ^**^ corpus : /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus RES1CWANG-M1:wiki2vec-master cwang$ sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data prepare.sh: line 23: 2: No such file or directory Language: en Working directory: /Users/cwang/Downloads/wiki2vec-master/working Creating Readable Wiki.. 
Exception in thread "main" java.io.FileNotFoundException: /Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines (No such file or directory) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:101) at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30) at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55) at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala) Creating Word2vec Corpus Spark assembly has been built with Hive, including Datanucleus jars on classpath Path to Readable Wikipedia: file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines Path to Wikipedia Redirects: fakePathToRedirect/file.nt Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki 2015-06-13 10:15:35 INFO SecurityManager:59 - Changing view acls to: root 2015-06-13 10:15:35 INFO SecurityManager:59 - Changing modify acls to: root 2015-06-13 10:15:35 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 2015-06-13 10:15:36 INFO Slf4jLogger:80 - Slf4jLogger started 2015-06-13 10:15:36 INFO Remoting:74 - Starting remoting 2015-06-13 10:15:36 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkDriver@res1cwang-m1.home:58941] 2015-06-13 10:15:36 INFO Utils:59 - Successfully started service 'sparkDriver' on port 58941. 2015-06-13 10:15:36 INFO SparkEnv:59 - Registering MapOutputTracker 2015-06-13 10:15:36 INFO SparkEnv:59 - Registering BlockManagerMaster 2015-06-13 10:15:36 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613101536-86b2 2015-06-13 10:15:36 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB 2015-06-13 10:15:36 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2015-06-13 10:15:36 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-914594ab-f207-47cf-b092-1bb988a1cd0c 2015-06-13 10:15:36 INFO HttpServer:59 - Starting HTTP Server 2015-06-13 10:15:37 INFO Server:272 - jetty-8.y.z-SNAPSHOT 2015-06-13 10:15:37 INFO AbstractConnector:338 - Started SocketConnector@0.0.0.0:58942 2015-06-13 10:15:37 INFO Utils:59 - Successfully started service 'HTTP file server' on port 58942. 2015-06-13 10:15:37 INFO Server:272 - jetty-8.y.z-SNAPSHOT 2015-06-13 10:15:37 INFO AbstractConnector:338 - Started SelectChannelConnector@0.0.0.0:4040 2015-06-13 10:15:37 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040. 
2015-06-13 10:15:37 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040 2015-06-13 10:15:37 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:58942/jars/wiki2vec-assembly-1.0.jar with timestamp 1434204937819 2015-06-13 10:15:37 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@res1cwang-m1.home:58941/user/HeartbeatReceiver 2015-06-13 10:15:37 INFO NettyBlockTransferService:59 - Server created on 58943 2015-06-13 10:15:37 INFO BlockManagerMaster:59 - Trying to register BlockManager 2015-06-13 10:15:37 INFO BlockManagerMasterActor:59 - Registering block manager localhost:58943 with 265.1 MB RAM, BlockManagerId(, localhost, 58943) 2015-06-13 10:15:37 INFO BlockManagerMaster:59 - Registered BlockManager java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.(FileInputStream.java:138) at scala.io.Source$.fromFile(Source.scala:90) at scala.io.Source$.fromFile(Source.scala:75) at scala.io.Source$.fromFile(Source.scala:53) at scala.io.Source$.fromFile(Source.scala:59) at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58) at org.idio.wikipedia.redirects.MapRedirectStore.(RedirectStore.scala:34) at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172) at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) using empty redirect store.. 
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440 2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB) 2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440 2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB) 2015-06-13 10:15:38 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:58943 (size: 22.2 KB, free: 265.1 MB) 2015-06-13 10:15:38 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0 2015-06-13 10:15:38 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25 2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440 2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB) 2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440 2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB) 2015-06-13 10:15:38 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:58943 (size: 86.0 B, free: 265.1 MB) 2015-06-13 10:15:38 INFO BlockManagerMaster:59 - Updated info of block broadcast_1piece0 2015-06-13 10:15:38 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30 2015-06-13 10:15:38 INFO deprecation:1009 - mapred.tip.id is deprecated. Instead, use mapreduce.task.id 2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 2015-06-13 10:15:38 INFO deprecation:1009 - mapred.job.id is deprecated. 
Instead, use mapreduce.job.id Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1074) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164) at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139) at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185) at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 88: /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus: No such file or directory
^**^ corpus : /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus
```

There is no data folder in my wiki2vec-master folder. Could you please advise how I can fix it?

Thanks in advance!

dav009 commented 9 years ago

It seems it is trying to look for the output here: file://data/enwiki-latest.lines. Could you try giving it a complete path, i.e. /User/aaa/data/?

dav009 commented 9 years ago

I just tried it on a Mac; it seems you have to:

munichong commented 9 years ago

Thanks for your reply.

I created a data folder and made sure /User/aaa/data exists, then ran it again. I still received an exception, as below:

```
RES1CWANG-M1:wiki2vec-master cwang$ sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data

prepare.sh: line 23: 2: No such file or directory
Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
1 2 3 ...... 195 196 197
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 18002
	at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.getAndMoveToFrontDecode(BZip2CompressorInputStream.java:641)
	at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:326)
	at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:884)
	at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
	at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
	at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
	at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
	at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
	at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
	at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
......
......
2015-06-14 18:07:31 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at scala.io.Source$.fromFile(Source.scala:90)
	at scala.io.Source$.fromFile(Source.scala:75)
	at scala.io.Source$.fromFile(Source.scala:53)
	at scala.io.Source$.fromFile(Source.scala:59)
	at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
	at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
	at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
	at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
......
```

dav009 commented 9 years ago

Mm, giving it a try right now with the English Wikipedia. But it seems from the error that the downloaded wiki dump could be corrupted.

dav009 commented 9 years ago

Gave it a try with en_US and it went alright. I wonder if you could check the MD5 of the downloaded Wikipedia dump.
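
Something like this is enough to check it (the dump filename is just an example; compare the printed digest against the checksum published alongside the dump on dumps.wikimedia.org):

```python
import hashlib

# Hash the multi-GB dump in 1 MB chunks so it never has to fit in memory.
md5 = hashlib.md5()
with open("data/enwiki-latest-pages-articles-multistream.xml.bz2", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print(md5.hexdigest())
```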

munichong commented 9 years ago

I was using the latest English dump, and I didn't modify it after downloading. I have now downloaded the English dump generated on 20141208 instead, and it seems to work, although it is still running. I think you are right that the latest dumps have some problems.

BTW, is the pre-built model based on the full wiki pages? What is the difference between the pre-built model and the models built by "quick usage"?


dav009 commented 9 years ago

There is no difference.

munichong commented 9 years ago

What is the minimum word-frequency threshold of the pre-built model?

dav009 commented 9 years ago

50; this is due to a limitation in the gensim word2vec implementation. This affects the number of entities that you get in the vectors.
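
For reference, this is roughly where that threshold lives when training with gensim; the paths and other parameters here are illustrative, not the exact settings used for the pre-built models:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the wiki2vec corpus from disk one line at a time.
sentences = LineSentence("data/enwiki.corpus")

# min_count=50 mirrors the pre-built models: any token (plain word or
# DBPEDIA_ID/... entity) seen fewer than 50 times is dropped from the
# vocabulary, which is why rarer entities never get a vector.
model = Word2Vec(sentences, min_count=50, workers=4)
model.save("data/en.model")
```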

munichong commented 9 years ago

OK. I plan to set it to 5.

dav009 commented 9 years ago

Most likely you will need lots of memory (probably more than 60GB). How much RAM do you have?

Also, are you setting it to 5 because you want more coverage of entities?

munichong commented 9 years ago

Oh, OK... I only have 16GB. I just tried the pre-built model. It does not cover words like "verisign", which actually has a wiki page, and the word occurs 86 times on that page. A little strange. Did I miss something? https://en.wikipedia.org/wiki/Verisign

dav009 commented 9 years ago

Just had a check on the models and found the following vectors:

munichong commented 9 years ago

Oh, I think it is a case-sensitivity thing:

```
>>> model.similarity('Verisign', 'google')
0.079122956488234003
>>> model.similarity('verisign', 'google')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cwang/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 783, in similarity
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
  File "/Users/cwang/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 763, in __getitem__
    return self.syn0[self.vocab[word].index]
KeyError: 'verisign'
```

In my application, all words are in lower case. Is there any way that I can have a case-insensitive model?

dav009 commented 9 years ago

Out of the box there is no such option. There is an option for outputting stems.

It should be easy, though: you just need to make the text in enwiki.corpus lowercase.
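
A minimal sketch of that preprocessing step (file names are placeholders; note this also lowercases the DBPEDIA_ID/... entity tokens, so lookups against the resulting model must use lowercase keys too):

```python
# Lowercase the corpus line by line so the multi-GB file is streamed,
# then point the training step at the lowercased copy.
with open("data/enwiki.corpus") as src, open("data/enwiki.lower.corpus", "w") as dst:
    for line in src:
        dst.write(line.lower())
```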

dav009 commented 9 years ago

@munichong I was using the most recent dump until yesterday (enwiki-20150403-pages-articles-multistream.xml.bz2) for another task, and it is definitely corrupted: there is not even an MD5 hash for it, and there is a failure message on the dump download page.

Today they published June's dump, which seems to be sane.

munichong commented 9 years ago

I have made the text lowercase and trained a Word2Vec model based on an old dump. It looks fine. Thanks for your help!

BTW, one parameter of wiki2vec.sh stated in the readme is a little misleading. I think "PathToOutputFile" would be more appropriate than "PathToOutputFolder"; I received "IOError: [Errno 21] Is a directory" the first time.

dav009 commented 9 years ago

Feel free to make a PR changing the confusing parts of the readme.