Closed munichong closed 9 years ago
It seems it is trying to write the output here: file://data/enwiki-latest.lines. Could you try giving it a complete path, i.e. /User/aaa/data?
I just tried it on a Mac; it seems you have to pass /Users/aa/data (making sure the data folder exists).

Thanks for your reply.
I have created a data folder and made sure /User/aaa/data exists. Then I ran it again and still received an exception, as below:
RES1CWANG-M1:wiki2vec-master cwang$ sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data
prepare.sh: line 23: 2: No such file or directory
Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
1
2
3
......
195
196
197
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 18002
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.getAndMoveToFrontDecode(BZip2CompressorInputStream.java:641)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:326)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:884)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
......
......
2015-06-14 18:07:31 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
...
Mm, giving it a try right now with the English Wikipedia. But it seems from the error that the downloaded wiki dump could be corrupted.
Gave it a try with en_US and it went alright. I wonder if you could check the md5 of the downloaded Wikipedia dump.
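For what it's worth, one way to compute the dump's md5 for comparison against the hashes Wikimedia publishes alongside each dump (the file name below is just an example) is to hash it in chunks so the multi-GB file never has to fit in memory:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MB chunks and return its md5 hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the value listed in the dump's md5sums file, e.g.:
# md5_of_file("enwiki-latest-pages-articles-multistream.xml.bz2")
```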
I was using the latest English dump. I didn't modify it after it was downloaded. I just downloaded the English dump generated on 20141208, and it seems to work, although it is still running. I think you are right that the latest dumps have some problems.
BTW, is the pre-built model built based on full wiki pages? What is the difference between the pre-built model and the models built by "quick usage"?
There is no difference.
What is the minimum word-frequency threshold of the prebuilt model?
50; this is due to a limitation in the gensim word2vec implementation. It affects the number of entities that you get in the vectors.
OK. I plan to set it to 5.
Most likely you will need lots of memory (probably more than 60GB). How much RAM do you have?
Also, are you setting it to 5 because you want more coverage of entities?
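To illustrate why the threshold changes entity coverage (a rough sketch of the idea, not gensim's actual code): word2vec-style training drops every token seen fewer than `min_count` times before building vectors, so low-frequency entity tokens simply never get a vector.

```python
from collections import Counter

def build_vocab(sentences, min_count=50):
    """Mimic word2vec-style vocabulary pruning: keep only tokens
    that appear at least min_count times across the corpus."""
    counts = Counter(token for sent in sentences for token in sent)
    return {tok: n for tok, n in counts.items() if n >= min_count}

corpus = [["apple", "banana"], ["apple", "cherry"], ["apple"]]
print(build_vocab(corpus, min_count=2))  # → {'apple': 3}
```

Lowering `min_count` keeps rarer tokens, at the cost of a much larger vocabulary (hence the memory concern above).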
Oh, OK... I only have 16GB... I just tried the prebuilt model. It does not cover words like "verisign", which actually has a wiki page where the word occurs 86 times. A little strange. Did I miss something? https://en.wikipedia.org/wiki/Verisign
Just had a check on the models and found the following vectors:
DBPEDIA_ID/Verisign (topic)
Verisign (word)

Oh. I think it is the case-sensitivity thing:
>>> model.similarity('Verisign', 'google')
0.079122956488234003
>>> model.similarity('verisign', 'google')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cwang/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 783, in similarity
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
  File "/Users/cwang/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 763, in __getitem__
    return self.syn0[self.vocab[word].index]
KeyError: 'verisign'
In my application, all words are in lower case. Is there any way that I can have a case-insensitive model?
Out of the box there is no option; there is an option for outputting stems. It should be easy, though: you just need to make the text in enwiki.corpus lowercase.
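A minimal sketch of that preprocessing step, assuming enwiki.corpus is plain text with one sentence per line (the destination file name is made up; streaming line by line avoids loading the whole corpus):

```python
def lowercase_corpus(src_path, dst_path):
    """Write a lowercased copy of the corpus, one line at a time.
    Note this also lowercases the DBPEDIA_ID/... entity tokens, so
    lookups against the trained model must be lowercased too."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.lower())

# lowercase_corpus("enwiki.corpus", "enwiki.lower.corpus")
```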
@munichong I was using the most recent dump until yesterday (enwiki-20150403-pages-articles-multistream.xml.bz2) for another task and it is definitely corrupted; there is not even an md5 hash for it, and there is a failure message on the dump download page.
Today they published June's dump, which seems to be sane.
I have made the text lowercase and trained a Word2Vec model based on an old dump. It looks fine. Thanks for your help!
BTW, one parameter of wiki2vec.sh described in the readme is a little misleading. I think "PathToOutputFile" is more appropriate than "PathToOutputFolder"; I received "IOError: [Errno 21] Is a directory" the first time.
Feel free to make a PR changing the confusing parts of the readme
Hi, I am trying to run prepare.sh. I am using a Mac.
This is how I run it: sudo sh prepare.sh en_US data/

Downloading the wiki dump and installing packages, e.g. Hadoop and Spark, all go fine. Compiling wiki2vec also prints a lot of "SUCCESSFUL". However, I start to receive exceptions when the program tries to create the readable wiki:

Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file://data//enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 00:37:07 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 00:37:07 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 00:37:07 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 00:37:08 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 00:37:08 INFO Remoting:74 - Starting remoting
2015-06-13 00:37:09 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkDriver@res1cwang-m1.home:53495]
2015-06-13 00:37:09 INFO Utils:59 - Successfully started service 'sparkDriver' on port 53495.
2015-06-13 00:37:09 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 00:37:09 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 00:37:09 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613003709-d48b
2015-06-13 00:37:10 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 00:37:12 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 00:37:13 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-b6dd5609-bb7d-4b8c-974c-272f3c32fd76
2015-06-13 00:37:13 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 00:37:13 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 00:37:13 INFO AbstractConnector:338 - Started SocketConnector@0.0.0.0:53496
2015-06-13 00:37:13 INFO Utils:59 - Successfully started service 'HTTP file server' on port 53496.
2015-06-13 00:37:13 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 00:37:13 INFO AbstractConnector:338 - Started SelectChannelConnector@0.0.0.0:4040
2015-06-13 00:37:13 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 00:37:13 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 00:37:15 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:53496/jars/wiki2vec-assembly-1.0.jar with timestamp 1434170235402
2015-06-13 00:37:16 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@res1cwang-m1.home:53495/user/HeartbeatReceiver
2015-06-13 00:37:16 INFO NettyBlockTransferService:59 - Server created on 53497
2015-06-13 00:37:16 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 00:37:16 INFO BlockManagerMasterActor:59 - Registering block manager localhost:53497 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 53497)
2015-06-13 00:37:16 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 00:37:17 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:53497 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 00:37:17 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 00:37:17 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 00:37:17 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:53497 (size: 86.0 B, free: 265.1 MB)
2015-06-13 00:37:17 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 00:37:17 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.tip.id is deprecated. Instead, use mapreduce.task.id
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.job.id is deprecated. Instead, use mapreduce.job.id
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://data/enwiki-latest.lines, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1074)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 89: data//enwiki.corpus: No such file or directory
^___^ corpus : data//enwiki.corpus
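The "Wrong FS: file://data/enwiki-latest.lines, expected: file:///" exception in this run is the relative-path problem: in a file:// URI, whatever follows the two slashes is parsed as a host, so file://data/... means host "data" rather than a data/ directory. A valid local URI needs an absolute path, which is why file:/// works. A quick illustration of the correct shape, using Python's pathlib (not part of prepare.sh, just to show what the URI should look like):

```python
from pathlib import Path

rel = Path("data/enwiki-latest.lines")
# as_uri() requires an absolute path, so resolve() it first; the result
# starts with file:/// (empty host) followed by the absolute path.
print(rel.resolve().as_uri())
# e.g. file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
```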
I guessed the reason was that I passed a relative path, not an absolute path. So I commented out some lines in prepare.sh because the wiki dump and packages were already downloaded. Then I re-ran it like this: sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data

Below is what I got:

Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: /Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 10:14:03 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 10:14:03 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 10:14:03 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 10:14:03 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 10:14:03 INFO Remoting:74 - Starting remoting
2015-06-13 10:14:03 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkDriver@res1cwang-m1.home:58893]
2015-06-13 10:14:03 INFO Utils:59 - Successfully started service 'sparkDriver' on port 58893.
2015-06-13 10:14:03 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 10:14:03 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 10:14:03 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613101403-e43b
2015-06-13 10:14:03 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 10:14:04 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 10:14:04 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-846c5c00-63bc-4070-b1ef-903b6fcd3567
2015-06-13 10:14:04 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 10:14:04 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:14:04 INFO AbstractConnector:338 - Started SocketConnector@0.0.0.0:58894
2015-06-13 10:14:04 INFO Utils:59 - Successfully started service 'HTTP file server' on port 58894.
2015-06-13 10:14:04 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:14:04 INFO AbstractConnector:338 - Started SelectChannelConnector@0.0.0.0:4040
2015-06-13 10:14:04 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 10:14:04 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 10:14:05 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:58894/jars/wiki2vec-assembly-1.0.jar with timestamp 1434204845068
2015-06-13 10:14:05 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@res1cwang-m1.home:58893/user/HeartbeatReceiver
2015-06-13 10:14:05 INFO NettyBlockTransferService:59 - Server created on 58895
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 10:14:05 INFO BlockManagerMasterActor:59 - Registering block manager localhost:58895 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 58895)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 10:14:05 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:58895 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 10:14:05 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 10:14:05 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:58895 (size: 86.0 B, free: 265.1 MB)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 10:14:05 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/cwang/Downloads/wiki2vec-master/working/enwiki already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1041)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 89: /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus: No such file or directory
^**^ corpus : /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus
RES1CWANG-M1:wiki2vec-master cwang$ sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data
prepare.sh: line 23: 2: No such file or directory
Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: /Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 10:15:35 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 10:15:35 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 10:15:35 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 10:15:36 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 10:15:36 INFO Remoting:74 - Starting remoting
2015-06-13 10:15:36 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkDriver@res1cwang-m1.home:58941]
2015-06-13 10:15:36 INFO Utils:59 - Successfully started service 'sparkDriver' on port 58941.
2015-06-13 10:15:36 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 10:15:36 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 10:15:36 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613101536-86b2
2015-06-13 10:15:36 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 10:15:36 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 10:15:36 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-914594ab-f207-47cf-b092-1bb988a1cd0c
2015-06-13 10:15:36 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 10:15:37 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:15:37 INFO AbstractConnector:338 - Started SocketConnector@0.0.0.0:58942
2015-06-13 10:15:37 INFO Utils:59 - Successfully started service 'HTTP file server' on port 58942.
2015-06-13 10:15:37 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:15:37 INFO AbstractConnector:338 - Started SelectChannelConnector@0.0.0.0:4040
2015-06-13 10:15:37 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 10:15:37 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 10:15:37 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:58942/jars/wiki2vec-assembly-1.0.jar with timestamp 1434204937819
2015-06-13 10:15:37 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@res1cwang-m1.home:58941/user/HeartbeatReceiver
2015-06-13 10:15:37 INFO NettyBlockTransferService:59 - Server created on 58943
2015-06-13 10:15:37 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 10:15:37 INFO BlockManagerMasterActor:59 - Registering block manager localhost:58943 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 58943)
2015-06-13 10:15:37 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 10:15:38 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:58943 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 10:15:38 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 10:15:38 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 10:15:38 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:58943 (size: 86.0 B, free: 265.1 MB)
2015-06-13 10:15:38 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 10:15:38 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.tip.id is deprecated. Instead, use mapreduce.task.id
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.job.id is deprecated. Instead, use mapreduce.job.id
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1074)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 88: /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus: No such file or directory
^**^ corpus : /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus
There is no data folder inside my wiki2vec-master folder. Could you please advise how I can fix this?
Thanks in advance!
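Following the earlier suggestion in this thread (create the data folder yourself and pass an absolute path), a minimal pre-flight check might look like the sketch below. The `WIKI2VEC_HOME` location is an example for this machine, not a value the script hard-codes:

```shell
#!/bin/sh
# Sketch: make sure the data folder exists before re-running prepare.sh.
# WIKI2VEC_HOME is an assumed example path; override it for your setup.
WIKI2VEC_HOME="${WIKI2VEC_HOME:-$HOME/Downloads/wiki2vec-master}"
DATA_DIR="$WIKI2VEC_HOME/data"

# Create the folder the script expects to read from and write into.
mkdir -p "$DATA_DIR"

# Fail loudly if it still is not there (e.g. a permissions problem).
if [ ! -d "$DATA_DIR" ]; then
    echo "data folder still missing: $DATA_DIR" >&2
    exit 1
fi
echo "data folder ready: $DATA_DIR"
```

After this, re-running with the absolute path (e.g. `sudo sh prepare.sh en_US "$DATA_DIR"`) should at least get past the missing-folder errors; the earlier `ArrayIndexOutOfBoundsException` from BZip2CompressorInputStream may separately indicate a corrupt or truncated dump download.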