idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Issue creating corpus #32

Open RishabGargeya opened 7 years ago

RishabGargeya commented 7 years ago

I'm getting this error when creating the corpus:

[info] Assembly up to date: /home/rg203/work/scripts/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar
[success] Total time: 2 s, completed Jan 5, 2017 7:29:26 AM
Creating Readable Wiki..
Exception in thread "main" java.io.IOException: Stream is not in the BZip2 format
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:255)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.<init>(BZip2CompressorInputStream.java:138)
    at org.idio.wikipedia.dumps.ReadableWiki.getWikipediaStream(ReadableWiki.scala:19)
    at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:31)
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 113: [: : integer expression expected
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 187: /usr/lib/jvm/java-8-oracle/jre/bin/java/bin/java: Not a directory
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 187: exec: /usr/lib/jvm/java-8-oracle/jre/bin/java/bin/java: cannot execute: Not a directory
Joining corpus..
cat: 'part*': No such file or directory
 ^___^ corpus : /home/rg203/work/scripts/wiki2vec/spanish_output//eswiki.corpus

Any ideas? Thanks for the help!

keynmol commented 7 years ago

We'd need more info to debug that. Are you sure you're giving it a .bz2-compressed Wikipedia dump?
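
One way to answer that question before rerunning the pipeline is to look at the first bytes of the dump: a bzip2 stream always starts with the three bytes `BZh`. The sketch below does that. It also checks a `JAVA_HOME` value, since the `.../jre/bin/java/bin/java: Not a directory` lines in the log suggest `JAVA_HOME` may be pointing at the `java` executable rather than at an installation directory. That reading of the log is an inference, not something confirmed in this thread. The function names `looks_like_bzip2` and `java_home_problem` are made up for this example and are not part of wiki2vec.

```python
import os

BZIP2_MAGIC = b"BZh"  # every bzip2 stream starts with these three bytes


def looks_like_bzip2(path):
    """Return True if the file at `path` starts with the bzip2 magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(BZIP2_MAGIC)) == BZIP2_MAGIC


def java_home_problem(java_home):
    """Return a short description of what looks wrong with a JAVA_HOME value.

    Returns None when the value is empty (nothing to check) or when it is a
    directory that contains bin/java.
    """
    if not java_home:
        return None
    if not os.path.isdir(java_home):
        return "JAVA_HOME is not a directory: " + java_home
    if not os.path.isfile(os.path.join(java_home, "bin", "java")):
        return "no bin/java under JAVA_HOME: " + java_home
    return None
```

Calling `looks_like_bzip2` on the dump path and `java_home_problem(os.environ.get("JAVA_HOME", ""))` would show which of the two conditions holds on the affected machine. A dump that was already decompressed, or a download that returned an HTML error page, would fail the first check even if the file name ends in `.bz2`.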