idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

enwiki-latest-pages-articles-multistream.xml.bz2 not a valid bz2 #35

Open Aditi138 opened 7 years ago

Aditi138 commented 7 years ago

Hi,

I was running the prepare.sh file for en-US and it throws the following exception, which leaves the generated corpus empty. Can you please suggest an alternative solution?

Exception in thread "main" java.io.IOException: Stream is not in the BZip2 format
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:255)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.<init>(BZip2CompressorInputStream.java:138)
    at org.idio.wikipedia.dumps.ReadableWiki.getWikipediaStream(ReadableWiki.scala:19)
    at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:31)
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
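
For anyone hitting this error, a quick diagnostic sketch (assuming the dump was saved in the working directory under its default name): check whether the downloaded file is actually a bzip2 archive before re-running the rest of prepare.sh.

    # A healthy dump reports "bzip2 compressed data"; an HTML or redirect page does not
    file enwiki-latest-pages-articles-multistream.xml.bz2

    # A broken download is only a few hundred bytes; the real dump is many GB
    ls -lh enwiki-latest-pages-articles-multistream.xml.bz2

    # Optionally test the archive integrity (slow on the full dump)
    bzip2 -t enwiki-latest-pages-articles-multistream.xml.bz2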

jind11 commented 6 years ago

I have the same issue. Have you solved it?

dav009 commented 6 years ago

gonna check the format of the dump

jind11 commented 6 years ago

I found that the problem comes from the download step: curl -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2" only gives me a 186-byte file, which is wrong. After changing it to curl -L -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2", I can download the correct 14 GB file and the problem is resolved.
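
For reference, a sketch of how the corrected download line might look in prepare.sh (the surrounding script lines are assumed). The key point is curl's -L flag, which makes it follow HTTP redirects; without it, curl likely saves the small redirect response rather than the dump itself, and the later bzip2 step fails.

    # -L follows redirects so the multi-GB dump is fetched,
    # not the ~186-byte redirect response that breaks the BZip2 reader
    curl -L -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2"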

tgalery commented 6 years ago

@jind11 can you send a PR?

jind11 commented 6 years ago

@tgalery sure. I also have several other bug fixes for the prepare.sh file; I will upload the PR in the next couple of days.