Open Aditi138 opened 7 years ago
I have the same issue, have you solved it?
gonna check the format of the dump
I found that the problem comes from the download step: curl -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2" only gives me a 186-byte file, which is wrong. I changed it to curl -L -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2" and now I can download the correct 14 GB file, and the problem is resolved.
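In case it helps anyone else hitting this: the 186-byte file is presumably the body of a redirect response rather than the dump itself, which would explain why the reader later complains that the stream is not in BZip2 format. Below is a minimal sketch of a guard the download step could use. The function names is_bzip2 and download_dump are hypothetical and not part of prepare.sh; the only assumption about the data is that a valid bzip2 file starts with the bytes "BZh".

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of prepare.sh): succeeds only if the file
# exists and starts with the bzip2 magic bytes "BZh".
is_bzip2() {
  [ "$(head -c 3 "$1" 2>/dev/null)" = "BZh" ]
}

# Sketch of a guarded download. LANGUAGE is the same variable used in the
# curl command quoted above.
download_dump() {
  local dump="${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2"
  # -L follows HTTP redirects; -f makes curl exit non-zero on HTTP errors
  # instead of saving the error page.
  curl -f -L -O "http://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${dump}" || return 1
  if ! is_bzip2 "$dump"; then
    echo "Downloaded ${dump} is not a bzip2 file ($(wc -c < "$dump") bytes)" >&2
    return 1
  fi
}
```

With something like this, a bad download would stop the script right away instead of producing an empty corpus further down the pipeline.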
@jind11 can you send a PR?
@tgalery sure, I also have several other bug fixes for the prepare.sh file. I will upload the PR in the next couple of days.
Hi,
I was running the prepare.sh file for en-US and it's throwing the following exception, which leaves the generated corpus empty. Can you please suggest an alternative solution?
Exception in thread "main" java.io.IOException: Stream is not in the BZip2 format
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:255)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.<init>(BZip2CompressorInputStream.java:138)
at org.idio.wikipedia.dumps.ReadableWiki.getWikipediaStream(ReadableWiki.scala:19)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:31)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)