ostastny closed this issue 9 years ago
It seems you are missing one of the arguments (the output file path). Try adding the extra argument, i.e.:
sudo java -Xmx10G -Xms10G -cp /datadrive/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki //datadrive/data/wiki-latest-pages-articles-multistream.xml.bz2 //datadrive/data/readablewiki.lines
CreateReadableWiki should not be running out of memory since it reads from a stream and outputs to a file.
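A minimal sketch of a wrapper that catches the missing argument before the JVM starts. The jar path and main class are taken from the command above; the function names `check_args` and `create_readable_wiki` and the messages are hypothetical and not part of wiki2vec.

```shell
#!/bin/sh
# Hypothetical wrapper: function names and messages are illustrative.
# Only the jar path and main class come from the command in this thread.
JAR=/datadrive/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar
MAIN=org.idio.wikipedia.dumps.CreateReadableWiki

# Returns 0 when exactly two arguments (input dump, output path) were
# supplied and the input dump exists; prints a hint and returns 1 otherwise.
check_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: <input .xml.bz2 dump> <output file>" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "input dump not found: $1" >&2
        return 1
    fi
    return 0
}

# Runs the conversion only after the arguments pass the check.
create_readable_wiki() {
    check_args "$@" || return 1
    java -Xmx10G -Xms10G -cp "$JAR" "$MAIN" "$1" "$2"
}
```

It would be called as `create_readable_wiki //datadrive/data/wiki-latest-pages-articles-multistream.xml.bz2 //datadrive/data/readablewiki.lines`. With only one argument it stops with the usage hint instead of starting Java.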
Yep, that was it! But now it runs for a minute or so and then crashes with the following:
Exception in thread "main" java.io.IOException: unexpected end of stream
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.getAndMoveToFrontDecode(BZip2CompressorInputStream.java:744)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:326)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:884)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
    at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
    at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Is it perhaps a corrupted download?
I will update the README, as it seems the output parameter is a bit confusing.
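One way to check whether the download is truncated without running the whole pipeline is to let `bzip2` test the archive. This is a sketch, assuming the `bzip2` command-line tool is installed; the function name `is_valid_bz2` is made up for illustration.

```shell
# Returns 0 if the archive decompresses cleanly from start to end,
# non-zero otherwise. "bzip2 -t" only tests integrity and writes no output.
is_valid_bz2() {
    bzip2 -t "$1" 2>/dev/null
}

# Example use (path taken from the command earlier in this thread):
# if is_valid_bz2 //datadrive/data/wiki-latest-pages-articles-multistream.xml.bz2; then
#     echo "archive looks intact"
# else
#     echo "archive is damaged or incomplete, re-download it"
# fi
```

On a multi-gigabyte dump the test reads the entire file, so it can take a while.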
I am trying the English wiki dump. I am re-downloading it now, will let you know if the problem persists. I will close this issue as the original problem was indeed solved. Thanks again.
We have generated some big CBOW models (1000 dimensions) for various languages, and we plan to share them soon via torrents, in case you are interested.
Definitely interested! Keep me posted!
Ondrej
@ostastny check the README for instructions on how to get the English model via torrent.
@dav009 Hi,
I am also trying to create a model from the German Wikipedia (dewiki), but failing. Could you be so kind as to provide a torrent file for the German dump as well?
Thanks, Greets.
@Stamenov could you please create an issue for it? It would be good to know what is failing.
I will try to create an updated German model.
Well, I am using a Mac and the setup is not optimized for it. The point where I failed was finding the right class path for the main class with the java command. I did not get past that. I cannot say it is an issue, but rather me giving in to my frustration at setting everything up manually. A manual for macOS would be great. Greets.
@Stamenov prepare.sh definitely assumes an Ubuntu/Debian system.
I added a torrent for an old German model (Feb 2015). Please try downloading it, and seed it for a while if possible.
Thanks for the prompt response! I started downloading it, looks good. Thanks again, Greets!
@dav009 Is the file correct at only 1.5 GB? The initial Wikipedia dump was around 4 GB, and I am getting strange results. Greets
Were you able to uncompress it? The German model is smaller than the English one.
What sort of strange results are you getting?
@ostastny I would like to know if the torrent is corrupted, so that I can re-create it.
It would be good to get some feedback on the error you are getting.
I think the torrent is OK, since it all works and I was able to uncompress it.
Hi,
I am trying out your solution but it keeps on failing when I try to execute
sudo java -Xmx10G -Xms10G -cp /datadrive/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki //datadrive/data/wiki-latest-pages-articles-multistream.xml.bz2
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:54)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 7158628352 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
My machine has 14 GB of RAM, runs Ubuntu 14.04 LTS, and the Java version is 1.7.0_76. I tried playing with the -Xmx and -Xms arguments, and running in 64-bit mode with -d64, but all to no avail.
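The failed allocation above is about 6.7 GiB, which suggests the machine could not commit the heap requested with `-Xms10G` at that moment. Since CreateReadableWiki reads from a stream, a smaller heap may be enough. The sketch below reads the memory currently available and proposes a heap size; the helper names `available_mb` and `suggest_heap_mb` and the "half of what is available" rule are assumptions for illustration, not a wiki2vec recommendation.

```shell
# Prints available memory in MiB from a /proc/meminfo-style file
# (defaults to /proc/meminfo). Uses MemAvailable when the kernel reports
# it and falls back to MemFree on older kernels.
available_mb() {
    awk '/^MemAvailable:/ {a = $2}
         /^MemFree:/      {f = $2}
         END { v = (a != "" ? a : f); printf "%d\n", v / 1024 }' "${1:-/proc/meminfo}"
}

# Prints a heap size in MiB: half of the given amount, as a
# deliberately conservative and arbitrary choice.
suggest_heap_mb() {
    echo $(( $1 / 2 ))
}

# Example use:
# HEAP=$(suggest_heap_mb "$(available_mb)")
# java -Xmx"${HEAP}m" -cp <jar> org.idio.wikipedia.dumps.CreateReadableWiki <input> <output>
```

Leaving out `-Xms` lets the JVM start with a small heap and grow it up to the `-Xmx` limit as needed. Note also that the `ArrayIndexOutOfBoundsException: 1` above matches the missing output file path discussed at the top of this thread.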