idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Unable to process the wiki #1

Closed ostastny closed 9 years ago

ostastny commented 9 years ago

Hi,

I am trying out your solution but it keeps on failing when I try to execute

sudo java -Xmx10G -Xms10G -cp /datadrive/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki //datadrive/data/wiki-latest-pages-articles-multistream.xml.bz2

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:54)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 7158628352 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full

My machine has 14 GB of RAM, runs Ubuntu 14.04 LTS, and the Java version is 1.7.0_76. I tried playing with the -Xmx and -Xms arguments and running in 64-bit mode with -d64, but all to no avail.

dav009 commented 9 years ago

It seems you are missing one of the arguments (the output file path); try adding the extra argument, i.e.:

sudo java -Xmx10G -Xms10G -cp /datadrive/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki //datadrive/data/wiki-latest-pages-articles-multistream.xml.bz2 //datadrive/data/readablewiki.lines


CreateReadableWiki should not be running out of memory since it reads from a stream and outputs to a file.
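
That would also explain the ArrayIndexOutOfBoundsException: 1 in the first trace: if the main method indexes args positionally, a missing second argument fails at exactly that index. Below is a minimal sketch of that kind of argument handling (hypothetical names, not the actual ReadableWiki.scala source):

object CreateReadableWikiSketch {
  def main(args: Array[String]): Unit = {
    val inputDumpPath = args(0) // path to the pages-articles .xml.bz2 dump
    // With only one CLI argument, the next line throws
    // java.lang.ArrayIndexOutOfBoundsException: 1 -- the error reported above.
    val outputWikiPath = args(1) // where the readable wiki lines are written
    println(s"reading $inputDumpPath -> writing $outputWikiPath")
  }
}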

  1. Did you try running the automated script?
  2. Which Wikipedia language are you trying to process?
ostastny commented 9 years ago

Yep, that was it! But now it runs for a minute or so and then crashes with the following:

Exception in thread "main" java.io.IOException: unexpected end of stream
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.getAndMoveToFrontDecode(BZip2CompressorInputStream.java:744)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:326)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:884)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
    at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
    at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)

Is it perhaps a corrupted download?
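
For reference, one way to check for truncation before re-downloading is to compare checksums against the md5 list Wikimedia publishes alongside each dump, or to stream the whole file through the same commons-compress decompressor the tool uses. A minimal sketch of the latter (VerifyBz2Dump is a hypothetical helper, not part of wiki2vec):

import java.io.{BufferedInputStream, FileInputStream, IOException}
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

object VerifyBz2Dump {
  def main(args: Array[String]): Unit = {
    // true => keep reading concatenated bzip2 streams (the multistream dump)
    val in = new BZip2CompressorInputStream(
      new BufferedInputStream(new FileInputStream(args(0))), true)
    val buffer = new Array[Byte](1 << 20)
    var total = 0L
    try {
      var n = in.read(buffer)
      while (n != -1) { total += n; n = in.read(buffer) }
      println(s"OK: decompressed $total bytes")
    } catch {
      // A truncated download fails here with "unexpected end of stream"
      case e: IOException => println(s"Corrupt after $total bytes: ${e.getMessage}")
    } finally in.close()
  }
}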

dav009 commented 9 years ago

I will update the README, as it seems the output parameter is a bit confusing.

ostastny commented 9 years ago

I am trying the English wiki dump. I am re-downloading it now and will let you know if the problem persists. I will close this issue, as the original problem was indeed solved. Thanks again.

dav009 commented 9 years ago

We have generated some big CBOW models (1000 dimensions) for various languages, and we plan to share them soon via torrents in case you are interested.

ostastny commented 9 years ago

Definitely interested! Keep me posted!

Ondrej


dav009 commented 9 years ago

@ostastny check the README for instructions on how to get the English model via torrent.

Stamenov commented 8 years ago

@dav009 Hi,

I am also trying to create a model from the German Wikipedia (dewiki), but failing. Could you be so kind as to provide a torrent file for the German dump as well?

Thanks, Greets.

dav009 commented 8 years ago

@Stamenov could you please create an issue for it? It would be good to know what is failing.

I will try to create an updated German model.

Stamenov commented 8 years ago

Well, I am using a Mac, and the setup is not optimized for it. The point where I failed was finding the right classpath for the main class with the java command; I did not get past that. I cannot say it is an issue, but rather me giving in to my frustration setting everything up manually. A manual for macOS would be great. Greets.

dav009 commented 8 years ago

@Stamenov definitely; prepare.sh assumes an Ubuntu/Debian system.

I added a torrent for an old German model (Feb 2015). Please give downloading it a try, and seed it for a while if possible.

https://github.com/idio/wiki2vec/blob/master/torrents/dewiki-gensim-word2vec-300-nostem-10cbow.torrent

Stamenov commented 8 years ago

Thanks for the prompt response! I started downloading it; looks good. Thanks again, Greets!

Stamenov commented 8 years ago

@dav009 Is the file correct at only 1.5 GB? The initial Wikipedia dump was around 4 GB, and I am getting strange results. Greets

dav009 commented 8 years ago

Were you able to uncompress it? The German model is smaller than the English one.

What sort of strange results are you getting?

dav009 commented 8 years ago

@ostastny I would like to know whether the torrent is corrupted, so I can re-create it.

It would be good to get some feedback on the error you are getting.

Stamenov commented 8 years ago

I think the torrent is OK, since it all works and I was able to uncompress it.