Closed: Atreyee closed this issue 10 years ago.
The tool was written to import pages-articles.xml.bz2; I'm afraid I've never even looked at the pages-meta-current and pages-articles-multistream dumps you mention.
Alright. I'll probably take a look and try to make it generic enough to handle those dumps as well.
I end up with a 7.8 GB graph database, but the query results are a bit raw: really just numbers on the nodes. I'm only just beginning with Neo4j; is there a way to make the queries return more interesting results?
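In case it helps, here is a minimal sketch of the kind of query that returns page titles instead of bare node IDs. It assumes the importer stored a `title` property on each page node and used a `Link` relationship type (both names are guesses, so check your own schema), and it uses the official Neo4j Python driver against a local instance:

```python
from neo4j import GraphDatabase

# Hypothetical connection details; adjust the URI and credentials for your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumes page nodes carry a `title` property and links use a `Link`
# relationship type; verify both against your imported schema.
QUERY = """
MATCH (p)-[:Link]->(q)
WHERE p.title = $title
RETURN q.title AS linked_page
LIMIT 25
"""

with driver.session() as session:
    for record in session.run(QUERY, title="Neo4j"):
        print(record["linked_page"])

driver.close()
```

If the nodes really only carry numeric identifiers, the titles would have to be attached during import (or in a second pass over the dump) before a query like this can return anything readable.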
Performance on a high-spec 32 GB laptop is very heavy on memory if you decide to put the whole internet on it... well, all of Wikipedia anyway!
The last two comments seem completely unrelated to the original issue.
When I use the dump at http://dumps.wikimedia.org/enwiki/20130805/ (enwiki-20130805-pages-meta-current.xml.bz2), which is 18.4 GB in size, with the importer, I get the same number of nodes and relationships as with the enwiki-20130805-pages-articles-multistream.xml.bz2 dump, which is 10.1 GB in size: 10,310,502 nodes and 96,711,488 relationships. The importer tool reports a lot of broken links. But according to the Wikipedia statistics at http://en.wikipedia.org/wiki/Wikipedia:Statistics there should be 31,728,583 pages, i.e. 31,728,583 nodes. From what I know, the 10.1 GB dataset is basically just articles without talk or user pages, while the 18.4 GB dataset is all pages. Is there something in the parsing logic that is ignoring those extra pages in the 18.4 GB dataset?
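For what it's worth, identical counts from both dumps are exactly what you would see if the parser keeps only pages in the main (article) namespace, i.e. pages whose `<ns>` element is 0, since the extra ~20 million pages in pages-meta-current are talk, user, and other non-article namespaces. I haven't checked the importer's source, but a namespace filter in a streaming parse typically looks something like this sketch (file name and the export schema URI are just examples; check the `<mediawiki>` root element of your dump):

```python
import bz2
import xml.etree.ElementTree as ET

# The export XML namespace varies by schema version; adjust it to whatever
# the <mediawiki> root element of your dump declares.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"

def count_article_pages(path):
    """Count pages in the main namespace (<ns>0</ns>), skipping talk, user, etc."""
    kept = skipped = 0
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                if elem.findtext(NS + "ns") == "0":
                    kept += 1
                else:
                    skipped += 1
                elem.clear()  # free the page subtree as we stream
    return kept, skipped

if __name__ == "__main__":
    kept, skipped = count_article_pages("enwiki-20130805-pages-meta-current.xml.bz2")
    print("article pages:", kept, "non-article pages skipped:", skipped)
```

If the importer does have a check like that, the 18.4 GB dump buys you nothing over pages-articles; it just takes longer to decompress and scan.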