mirkonasato / graphipedia

Creates a Neo4j graph of Wikipedia links.

Fewer nodes created than expected for the 18 GB Wikipedia data dump #2

Closed: Atreyee closed this issue 10 years ago

Atreyee commented 10 years ago

When I run the importer on the dump at http://dumps.wikimedia.org/enwiki/20130805/ (enwiki-20130805-pages-meta-current.xml.bz2, 18.4 GB), I get the same number of nodes and relationships as with the enwiki-20130805-pages-articles-multistream.xml.bz2 dump (10.1 GB): 10,310,502 nodes and 96,711,488 relationships. The importer also reports a lot of broken links. However, according to the Wikipedia statistics at http://en.wikipedia.org/wiki/Wikipedia:Statistics there should be 31,728,583 pages, i.e. 31,728,583 nodes. As far as I know, the 10 GB dump contains only articles, without talk or user pages, while the 18.4 GB dump contains all pages. Is something in the parsing logic ignoring the extra data items in the 18 GB dump?
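One way to check whether the extra pages in pages-meta-current are simply non-article namespaces (Talk, User, and so on) rather than pages being dropped is to count pages per `<ns>` value in the dump. The following is a minimal diagnostic sketch, not part of Graphipedia, and it assumes the standard MediaWiki export schema where each `<page>` element carries a namespace-qualified `<ns>` child:

```python
# Diagnostic sketch (not part of Graphipedia): count <page> elements per
# namespace in a MediaWiki XML dump, to see whether the extra pages in
# pages-meta-current are non-article namespaces rather than pages being
# silently dropped by the importer.
import bz2
import sys
from collections import Counter
from xml.etree.ElementTree import iterparse

def count_namespaces(dump_path):
    counts = Counter()
    with bz2.open(dump_path, "rb") as stream:
        context = iterparse(stream, events=("start", "end"))
        _, root = next(context)  # grab the root element so it can be cleared
        for event, elem in context:
            if event != "end":
                continue
            if elem.tag.endswith("}ns"):     # each <page> has one <ns> child
                counts[elem.text] += 1
            elif elem.tag.endswith("}page"):
                root.clear()                 # keep memory flat on an 18 GB dump
    return counts

if __name__ == "__main__":
    for ns, n in count_namespaces(sys.argv[1]).most_common():
        print("ns", ns, ":", n, "pages")
```

Namespace 0 is the article namespace; if the counts show roughly 10 million pages in namespace 0 and the remainder spread across Talk, User, etc., the identical node counts would be consistent with an importer that only creates nodes for articles.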

mirkonasato commented 10 years ago

The tool was written to import pages-articles.xml.bz2; I'm afraid I've never even looked at the pages-meta-current and pages-articles-multistream dumps you mention.

Atreyee commented 10 years ago

Alright. I'll probably take a look and try to make it generic for those dumps as well.
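For what it's worth, in the MediaWiki export format the distinction is carried by the `<ns>` element of each `<page>` (0 for articles, 1 for Talk, 2 for User, 14 for Category, and so on), so making the importer generic would presumably come down to making the namespace filter configurable. A rough illustration of that idea, with hypothetical names and in Python rather than Graphipedia's actual Java code:

```python
# Hypothetical sketch of a configurable namespace filter; names are illustrative
# and not taken from Graphipedia's source.
ARTICLES_ONLY = {"0"}      # namespace 0: main/article pages only
ALL_PAGES = None           # no filtering: import every page in the dump

def keep_page(ns_text, allowed=ARTICLES_ONLY):
    """Decide whether a <page> with this <ns> value should become a node."""
    return allowed is None or ns_text in allowed
```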

SoulFireMage commented 10 years ago

I end up with a 7.8 GB graph database, but the query results are a bit raw: just numbers on the nodes, really. I'm only just beginning with Neo4j; is there a way to make the queries return more interesting results?
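The "numbers" are most likely internal node ids; returning a property instead of the whole node gives readable output. A hedged example using Cypher via the official Neo4j Python driver, where the `title` property, the `Link` relationship type, and the connection details are assumptions to adjust to your own import:

```python
# Sketch: return page titles instead of bare node ids. The relationship type
# "Link", the "title" property, and the connection details are assumptions;
# adjust them to match your import.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (page {title: $title})-[:Link]->(linked)
RETURN linked.title AS linked_title
LIMIT 25
"""

with driver.session() as session:
    for record in session.run(query, title="Neo4j"):
        print(record["linked_title"])

driver.close()
```

The same Cypher (with a literal title in place of the `$title` parameter) should also work directly in the Neo4j browser.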

SoulFireMage commented 10 years ago

Performance on a high-spec 32 GB laptop is very heavy on memory if you decide to put the whole internet on it... well, all of Wikipedia anyway!

mirkonasato commented 10 years ago

The last two comments seem completely unrelated to the original issue.