hut8 / wikiwalk

Shortest path algorithm between pages on Wikipedia
MIT License
3 stars 0 forks source link

wikiwalk

Shortest path algorithm between pages on Wikipedia. Let's say that you want to get from page Exploding animal to Cucumber. This thing will find the set of shortest paths between them, and return the path as a list of page titles.

Prior Art

Graph Database

There are two files that comprise the adjacency list database of edges:

vertex_al

Vertex Adjacency List

An array of records of edges in this format:

null        ⇒ 0 as u32
vertex      ⇒ u32
vertexes    ⇒ vertex+
edges       ⇒ outgoing_vertexes, null, incoming_vertexes, null

To determine which vertex each record belongs to, see below (vertex_al_ix)

vertex_al_ix

Vertex Adjacency List Index

An array of u32 indexed by vertex ID. Each u32 is the offset into vertex_al at which the edge data is located. So to load the record for the page with ID 1337:

Importing MySQL data for test

If you want to spot-check some data from MySQL, it's faster to import the dumps without indexes. First define the tables thusly:

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT 0,
  `pl_from_namespace` int(11) NOT NULL DEFAULT 0,
  `pl_target_id` bigint(20) unsigned NOT NULL
  ) ENGINE=InnoDB DEFAULT CHARSET=binary;

 CREATE TABLE `redirect` (
  `rd_from` int(8) unsigned NOT NULL DEFAULT 0,
  `rd_namespace` int(11) NOT NULL DEFAULT 0,
  `rd_title` varbinary(255) NOT NULL DEFAULT '',
  `rd_interwiki` varbinary(32) DEFAULT NULL,
  `rd_fragment` varbinary(255) DEFAULT NULL
  ) ENGINE=InnoDB DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;

 CREATE TABLE `page` (
  `page_id` int(8) unsigned NOT NULL,
  `page_namespace` int(11) NOT NULL DEFAULT 0,
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  `page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_random` double unsigned NOT NULL DEFAULT 0,
  `page_touched` binary(14) NOT NULL,
  `page_links_updated` varbinary(14) DEFAULT NULL,
  `page_latest` int(8) unsigned NOT NULL DEFAULT 0,
  `page_len` int(8) unsigned NOT NULL DEFAULT 0,
  `page_content_model` varbinary(32) DEFAULT NULL,
  `page_lang` varbinary(35) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;

CREATE TABLE `linktarget` (
  `lt_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `lt_namespace` int(11) NOT NULL,
  `lt_title` varbinary(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;

Then read in dumps, skipping DDL

pv enwiki-*-pagelinks.sql | tail +34 | mysql wiki
pv enwiki-*-page.sql | tail +45 | mysql wiki
pv enwiki-*-redirect.sql | tail +38 | mysql wiki
pv enwiki-*-linktarget.sql | tail +34 | mysql wiki

Then build indexes

create index page_title_ix on page (page_title);
create index page_title_id on page(page_id);
create index page_title_namespace on page(page_namespace);

To export for later:

select * from page into outfile '/tmp/wiki-page-dump';
select * from pagelinks into outfile '/tmp/wiki-pagelinks-dump';
select * from redirect into outfile '/tmp/wiki-redirect-dump';
select * from linktarget into outfile '/tmp/wiki-linktarget-dump';

Then compress and such:

sudo mv /tmp/wiki-*-dump ~/data
sudo chown $(id -u):$(id -g) ~/data/wiki-*-dump
zstd -T0 ~/data/wiki-*-dump

Then to import (assuming wiki-page-dump is on the server at some location):

LOAD DATA INFILE 'wiki-page-dump' INTO TABLE page;
LOAD DATA INFILE 'wiki-pagelinks-dump' INTO TABLE pagelinks;
LOAD DATA INFILE 'wiki-redirect-dump' INTO TABLE redirect;
LOAD DATA INFILE 'wiki-linktarget-dump' INTO TABLE linktarget;

SQLite Databases

There are two SQLite databases:

Queries

To find large paths:

SELECT
  sv.title AS 'source_title',
  tv.title AS 'target_title',
  coalesce(json_array_length(json_extract(path_data, '$[0]')),0) AS degrees
  FROM path p
  INNER JOIN vertexes sv ON p.source_page_id=sv.id
  INNER JOIN vertexes tv ON p.target_page_id=tv.id
  WHERE degrees >= 5
  ORDER BY degrees DESC
  LIMIT 20;

Database build process

The build process is a bit involved, so here's a high-level overview of how it's done in code.

Prior to July 1, 2024, linktarget was unnecessary, and the way to generate the list of links (from Page ID to Page ID) was: