mischkew / jwpl

Automatically exported from code.google.com/p/jwpl

New field in pagelinks table breaks pagelinkparser #128

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
In July 2014, a new field was added to the pagelinks table in the Wikipedia 
dumps. However, our pagelinks parser expects INSERT statements with three fields 
per tuple and cannot cope with the four fields we have now.

Upgrading to the new format is easy, but will break backward compatibility. We 
need to check the format of a given pagelinks.sql before parsing the file.
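
One possible way to check the format up front is to count the column definitions in the CREATE TABLE statement at the top of the dump. The following is only a sketch under that assumption, not the code from the actual fix; the class name PagelinksFormatProbe and the method countColumns are hypothetical, and it relies on the usual mysqldump layout with one backtick-quoted column definition per line.

```java
public class PagelinksFormatProbe {

    /**
     * Counts the column definitions in the CREATE TABLE `pagelinks` statement
     * contained in the given dump header. Returns -1 if no such statement
     * (or no closing parenthesis line) is found.
     *
     * Assumption: one definition per line, as mysqldump writes it. Column
     * lines start with a backtick; index lines start with KEY or UNIQUE KEY.
     */
    public static int countColumns(String dumpHeader) {
        boolean inCreate = false;
        int columns = 0;
        for (String line : dumpHeader.split("\n")) {
            String trimmed = line.trim();
            if (!inCreate) {
                if (trimmed.startsWith("CREATE TABLE `pagelinks`")) {
                    inCreate = true;
                }
            } else if (trimmed.startsWith(")")) {
                return columns; // end of the CREATE TABLE statement
            } else if (trimmed.startsWith("`")) {
                columns++;      // a column definition, not an index
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String header = String.join("\n",
            "DROP TABLE IF EXISTS `pagelinks`;",
            "CREATE TABLE `pagelinks` (",
            "  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',",
            "  `pl_namespace` int(11) NOT NULL DEFAULT '0',",
            "  `pl_title` varbinary(255) NOT NULL DEFAULT '',",
            "  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',",
            "  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`)",
            ") ENGINE=InnoDB DEFAULT CHARSET=binary;");
        System.out.println("columns: " + countColumns(header));
    }
}
```

In a real parser the header would be read from the first lines of pagelinks.sql instead of a string literal, and the result (3 or 4) would select the expected tuple width.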

Original issue reported on code.google.com by oliver.ferschke on 4 Aug 2014 at 3:57

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r784.

Original comment by oliver.ferschke on 5 Aug 2014 at 1:38

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 5 Aug 2014 at 2:56

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 11 Sep 2014 at 1:36

GoogleCodeExporter commented 9 years ago
More precisely, in July 2014 the layout of the pagelinks table changed 
from

DROP TABLE IF EXISTS `pagelinks`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
/*!40101 SET character_set_client = @saved_cs_client */;

to

DROP TABLE IF EXISTS `pagelinks`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
  KEY `pl_backlinks_namespace` (`pl_namespace`,`pl_title`,`pl_from_namespace`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
/*!40101 SET character_set_client = @saved_cs_client */;

As a result, each data tuple grew from 3 to 4 fields. 
The pagelinks parser now supports both formats, but it ignores the new field.
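
To illustrate what "supports both formats" can look like, here is a minimal sketch of a tuple reader that accepts either width and drops the fourth field. It is not the code from r784; the class PagelinksTupleSketch and its methods are hypothetical, the input is assumed to be the VALUES part of a single INSERT statement, and escape handling is simplified.

```java
import java.util.ArrayList;
import java.util.List;

public class PagelinksTupleSketch {

    /** Splits "(a,b,'c'),(d,e,'f',g)" into lists of raw field strings. */
    public static List<List<String>> parseTuples(String values) {
        List<List<String>> tuples = new ArrayList<>();
        List<String> fields = null;            // non-null while inside "( ... )"
        StringBuilder field = new StringBuilder();
        boolean inString = false;
        for (int i = 0; i < values.length(); i++) {
            char c = values.charAt(i);
            if (inString) {
                if (c == '\\' && i + 1 < values.length()) {
                    // Simplification: keep the escaped character literally,
                    // which is right for \' and \\ but not for \n or \0.
                    field.append(values.charAt(++i));
                } else if (c == '\'') {
                    inString = false;
                } else {
                    field.append(c);
                }
            } else if (fields == null) {
                if (c == '(') {                // start of a new tuple
                    fields = new ArrayList<>();
                    field.setLength(0);
                }
            } else if (c == '\'') {
                inString = true;
            } else if (c == ',' || c == ')') {
                fields.add(field.toString());
                field.setLength(0);
                if (c == ')') {                // end of the current tuple
                    tuples.add(fields);
                    fields = null;
                }
            } else {
                field.append(c);
            }
        }
        return tuples;
    }

    /** Returns {pl_from, pl_namespace, pl_title}; a fourth field is ignored. */
    public static String[] toLink(List<String> tuple) {
        if (tuple.size() != 3 && tuple.size() != 4) {
            throw new IllegalArgumentException("Unexpected field count: " + tuple.size());
        }
        return new String[] { tuple.get(0), tuple.get(1), tuple.get(2) };
    }

    public static void main(String[] args) {
        // One old-style and one new-style tuple (made-up sample values).
        for (List<String> t : parseTuples("(12,0,'Foo'),(12,0,'Foo',0)")) {
            System.out.println(String.join(" | ", toLink(t)));
        }
    }
}
```

Rejecting any width other than 3 or 4 means a further layout change in the dumps would fail loudly instead of silently producing shifted fields.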

Original comment by oliver.ferschke on 16 Sep 2014 at 9:16