spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

Wikivoyage plugin introduces "undefined" text blocks #557

Closed: zhibek closed this issue 10 months ago

zhibek commented 11 months ago

Enabling the wtf-plugin-wikivoyage plugin introduces extra "undefined" text blocks into the parsed document.

Minimal example:

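// Run as an ES module (top-level await is used below)
import wtf from 'wtf_wikipedia';
import wtf_plugin_wikivoyage from 'wtf-plugin-wikivoyage';

// Register the Wikivoyage plugin; without this line the "undefined" block does not appear
wtf.extend(wtf_plugin_wikivoyage);

const doc = await wtf.fetch('https://en.wikivoyage.org/wiki/Ya_Nui');
// The second argument of JSON.stringify is the replacer; null means "no replacer"
console.log(JSON.stringify(doc.json(), null, 2));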

Output:

{
  "title": "Ya Nui",
  "pageID": 39948,
  "categories": [],
  "sections": [
    {
      "title": "",
      "depth": 0,
      "paragraphs": [
        {
          "sentences": [
            {
              "text": "undefined"
            },
            {
              "text": "Ya Nui Beach (หาดยะนุ้ย Hat Ya Nui) is a beach in Phuket.",
...

If the wtf-plugin-wikivoyage plugin is not used, the "undefined" text block isn't present.

Sample output without wtf-plugin-wikivoyage plugin active:

{
  "title": "Ya Nui",
  "pageID": 39948,
  "categories": [],
  "sections": [
    {
      "title": "",
      "depth": 0,
      "paragraphs": [
        {
          "sentences": [
            {
              "text": "Ya Nui Beach (หาดยะนุ้ย Hat Ya Nui) is a beach in Phuket.",
...
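
To check other pages for the same symptom, the two outputs above can be compared programmatically. The following is only a minimal sketch: findUndefinedSentences is a hypothetical helper (it is not part of wtf_wikipedia), and it assumes the sections / paragraphs / sentences shape shown in the output above. The sample object is a trimmed copy of that output, so the snippet runs without a network request.

```javascript
// Hypothetical helper (not part of wtf_wikipedia): walks the structure
// shown in the output above and reports sentences whose text is "undefined".
function findUndefinedSentences(json) {
  const hits = [];
  (json.sections || []).forEach((section, s) => {
    (section.paragraphs || []).forEach((paragraph, p) => {
      (paragraph.sentences || []).forEach((sentence, i) => {
        if (sentence.text === 'undefined') {
          hits.push({ section: s, paragraph: p, sentence: i });
        }
      });
    });
  });
  return hits;
}

// Trimmed copy of the output above, used as sample input.
const sample = {
  title: 'Ya Nui',
  sections: [
    {
      title: '',
      depth: 0,
      paragraphs: [
        {
          sentences: [
            { text: 'undefined' },
            { text: 'Ya Nui Beach (หาดยะนุ้ย Hat Ya Nui) is a beach in Phuket.' },
          ],
        },
      ],
    },
  ],
};

// Reports one hit at section 0, paragraph 0, sentence 0.
console.log(findUndefinedSentences(sample));
```

With a real document, the same helper could be called as findUndefinedSentences(doc.json()), once with the plugin enabled and once without.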

The wtf-plugin-wikivoyage plugin is quite small, so I tried selectively commenting out parts of it to see what introduces the problem. The pagebanner: (tmpl, list, parser) => { ... } block is the culprit. I don't know why it causes a problem, but I'll make a PR excluding this logic to illustrate. Possibly the logic used to parse pagebanner is incompatible with more recent changes in the wtf_wikipedia core?
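
One possible explanation, offered purely as an unverified assumption: if the core substitutes a template handler's return value back into the page text, a handler that pushes data to list but returns nothing would leave the string "undefined" behind. The sketch below illustrates that general mechanism only. replaceTemplates is a hypothetical stand-in written for this example, not wtf_wikipedia's actual code, and the handlers take just (tmpl, list) for brevity.

```javascript
// Hypothetical stand-in for a parser core; NOT wtf_wikipedia's actual code.
// It replaces each {{name|...}} with whatever the matching handler returns.
function replaceTemplates(wiki, handlers) {
  const list = [];
  const text = wiki.replace(/\{\{(\w+)[^}]*\}\}/g, (tmpl, name) => {
    const handler = handlers[name.toLowerCase()];
    // String() mimics naive concatenation: undefined becomes "undefined".
    return handler ? String(handler(tmpl, list)) : '';
  });
  return { text, list };
}

const wiki = '{{pagebanner|Banner.jpg}}Ya Nui Beach is a beach in Phuket.';

// Handler that stores data but does not return any replacement text.
const leaky = {
  pagebanner: (tmpl, list) => {
    list.push({ template: 'pagebanner' });
  },
};

// Same handler, explicitly returning an empty string.
const fixed = {
  pagebanner: (tmpl, list) => {
    list.push({ template: 'pagebanner' });
    return '';
  },
};

console.log(replaceTemplates(wiki, leaky).text); // "undefinedYa Nui Beach is a beach in Phuket."
console.log(replaceTemplates(wiki, fixed).text); // "Ya Nui Beach is a beach in Phuket."
```

If something like this is what happens, returning an empty string from the pagebanner handler would be a smaller change than removing the handler, but that would need to be confirmed against the plugin and core source.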