tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

spurious "canonical" forms with terms like "x or y" #327

Closed jmviz closed 11 months ago

jmviz commented 1 year ago

See e.g.

true or false:

"forms": [
  {
    "form": "true",
    "tags": [
      "canonical"
    ]
  },
  {
    "form": "false",
    "tags": [
      "canonical"
    ]
  }
],

feast or famine:

"forms": [
  {
    "form": "feast",
    "tags": [
      "canonical"
    ]
  },
  {
    "form": "famine",
    "tags": [
      "canonical"
    ]
  }
],

believe it or not:

"forms": [
  {
    "form": "believe it",
    "tags": [
      "canonical"
    ]
  },
  {
    "form": "not",
    "tags": [
      "canonical"
    ]
  }
],

kristian-clausal commented 1 year ago

Well, the reason why this happens is pretty obvious. Probably needs a comparison to the article title somewhere. I'll take a look.

EDIT: oh no, it's in decode_tags 😩

kristian-clausal commented 1 year ago

Turned out it was a trivial addition to the condition of an if block I made like six months ago to handle this exact problem, except for titles with commas! Hurray! Also, it wasn't in decode_tags.
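
For readers following along, here is a rough sketch of the kind of guard described above. It is not the actual wiktextract code: the function name `split_canonical_forms`, its parameters, and the separator list are all hypothetical, chosen only to illustrate the idea of skipping a separator when the article title itself contains it.

```python
def split_canonical_forms(head: str, title: str) -> list[str]:
    """Split a head line into alternative forms.

    Hypothetical helper, not wiktextract's API: a separator is ignored
    when it also occurs in the article title, because then it is part
    of the term itself rather than a divider between alternatives.
    """
    forms = [head]
    for sep in (", ", " or "):
        if sep in title:
            # The separator belongs to the entry name, so do not split on it.
            continue
        forms = [part.strip() for form in forms for part in form.split(sep)]
    # Drop any empty strings left over from splitting.
    return [form for form in forms if form]


print(split_canonical_forms("true or false", "true or false"))
print(split_canonical_forms("colour or color", "colour"))
```

With this guard, a title such as "true or false" stays a single form, while a head line listing real alternatives for a title without " or " is still split.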

kristian-clausal commented 12 months ago

Had a bug in this (which manifested an older bug that had possibly never triggered, but I digress) that should be fixed now: I searched for "or" in the title instead of " or " in the title, which broke a bunch of Latin words ending in -or for some reason. Added a couple of tests for this at the same time. EDIT: The kaikki production server was physically down for a while, so it took a while for this to show up.
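
The difference between the two checks can be shown with plain Python substring tests. This is only an illustration of the pitfall; the function names are made up for the example and the sample titles are assumptions, not taken from the wiktextract test suite.

```python
def title_has_or_buggy(title: str) -> bool:
    # Matches the bare letters "or" anywhere, including inside a word.
    return "or" in title


def title_has_or_fixed(title: str) -> bool:
    # Matches "or" only as a separate word surrounded by spaces.
    return " or " in title


for title in ("amor", "orator", "feast or famine"):
    print(title, title_has_or_buggy(title), title_has_or_fixed(title))
```

The bare check fires for any title that merely contains the letters "or", which is why single-word entries were affected; the space-delimited check only fires for multi-word titles that use "or" as a word.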

kristian-clausal commented 11 months ago

Kaikki has updated now, and it seems the data is better for now. Closing this; if you find anything similarly wrong relating to ' or ' in titles, just post here and I'll reopen.