spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

getIncoming() crashes for some pages #517

Open chris-gassner opened 1 year ago

chris-gassner commented 1 year ago

I'm trying to fetch incoming links for pages and some docs cause a crash when calling getIncoming().

Trying to fetch incoming links for the article 'Europe' fails with:

=-=- http response error =-=-=-
https://en.wikipedia.org/w/api.php?action=query&lhnamespace=0&prop=linkshere&lhshow=!redirect&lhlimit=500&format=json&origin=*&redirects=true&titles=Europe&lhcontinue=566556
FetchError: invalid json response body at https://en.wikipedia.org/w/api.php?action=query&lhnamespace=0&prop=linkshere&lhshow=!redirect&lhlimit=500&format=json&origin=*&redirects=true&titles=Europe&lhcontinue=566556 reason: Unexpected token < in JSON at position 0
    at X:\node-projects\wiki\node_modules\node-fetch\lib\index.js:273:32
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async getIncoming (X:\node-projects\wiki\node_modules\wtf-plugin-api\builds\wtf-plugin-api.cjs:110:31)
    at async X:\node-projects\wiki\index.js:385:22 {
  type: 'invalid-json'
}

Meanwhile, getIncoming() works for 'Javascript' or 'Briefcase', for example. I'm guessing this is related to the number of incoming links. The Europe article has 86,136 direct links according to https://linkcount.toolforge.org/?project=en.wikipedia.org&page=Europe&namespaces=

The article Python (programming language) has 9,467 links according to https://linkcount.toolforge.org/?project=en.wikipedia.org&page=Python%20(programming%20language)&namespaces= but I get back only 3,718 pageids when calling getIncoming().

Not a big deal, just thought I'd let you know though.

spencermountain commented 1 year ago

Hey Christoph, thanks for the good issue. Yeah, I think you're right about a timeout for some pages. The api plugin loops around and fetches things 500 at a time.

I looked into the Python example. The getIncoming method only returns pages that are Wikipedia articles (namespace 0), not other Wikipedia-internal stuff. I think the Python discrepancy is from User talk pages - haha, people are using this template on their profile pages.
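
To check where a count discrepancy like this comes from, one option is to tally the linkshere entries by their `ns` field. This is only a rough sketch: `countByNamespace` is a hypothetical helper, not part of wtf-plugin-api, and the sample entries are made up. It also assumes the query was run without `lhnamespace=0`, so that non-article namespaces are included in the results.

```javascript
// Hypothetical helper: tally linkshere entries by MediaWiki namespace number.
// Standard namespace ids: 0 = article, 2 = User, 3 = User talk.
function countByNamespace(links) {
  const counts = {}
  for (const { ns } of links) {
    counts[ns] = (counts[ns] ?? 0) + 1
  }
  return counts
}

// Made-up sample entries, in the { pageid, ns, title } shape of linkshere results.
const sample = [
  { pageid: 1, ns: 0, title: 'Some article' },
  { pageid: 2, ns: 3, title: 'User talk:Example' },
  { pageid: 3, ns: 3, title: 'User talk:Another' },
]

console.log(countByNamespace(sample))
```

If most of the missing links land under 3 (User talk), that would support the explanation above.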

Please let me know if you can track down other cases with missing articles. The Europe case needs some thinking. Maybe we could try lowering the limit from 500. The code is here if anyone is interested. cheers
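
For anyone who wants to experiment with the lower-limit idea, here is a minimal sketch of a paging loop that retries a batch with a smaller `lhlimit` whenever the response body is not valid JSON. It is not the plugin's actual code: `linksHereUrl`, `fetchAllIncoming`, the `fetchFn` option, and the `minLimit` floor are all hypothetical names, and the query parameters are simply copied from the failing URL in the report. Whether a smaller limit really avoids the HTML error response for Europe is untested.

```javascript
// Sketch only: names here are hypothetical, not wtf-plugin-api's implementation.
const API = 'https://en.wikipedia.org/w/api.php'

// Build the linkshere query, using the parameters seen in the failing URL.
function linksHereUrl(title, limit, lhcontinue) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'linkshere',
    lhnamespace: '0',
    lhshow: '!redirect',
    lhlimit: String(limit),
    format: 'json',
    origin: '*',
    redirects: 'true',
    titles: title,
  })
  if (lhcontinue) params.set('lhcontinue', lhcontinue)
  return `${API}?${params}`
}

// Page through all incoming links; on a non-JSON body, halve the batch size
// and retry the same batch instead of crashing.
async function fetchAllIncoming(title, { fetchFn = globalThis.fetch, limit = 500, minLimit = 50 } = {}) {
  const pageids = []
  let lhcontinue = null
  while (true) {
    const res = await fetchFn(linksHereUrl(title, limit, lhcontinue))
    const body = await res.text()
    let data
    try {
      data = JSON.parse(body)
    } catch (err) {
      // e.g. an HTML error page, as in "Unexpected token < in JSON at position 0"
      if (limit <= minLimit) {
        throw new Error(`non-JSON response for "${title}" even at lhlimit=${limit}`)
      }
      limit = Math.max(minLimit, Math.floor(limit / 2))
      continue
    }
    for (const page of Object.values(data.query?.pages ?? {})) {
      for (const link of page.linkshere ?? []) pageids.push(link.pageid)
    }
    lhcontinue = data.continue?.lhcontinue ?? null
    if (!lhcontinue) return pageids
  }
}
```

Calling `fetchAllIncoming('Europe')` on Node 18+ uses the built-in `fetch`. Passing a custom `fetchFn` makes it easy to stub the network or add a delay between requests.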