spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License
779 stars 129 forks source link

v9: Cleanup http interface #410

Open matkoniecz opened 3 years ago

matkoniecz commented 3 years ago

I run into SyntaxError: Unexpected token < in JSON at position 0

It seems that it may be caused by the same issue as #384 - where http error page was parsed as json.

I want to check is it actually happening but right now I have no idea how to get response code of a failing fetch.

// SyntaxError: Unexpected token < in JSON at position 0
wtf.fetch('https://wiki.openstreetmap.org/wiki/Tag:highway%3Dmotorway').then((doc) => {
    //console.log(doc.sentences(0).text())
    //console.log(doc.infobox())
    //console.log(doc.infobox().json())
})

// works
wtf.fetch('Poland').then((doc) => {
  console.log(doc.sentences(0).text())
  console.log(doc.infobox())
  console.log(doc.infobox().json())
})

/*
// from readme sample
// fails to parse infobox
wtf.fetch('https://muppet.fandom.com/wiki/Miss_Piggy').then((doc) => {
    console.log(doc.sentences(0).text())
    console.log(doc.infobox())
    if(doc.infobox() != undefined) {
        console.log(doc.infobox().json())
    }
})

*/
spencermountain commented 3 years ago

ah, thanks Mateusz, good catch. Will take a look this week. cheers

spencermountain commented 3 years ago

oh, looks like the openstreetmap api is using a custom path - this should do the trick:

wtf.fetch('https://wiki.openstreetmap.org/wiki/Tag:highway%3Dmotorway', { path: '/w/api.php' }).then((doc) => {
  console.log(doc.templates('ValueDescription'))
})

we're not recognizing the openstreetmap infoboxes, but we should - do you wanna work together on a openstreetmap plugin? That would be amazing.

spencermountain commented 3 years ago

yeah, and same for the 'Character' infobox template on the muppet wiki. Knowing the difference between a regular template and an infobox is really handy, but we're not doing a great job at it, clearly, for 3rd-party wikis especially. Open to any ideas.

matkoniecz commented 3 years ago

we're not recognizing the openstreetmap infoboxes, but we should - do you wanna work together on a openstreetmap plugin? That would be amazing.

What kind of info would be needed? Names of templates used as infoboxes or something else?

Open to any ideas.

I guess that one may try to detect inbox (first template with multiple parameters), but such heuristic sounds like something that will cause more work than will save. And as demonstrated by OSM Wiki not all infoboxes have "Infobox" in its name.

Translations pages on OSM Wiki are even more evil, with infobox loaded via black magic from local Wikidata database.

matkoniecz commented 3 years ago

oh, looks like the openstreetmap api is using a custom path - this should do the trick:

THANKS! I was really confused about what is happening.

spencermountain commented 3 years ago

yeah - i just went for a walk and was thinking the same thing - a heuristic about an early template, with a bunch of properties, and an 'image' or a 'name' one? That's a great idea.

I can add support for KeyDescription, ValueDescription, Place, Tag, RelatedTermList - are there any other heavily-used ones?

matkoniecz commented 3 years ago

and an 'image' or a 'name' one

Yeah, that should help.

I can add support for KeyDescription, ValueDescription, Place, Tag, RelatedTermList - are there any other heavily-used ones?

KeyDescription, ValueDescription Place - yes, infoboxes

Strictly speaking RelatedTermList is not an infobox, just list of related search terms in a weird formar to help searching work properly.

Template:Tag is going to be used inline - see for example infoboxless https://wiki.openstreetmap.org/wiki/Church - and has its twin Template:Key

wvanderp commented 3 years ago

While I think that solution here is to add a osmwiki plugin. I also think that the fetch function deserves an refactor as well.

I see a few problems with the code there:

Firstly the function signature is a mess and all the parameters can mean all kinds of stuff. Since we a heading to a major/braking release this would be the time to solidify the parameters.

Secondly, error handling can be better. rightnow the error gets given back to the callback but with the prommis is gets caught and logged and it returns success. and there is no way to see into the request error.

Thirdly, there is no way to add random headers directly. If you need to add a header to the list you first need to make a pr to the lib.

and lastly I think that the way we make the http client cross platform is a bit of a hack. Rightnow we use rollup to switch out the server file for the client file. but if you would only read the code then you would assume that this lib was node only. I would advocate checking the existence of window.fetch in code and the switch of that. or even better jet use an external lib like axios for better compatibility. although i also see the drawback of that like that the size of the lib would trippel and adding the first dependency.

spencermountain commented 3 years ago

Hey wouter, ya I agree the fetch method deserves a refactor and i welcome the help. This is a good time. I spent a long time looking at client-server symmetric fetch libraries and I couldn't get things working properly. Yeah- as you mentioned, bundling either http lib would be bad. Im fine with swapping them out in rollup, but maybe we can reduce the dark-magic of it somehow. It would be great if the http lib followed http redirects. Is that something we could do without much overhead? Cheers

spencermountain commented 3 years ago

in addition to allowing users to set custom headers, per @Vacilando , we should set this as a default header - https://www.mediawiki.org/wiki/API:Etiquette#Request_limit recommends to set Accept-Encoding: gzip

wvanderp commented 3 years ago

I was already looking at this for a bit but christmas got in the way. I already wrote out my usual wall of text and will post it tomorrow.

wvanderp commented 3 years ago

As I already said I'm busy with the fetch function. but as always I have a few questions and things I want to check.

Sigiture

When i'm looking at usages of the wtf.fetch on github. I see some old usage of callbacks but mostly promise based usage. I also see more use of language than options.

So my proposal is similar to the currently existing signature

/**
 *  fetches the page from the wiki and returns a Promise with the parsed wikitext
 *
 * @param {string | number | Array<number | string>} title the title, PageID, URL or an array of all three of the page(s) you want to fetch
 * @param {string} [language] the language of the wiki you want to use
 * @param {fetchDefaults} [options] the options for the fetch or the language of the wiki for the article
 * @param {Function} [callback] the callback function for the call
 * @returns {Promise<null | Document | Document[]>} either null if the pages is not found, Document if you asked for one result, and a array of Documents if you asked for multiple pages
 */
const fetch = function (title, language, options, callback) {

And with this signature there is no overlap in the parameters. And it it also preserves some of the compatibility.

Comapat

https://github.com/spencermountain/wtf_wikipedia/blob/8d83a9dda25956d2806ddeca9e4b6fdaf632e9d5/src/_fetch/00-parseUrl.js#L9

I was looking at the parseUrl Function and i was wondering what the //eslint-disable-line was covering up and it was eslint-compat.

It was trying to warn us that the URL class was only added as a global in node 10 and so we can not claim that we support node 7-9.

Easiest solution is bumping node to 10. The most difficult path would be to write our own URL parser. Middle path being importing a URL lib / polyfill.

I'm in favor of bumping to 10. I can't find any stats on node versions usage. But 10 is nearly end of life so I hope nobody will mind.

fetch library

Part of the reason to refactor this part is to clear up the usage of http libraries. So I sought a library that is the same for node and browser and optimised for size.

I settled on (https://www.npmjs.com/package/isomorphic-unfetch)[isomorphic-unfetch] which is a wrapper around the build in fetch function of browsers or a polyfill for older browsers and node-fetch for node. All of these use the same api so we don't have to know on what platform we are.

Also all options for the fetch are available so also redirects.

I also build the project and the size does not change much.

name old new diff
wtf_wikipedia.js 268 kb 271 kb +3 kb
wtf_wikipedia.mjs 252 kb 254 kb +1 kb
wtf_wikipedia-client.js 267 kb 270 kb +3 kb
wtf_wikipedia-client.js.map 524 kb 524 kb +0 kb
wtf_wikipedia-client.min.js 124 kb 117 kb -7 kb
wtf_wikipedia-client.mjs 124 kb 117 kb -7 kb

Options

The options parameter is another part of this refactor I’d like to fix. Rightnow if there is a small problem, for example a missing header, we need to build a new version with that header in the options->header part. With the introduction of fetch I could see that people want to change more options.

So what i propose is moving all the fetch options to a ‘fetchOptions’ key and putting that right in to fetch (with some defaults). And all the other options are for us to configure our code.

This way we can also type all the properties so the auto complete of editors will find some errors.

Code review

On an unrelated note, V9 adds a lot of changes everywhere. I’d like to see a code review before you release V9 and I would be down to do it.

In conclusion

These were my ideas for fetch.

In my research I implemented a large part of the chagens already. But, as always, your feedback is appreciated. I will also clean the other code in the _fetch folder.

spencermountain commented 3 years ago

YESSSSSSSSSSSSSSSSSSSS you are the best. Thank you. plz pr to the dev branch (if that's possible)

spencermountain commented 3 years ago

i scaled-back some tslint stuff since your last pr - there were so many red lines. I couldn't figure out how to fix them. I could switch that back on if you'd like. I was just confused

wvanderp commented 3 years ago

I was working on its ticket last year. 🤣 and i did com far. i have integrated the fetch lib and changed the build script a bit so it uses the right builds for each of our builds. but no i have decision to make.

Im working on the input of the fetch function. given that the wikipedia api does not let you request pages by title and page id in the same request. how do we handle this.

and I saw another ticket about bundling requests in array do we also what to integrate this now that i'm working on this

spencermountain commented 3 years ago

hey wouter, thanks for asking. Yeah, I'd like to move the fancy stuff like that to the api plugin. I'm okay with returning null, or whatever response wikimedia gives us for that. we can even scale the array stuff down in this release if you'd prefer. no rush on this stuff, i've gotten distracted by other things too. I haven't touched any of the http stuff either. thanks!