spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

Ability to recursively bring back all pages within a category #524

Closed by tstibbs 1 year ago

tstibbs commented 1 year ago

Putting this up as in-progress work for review.

I previously had some code that wrapped the getCategory function, recursively walked down through all categories and sub-categories, and returned every article under any of them. I don't see anything in the Wikimedia API that would help you do that server-side, hence doing it in code. I can't be the only person who wants to do this, so it seems sensible to maintain my code as part of wtf_wikipedia rather than just in my one project.
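For readers following along, the recursive walk described above could look roughly like the sketch below. This is not the code from this PR: `collectPages`, `fetchMembers` and the in-memory `fakeWiki` data are hypothetical names made up for illustration, and `fetchMembers` stands in for whatever per-category lookup the library performs.

```javascript
// Hypothetical in-memory stand-in for the category lookup. A real version
// would make one network request per category.
const fakeWiki = {
  'Category:Space': { pages: ['Outer space'], subcategories: ['Category:Astronauts'] },
  'Category:Astronauts': { pages: ['Neil Armstrong', 'Sally Ride'], subcategories: [] },
}

// Assumed helper shape: (category) => Promise<{ pages, subcategories }>
async function fetchMembers(category) {
  return fakeWiki[category] || { pages: [], subcategories: [] }
}

// Collect every page under `root` and under all of its sub-categories.
async function collectPages(root, getMembers = fetchMembers) {
  const seenCategories = new Set() // guards against visiting a category twice
  const pages = new Set() // de-duplicates pages listed in several categories

  async function visit(category) {
    if (seenCategories.has(category)) return
    seenCategories.add(category)
    const { pages: found, subcategories } = await getMembers(category)
    found.forEach((title) => pages.add(title))
    for (const sub of subcategories) {
      await visit(sub) // sequential on purpose, to keep the request rate low
    }
  }

  await visit(root)
  return [...pages]
}
```

Calling `collectPages('Category:Space')` resolves to a flat array of page titles gathered from the root category and everything beneath it.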

Outstanding work:

Notes about the effectiveness of this code:

Closes #521.

spencermountain commented 1 year ago

hey - this is good! does the recursive one always go ... until the end? That's awesome. Do you wanna support a max depth parameter? I'm thinking about doing Category:Person or something. how far do you think it goes? Maybe it would be fun to build a variant that returns the nested json.

Lemme know when I can merge and release this thing. Thank you!

tstibbs commented 1 year ago

> does the recursive one always go ... until the end? That's awesome. Do you wanna support a max depth parameter?

Yeah, it'll go right to the bottom with two exceptions:

I hadn't added a depth param because the depth of the recursion doesn't really mean anything semantically in terms of what you're asking it to bring back. I guess it would be useful as a safety net for any really big categories, though I suspect you wouldn't think to set the depth until one day when one of your category requests exploded - at which point you'd probably change your request to ask for more specific categories anyway. So I'm slightly in two minds about how useful such a param would be, but it'll be pretty simple to add while I'm in the code already so I'm happy to do that.

Question: error when depth exceeded, or just silently stop the recursion when you hit the max depth? I can see dis/advantages of both.

> Maybe it would be fun to build a variant that returns the nested json.

You mean it returns all the sub-categories and their members in something like a tree structure? That would be pretty simple if Wikipedia was a tree structure - but it isn't, it's a graph, and not even an acyclic graph. So an article could appear in multiple places in your tree of categories. Equally, where there is a loop in a category hierarchy, I'm not sure how you'd express that. I guess you could return a graph of all the categories and their members (somehow, the data structure would take some thought) but it might be difficult to process into something useful once you've got it.
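One possible shape for such a graph, offered only as a sketch, is a flat map from category name to its direct members. Loops and shared articles then show up as repeated names rather than as infinitely nested objects. Everything here (`buildCategoryGraph`, `lookupMembers`, the `loopyWiki` data) is hypothetical and not part of wtf_wikipedia.

```javascript
// Hypothetical data containing a loop (A -> B -> A) and a page that
// belongs to both categories.
const loopyWiki = {
  'Category:A': { pages: ['Shared page', 'Only in A'], subcategories: ['Category:B'] },
  'Category:B': { pages: ['Shared page'], subcategories: ['Category:A'] },
}

// Assumed helper shape: (category) => Promise<{ pages, subcategories }>
async function lookupMembers(category) {
  return loopyWiki[category] || { pages: [], subcategories: [] }
}

// Build a Map of category -> { pages, subcategories }, visiting each
// category exactly once even when the hierarchy contains loops.
async function buildCategoryGraph(root, getMembers = lookupMembers) {
  const graph = new Map()
  const queue = [root]
  while (queue.length > 0) {
    const category = queue.shift()
    if (graph.has(category)) continue // already recorded, so the loop ends here
    const { pages, subcategories } = await getMembers(category)
    graph.set(category, { pages: [...pages], subcategories: [...subcategories] })
    queue.push(...subcategories)
  }
  return graph
}
```

The caller still has to decide how to walk the result, which is the difficulty raised above: the map is easy to produce but carries no single "tree" reading.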

Regarding merging, I think the current changes are backwards-compatible so mergeable now anyway if you want, but I can probably add the max depth thing fairly quickly so up to you if you want to have a second PR for that or include it all here.

spencermountain commented 1 year ago

good point!

tstibbs commented 1 year ago

I'm happy to add the max depth thing if you want, just let me know. If not, then I think I'm done with my changes for now.

spencermountain commented 1 year ago

yeah, okay let's do it. just a safety brake that you can put on, but off by default? no error. like - 'grab this list of astronauts, but don't accidentally DDoS for 18 hours'. I can hit publish this afternoon. Big relief to have this done.
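A safety brake of this kind might be sketched as below: unlimited by default, and silently stopping (no error) once the limit is reached. The option name `maxDepth`, the function `collectPagesWithLimit` and the sample data are assumptions for illustration only. The actual option added in #525 is not shown in this thread.

```javascript
// Hypothetical three-level chain: Root -> Mid -> Leaf.
const chainWiki = {
  'Category:Root': { pages: ['r'], subcategories: ['Category:Mid'] },
  'Category:Mid': { pages: ['m'], subcategories: ['Category:Leaf'] },
  'Category:Leaf': { pages: ['l'], subcategories: [] },
}

// Assumed helper shape: (category) => Promise<{ pages, subcategories }>
async function chainMembers(category) {
  return chainWiki[category] || { pages: [], subcategories: [] }
}

// Level-by-level walk. maxDepth = 0 reads only the root category,
// maxDepth = 1 also reads its direct sub-categories, and so on.
async function collectPagesWithLimit(root, getMembers, { maxDepth = Infinity } = {}) {
  const seen = new Set([root])
  const pages = new Set()
  let level = [root]
  let depth = 0
  while (level.length > 0) {
    const next = []
    for (const category of level) {
      const { pages: found, subcategories } = await getMembers(category)
      found.forEach((title) => pages.add(title))
      if (depth >= maxDepth) continue // brake is on: do not descend further
      for (const sub of subcategories) {
        if (!seen.has(sub)) {
          seen.add(sub)
          next.push(sub)
        }
      }
    }
    level = next
    depth += 1
  }
  return [...pages]
}
```

With the sample data, passing `{ maxDepth: 1 }` collects pages from `Category:Root` and `Category:Mid` only, while omitting the option walks the whole chain.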

tstibbs commented 1 year ago

Added the depth thing in #525