spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License
770 stars 129 forks

Add support for revisionID #568

Closed: dagingaa closed this 10 months ago

dagingaa commented 10 months ago

This change adds initial support for a revisionID passed in through options. This is useful for detecting revision changes between two Wikipedia dumps, for example when running dumpster-dip on a monthly basis to keep a search database up to date (for RAG, say).
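To make the dump-comparison use case concrete, here is a minimal sketch. It assumes each dump has already been reduced to a Map of page title to revisionID; the helper name `diffRevisions` and the sample IDs are made up for illustration and are not part of wtf_wikipedia or dumpster-dip.

```javascript
// Hypothetical helper (not part of wtf_wikipedia or dumpster-dip):
// compare two Maps of title -> revisionID built from consecutive dumps.
function diffRevisions(previous, current) {
  const changed = []
  const added = []
  for (const [title, revisionID] of current) {
    if (!previous.has(title)) {
      added.push(title) // page did not exist in the older dump
    } else if (previous.get(title) !== revisionID) {
      changed.push(title) // page was edited between dumps
    }
  }
  // pages present in the older dump but missing from the newer one
  const removed = [...previous.keys()].filter((title) => !current.has(title))
  return { changed, added, removed }
}

// Made-up revision IDs, for illustration only
const lastMonth = new Map([
  ['Toronto Raptors', '1001'],
  ['Fubar', '372618'],
  ['Old page', '5'],
])
const thisMonth = new Map([
  ['Toronto Raptors', '1002'],
  ['Fubar', '372618'],
  ['New page', '7'],
])

console.log(diffRevisions(lastMonth, thisMonth))
```

Only the pages listed under `changed` and `added` would need re-parsing and re-indexing; everything else can be skipped.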

Mostly I just missed having this, and I plan to submit a follow-up PR to dumpster-dip to have it parse the revisionID and pass it in so I can use it.

Note that this change does not include updating the README and types yet. I will do that, but I wanted to wait for feedback on naming etc. first.

dagingaa commented 10 months ago

Mostly because regex is utterly unreadable, here's an explanation courtesy of ChatGPT:

Explanation:

spencermountain commented 10 months ago

this is spectacular. Thank you. I've put this on the dev branch, so it can make the next release, which should be in a few days. I've added TypeScript support for the new method. Feel free to document things as you see fit. cheers!

spencermountain commented 10 months ago

just kidding - this is released in 10.2.0. I will get to updating dumpster-dip this week. thanks for the help!

spencermountain commented 10 months ago

hey, could we also grab revisionID from the api when we do a fetch? @MarketingPip - wanna take a crack at it? this is a cool feature. cheers

MarketingPip commented 10 months ago

@spencermountain - sure can.

I don't think this messes anything up but - wanna take a look see?

https://en.wikipedia.org/w/api.php?action=query&prop=revisions%7Cpageprops&rvprop=content|ids&maxlag=5&rvslots=main&origin=*&format=json&redirects=true&titles=Toronto_Raptors

Note: the ids prop is added for future reference. I will make a PR in advance, run some tests, and see what else you want to grab. I will get the rev / parent id. And do you want an option to search via rev id as well?
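A rough sketch of what the query above involves: rebuilding the same URL with `ids` in `rvprop`, then reading the revision and parent IDs from the response. The helper names and the sample response are hypothetical (the IDs are made up), and the response shape assumed here is the default `format=json` layout, with pages keyed by page ID. This is not the code from the PR.

```javascript
// Rebuild the query from the comment above; "ids" in rvprop asks the API
// to include revid and parentid alongside the content.
function buildQueryUrl(title) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions|pageprops',
    rvprop: 'content|ids',
    maxlag: '5',
    rvslots: 'main',
    origin: '*',
    format: 'json',
    redirects: 'true',
    titles: title,
  })
  return `https://en.wikipedia.org/w/api.php?${params}`
}

// Hypothetical helper: pull the IDs out of a parsed API response.
// Assumes query.pages is an object keyed by page ID.
function extractRevisionIDs(response) {
  const pages = Object.values((response.query && response.query.pages) || {})
  return pages.map((page) => {
    const rev = (page.revisions || [])[0] || {}
    return {
      title: page.title,
      revisionID: rev.revid ?? null,
      parentID: rev.parentid ?? null,
    }
  })
}

// Hand-written sample in the assumed shape; all IDs are made up.
const sampleResponse = {
  query: {
    pages: {
      72879: {
        pageid: 72879,
        title: 'Toronto Raptors',
        revisions: [
          { revid: 1002, parentid: 1001, slots: { main: { '*': '...wikitext...' } } },
        ],
      },
    },
  },
}

console.log(buildQueryUrl('Toronto_Raptors'))
console.log(extractRevisionIDs(sampleResponse))
```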

MarketingPip commented 10 months ago

@spencermountain - I got most of the work done for getting revisionID. I will leave the query for looking up a specific revision to you (if you decide to support that).

That said - in a junk / play branch, I modified the tests / expected results for the Italian and CSGO Wikipedia pages. Tho - I am afraid this will cause issues when you build in future, once a revision changes and the ID is no longer the same. Let me know how you want me to modify the test & I will submit tomorrow or the next day etc.

spencermountain commented 10 months ago

ah, perfect. yeah, that's great. Are you thinking of this?

wtf('Fubar', {revisionID: '372618'})

to fetch an older version? never thought of that - that would be cool. As long as it doesn't get really complicated - go for it!

thanks for your help
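For what it's worth, fetching an older version mostly comes down to swapping `titles` for the MediaWiki API's `revids` parameter. A minimal sketch, assuming a hypothetical helper `buildRevisionUrl` (not something wtf_wikipedia exposes):

```javascript
// Hypothetical helper: query one specific revision by ID instead of by title.
// "revids" is a standard action=query parameter; the rest mirrors the
// parameters used elsewhere in this thread.
function buildRevisionUrl(revisionID, domain = 'en.wikipedia.org') {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    rvprop: 'content|ids',
    rvslots: 'main',
    format: 'json',
    origin: '*',
    revids: String(revisionID),
  })
  return `https://${domain}/w/api.php?${params}`
}

console.log(buildRevisionUrl('372618'))
```

The response would then carry the wikitext of that revision rather than the current one, so the rest of the parsing path could stay the same.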

MarketingPip commented 10 months ago

@spencermountain - I am grabbing current revision ID (but I will see about grabbing a previous version if it doesn't get messy).