openzim / wp1

Wikipedia 1.0 engine & selection tools
https://wp1.openzim.org
GNU General Public License v2.0

Question: filtered list of articles for a project #716

Closed tim-moody closed 7 months ago

tim-moody commented 7 months ago

Is it possible to get the list of articles for a project that have changed since some date?

For example, the list of all Project WikiMed articles is in medicine.tsv. I would like to have a list of only those articles that changed since some (recent) date.

kelson42 commented 7 months ago

@tim-moody Why not download the versions you need and make a diff?

tim-moody commented 7 months ago

The purpose is to calculate the list of articles that will change in the next version without having to create the next version to do so.

q = 'https://www.mdwiki.org/w/api.php?action=query&format=json&list=recentchanges&rclimit=max&rctoponly&rcprop=redirect|title'
q += '&rcnamespace=0&rcstart=now&rcend=' + since

This gets all changes made since the since date, but for the entire wiki.

For enwp I only want the pages associated with wiki project med.

I thought this might be functionality available in the wp1 api.
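
For illustration, here is a minimal Python sketch of how the wiki-wide query above could be narrowed to one project by intersecting its result with the project's title list. The helper names (`build_params`, `fetch_changed_titles`, `changed_project_titles`) are hypothetical, and it is an assumption that the titles in medicine.tsv differ from API titles only in using underscores for spaces. Note also that a wiki keeps recent changes only for a limited period, so this cannot reach back arbitrarily far.

```python
import json
import urllib.parse
import urllib.request


def build_params(since, cont=None):
    """Query parameters mirroring the URL above; `since` is an ISO 8601 timestamp."""
    params = {
        "action": "query",
        "format": "json",
        "list": "recentchanges",
        "rclimit": "max",
        "rctoponly": "1",
        "rcprop": "redirect|title",
        "rcnamespace": "0",
        "rcstart": "now",
        "rcend": since,
    }
    # Merge the continuation values returned by the previous response, if any.
    params.update(cont or {})
    return params


def fetch_changed_titles(api_url, since):
    """Collect the titles of all pages changed since `since` (needs network access)."""
    titles = set()
    cont = None
    while True:
        url = api_url + "?" + urllib.parse.urlencode(build_params(since, cont))
        req = urllib.request.Request(url, headers={"User-Agent": "changed-titles-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        for change in data.get("query", {}).get("recentchanges", []):
            titles.add(change["title"])
        cont = data.get("continue")
        if not cont:
            return titles


def normalize(title):
    """Assumption: list titles may use underscores where the API uses spaces."""
    return title.strip().replace("_", " ")


def changed_project_titles(project_titles, changed_titles):
    """Keep only the project titles that appear among the changed titles."""
    changed = {normalize(t) for t in changed_titles}
    return [t for t in project_titles if normalize(t) in changed]
```

With the project titles loaded from medicine.tsv, `changed_project_titles(project_titles, fetch_changed_titles(api_url, since))` would give the subset to refresh. On a wiki as busy as enwp the recent-changes stream is very large, so this is practical mainly for short intervals.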

kelson42 commented 7 months ago

OK, let's take it from the beginning: what is the problem you want to solve, and how is it related to building (offline) selections?

tim-moody commented 7 months ago

The purpose is to calculate the list of articles that will change in the next version without having to create the next version to do so.

This allows mdwiki-cacher to use its cache for articles that have not changed, and only query the source for articles that have changed since the last run.

kelson42 commented 7 months ago

How is this related to this project? As far as I know, the WP1 selection tools have "nothing" directly to do with MDwiki.

tim-moody commented 7 months ago

It would allow me to know which enwp articles have changed since the last time medicine.tsv was created. My understanding is that WP1 produces a list of articles from enwp for various projects. I am only asking whether that list could optionally be only articles that have changed since the last time the list was produced.

audiodude commented 7 months ago

We only keep track of the time when the article's quality or importance score changed, not when the article itself changed. Is that what you're looking for?

tim-moody commented 7 months ago

I'm looking for changes to the article, not the score. I'm trying to avoid reading every article in the list into my cache if the article hasn't changed since the last time I read it. I was hoping you would have an api for that, but it seems not.

kelson42 commented 7 months ago

Yes, it is not the role of this project to track changes on the upstream wiki in order to handle their cache strategy.

audiodude commented 7 months ago

With access to enwiki_p, the replica database, you could probably do something with rev ids, but I'm not sure.
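
As a rough sketch of what the rev id idea might look like: MediaWiki's `page` table stores each page's current revision id in `page_latest`, so comparing that value with the rev id recorded on the previous run would show which articles changed. The code below is an assumption-laden illustration, not something WP1 provides. It uses an in-memory sqlite3 table as a stand-in for the replica, and the helper names (`latest_rev_ids`, `changed_since_last_run`) and sample rev ids are hypothetical. Against the real replica you would need a MySQL driver, `%s` placeholders, a cursor, and probably decoding of byte-string titles.

```python
import sqlite3


def latest_rev_ids(conn, titles, chunk_size=500):
    """Map each main-namespace title to its current revision id (page.page_latest)."""
    result = {}
    titles = list(titles)
    for i in range(0, len(titles), chunk_size):
        chunk = titles[i:i + chunk_size]
        placeholders = ",".join("?" for _ in chunk)  # a MySQL driver would use %s
        sql = (
            "SELECT page_title, page_latest FROM page "
            "WHERE page_namespace = 0 AND page_title IN (" + placeholders + ")"
        )
        for title, rev_id in conn.execute(sql, chunk):
            result[title] = rev_id
    return result


def changed_since_last_run(cached, current):
    """Titles whose latest rev id differs from the cached one, or that are not cached yet."""
    return sorted(t for t, rev in current.items() if cached.get(t) != rev)


# Stand-in for the replica: an in-memory table with the same three columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_namespace INTEGER, page_title TEXT, page_latest INTEGER)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(0, "Aspirin", 1200), (0, "Heart_attack", 1350), (1, "Aspirin", 999)],
)

cached = {"Aspirin": 1200, "Heart_attack": 1300}  # rev ids recorded on the previous run
current = latest_rev_ids(conn, ["Aspirin", "Heart_attack"])
print(changed_since_last_run(cached, current))  # prints ['Heart_attack']
```

Storing the rev ids from each run alongside the cache would let the next run fetch only the titles this comparison returns.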