openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0

StackExchange Watcher does not correctly grab new dumps #1041

Open benoit74 opened 4 days ago

benoit74 commented 4 days ago

Due to changes in StackExchange processes, our watcher does not grab new dumps properly.

Originally, all dumps were pushed to https://archive.org/details/stackexchange

Latest dumps have dedicated URLs now:

Looks like we can search for these new identifiers with this URL: https://archive.org/services/search/beta/page_production/?user_query=subject:%22Stack%20Exchange%20Data%20Dump%22%20creator:%22Stack%20Exchange,%20Inc.%22&hits_per_page=1&page=1&sort=date:desc&aggregations=false&client_url=https://archive.org/search?query=subject%3A%22Stack+Exchange+Data+Dump%22+creator%3A%22Stack+Exchange%2C+Inc.%22

benoit74 commented 3 days ago

This is blocked by the fact that Sites.xml is no longer provided in the data dumps.

I've opened an issue upstream: https://meta.stackexchange.com/questions/404002/sites-xml-is-not-present-anymore-in-stackexchange-data-dumps

benoit74 commented 3 days ago

I've pushed the current changes (far from complete, but already capable of retrieving the most recent dump) to https://github.com/openzim/zimfarm/tree/new_dumps_url ; now waiting for an answer from SE on the upstream issue.