mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Bump version to 0.5.0-alpha with some improvements (Part 2) #60

Closed yzqzss closed 1 year ago

yzqzss commented 1 year ago
randomnetcat commented 1 year ago

It would probably be better to split this into smaller PRs?

Also, the change to the on-disk format to make it inconsistent with the original wikiteam tools seems perhaps not ideal (though I don't know for sure whether that's already been done separately in this fork).

yzqzss commented 1 year ago

> It would probably be better to split this into smaller PRs?

OK, I'll do that later.

> Also, the change to the on-disk format to make it inconsistent with the original wikiteam tools seems perhaps not ideal (though I don't know for sure whether that's already been done separately in this fork).

Writing size and sha1 (they are from API/index.php) to images.txt is necessary to make image/file downloads more stable, to support checking file integrity and to support incremental downloads (incremental downloads are a later work in progress).

images.txt format:
before: FileName\tFileURL\tUploader
after: FileName\tFileURL\tUploader\tSize\tSha1
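To illustrate how the extra columns could support integrity checks, here is a minimal sketch that parses one images.txt line in either layout and compares a downloaded file against the recorded values. The names `ImageRecord`, `parse_images_line` and `file_matches` are hypothetical and not taken from wikiteam3, and the sketch assumes that Size is a decimal byte count and Sha1 is a hex digest.

```python
import hashlib
from pathlib import Path
from typing import NamedTuple, Optional


class ImageRecord(NamedTuple):
    """One line of images.txt (hypothetical helper type)."""
    filename: str
    url: str
    uploader: str
    size: Optional[int]   # None for old three-column lines
    sha1: Optional[str]   # None for old three-column lines


def parse_images_line(line: str) -> ImageRecord:
    """Parse a tab-separated line in the old (3-field) or new (5-field) layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:
        filename, url, uploader = fields
        return ImageRecord(filename, url, uploader, None, None)
    if len(fields) == 5:
        filename, url, uploader, size, sha1 = fields
        # Assumption: Size is a decimal byte count, Sha1 a hex digest.
        return ImageRecord(filename, url, uploader, int(size), sha1.lower())
    raise ValueError(f"expected 3 or 5 tab-separated fields, got {len(fields)}")


def file_matches(record: ImageRecord, path: Path) -> bool:
    """Return True only if the file on disk has the recorded size and SHA-1."""
    if record.size is None or record.sha1 is None:
        return False  # old-format line: nothing to verify against
    if not path.is_file() or path.stat().st_size != record.size:
        return False
    return hashlib.sha1(path.read_bytes()).hexdigest() == record.sha1
```

A resume step could then skip files for which `file_matches` returns True and fetch the rest again. Old three-column lines cannot be verified this way, which is one way to see why dumps made with earlier versions are affected.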

wikiteam3 seems to have made some format changes, such as changing config.txt to config.json. The change in this PR prevents wikiteam3 from running --resume --images on dumps created before 0.5.0-alpha, but I think it is worth it: after all, the previous version even saved 404/502/429 and other error responses as normal files...
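The error-response problem mentioned above can be guarded against with a check like the following. This is only a sketch of the idea, not the code in this PR: `save_if_ok` is a hypothetical helper, and it assumes the caller already has the HTTP status code and body from whatever HTTP client is in use.

```python
from pathlib import Path


def save_if_ok(status_code: int, body: bytes, dest: Path) -> bool:
    """Write body to dest only for a 200 response; report whether it was saved."""
    if status_code != 200:
        # 404/429/502 and similar bodies are error pages, not the file itself.
        return False
    dest.write_bytes(body)
    return True
```

A caller could retry or log the file name when the function returns False instead of leaving an error page on disk under the image's name.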

elsiehupp commented 1 year ago

> wikiteam3 seems to have made some format changes, such as changing config.txt to config.json. The change in this PR prevents wikiteam3 from running --resume --images on dumps created before 0.5.0-alpha, but I think it is worth it: after all, the previous version even saved 404/502/429 and other error responses as normal files...

IIRC the reason I changed the config file to JSON is that using a standardized, well-supported configuration format like JSON simplifies the client code and makes it easier to debug. I don't think I changed much other than having it use JSON, and Python JSON just serializes and de-serializes the config dictionary as-is.
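For reference, the kind of round trip described here looks roughly like this with Python's standard json module. The function names and the config keys are made up for illustration and are not the actual wikiteam3 config schema.

```python
import json
from pathlib import Path


def save_config(config: dict, path: Path) -> None:
    """Serialize the config dictionary as-is to a JSON file."""
    path.write_text(json.dumps(config, indent=4, sort_keys=True), encoding="utf-8")


def load_config(path: Path) -> dict:
    """Read the JSON file back into a plain dictionary."""
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical config values, for illustration only.
example_config = {
    "api": "https://wiki.example.org/w/api.php",
    "images": True,
    "namespaces": ["all"],
}
```

This only round-trips cleanly when every value is a JSON-native type (strings, numbers, booleans, None, lists, and dicts with string keys); tuples, for example, come back as lists.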

elsiehupp commented 1 year ago

Regarding versioning: version numbers only really make a difference with dependency management, once we publish this on PyPI. (The README currently instructs users to force-overwrite the install, making version increments irrelevant.)

I originally intended Issue 7 and the prepare-for-publication branch to serve the purpose of getting things ready for PyPI, though I got a bit sidetracked along the way, and the branch is now wildly out of date.

Since PyPI distribution would make Wikiteam3 useful to a lot more people, I do think it's a good medium-term goal, if you'd like to discuss it on Issue 7.

yzqzss commented 1 year ago

> It would probably be better to split this into smaller PRs?

I've split this PR: #66 #67 #68 #69 #70
30176d5 and 5edd3ac depend on them, so new PRs for those two commits will have to wait until these PRs are merged.

robkam commented 1 year ago

PRs https://github.com/elsiehupp/wikiteam3/pull/66 https://github.com/elsiehupp/wikiteam3/pull/67 https://github.com/elsiehupp/wikiteam3/pull/68 https://github.com/elsiehupp/wikiteam3/pull/69 https://github.com/elsiehupp/wikiteam3/pull/70 are now merged.

yzqzss commented 1 year ago

> 30176d5 and 5edd3ac depend on them, so new PRs for those two commits will have to wait until these PRs are merged.

#88 created.