jojo2357 / kiwix-zim-updater

A script to check `download.kiwix.org` for updates to your local ZIM library.
GNU General Public License v2.0

Add new args #13

Closed jojo2357 closed 1 year ago

jojo2357 commented 1 year ago

Can now specify min/max file size, enable checksum verification, and skip the purge step.

Closes #6 #12 #11 #10

Jaifroid commented 1 year ago

Great PR! On testing it in dry-run mode, setting the -x option doesn't seem to do anything (it still simulates downloading full English Wikipedia, for example). It would be good if the simulation reflected the actual archives to be downloaded. I guess this is because in dry-run mode the file size is not checked?

Jaifroid commented 1 year ago

Further info: I ran the command like this: ./kiwix-zim.sh /mnt/w/ --skip-purge -x 500M. The skip-purge option worked (Purging Old ZIMs - Skipped in red), but in the list of simulated downloads I got:

  Download (mirror) : https://www.mirrorservice.org/sites/download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2023-04.zim
  ✓ Status : *** Simulated ***  ZIM doesn't exist on disk. Downloading...
Jaifroid commented 1 year ago

@jojo2357 Apologies, I was fooled by the fact that the script uses Git commands to reset itself to main before running... So ended up running main instead of the PR without realizing it.

Jaifroid commented 1 year ago

Had to comment out the block that switches back to main, and it works fine now:

  Download (mirror) : https://www.mirrorservice.org/sites/download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2023-04.zim
  ✓ Status : *** Simulated ***  ZIM larger than specified maximum size (maximum: 477Mi, download size: 94Gi). Skipping download.

Apologies for the confusion!
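As an aside, the `500M` → `477Mi` translation in the output above is just SI-to-binary unit conversion (500 × 10⁶ bytes ≈ 477 MiB). A minimal sketch of how such a max-size check could work, using GNU `numfmt` (variable names here are illustrative, not the script's actual code):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a max-size check like "-x 500M".
# "max_size" and "remote_bytes" are illustrative names only.

max_size="500M"                              # user-supplied, SI suffix
max_bytes=$(numfmt --from=si "$max_size")    # 500M -> 500000000 bytes

remote_bytes=100931731456                    # e.g. 94Gi reported for the download

if [ "$remote_bytes" -gt "$max_bytes" ]; then
  echo "ZIM larger than specified maximum size (maximum: $(numfmt --to=iec-i "$max_bytes"), download size: $(numfmt --to=iec-i "$remote_bytes")). Skipping download."
fi
```

`numfmt --to=iec-i` rounds 500,000,000 bytes to `477Mi`, which matches the message in the simulated run.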

Jaifroid commented 1 year ago

OK, I think this is a legit issue with the PR: at least in the simulated run, it appears to want to purge archives that were not downloaded due to the max-size check (see screenshot). In practice the purge should fail because the new file has not been downloaded (there is a comment `# Before we actually purge, we want to check that the new ZIM downloaded and exists.`), but the simulation should reflect that the archive won't be purged, because the download was skipped by a user option.

[Screenshot: simulated run showing archives queued for purge despite their replacements being skipped by the max-size check]
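The guard described by the quoted comment would look something like this sketch (function and variable names are hypothetical, not the PR's actual code): purge only when the replacement really exists on disk, and have the simulation take the same path.

```shell
# Hypothetical purge guard: never delete the old ZIM unless the
# replacement actually exists on disk, and make dry-run honor the
# same check. Names (maybe_purge, old_zim, new_zim) are illustrative.
maybe_purge() {
  local old_zim="$1" new_zim="$2" dry_run="$3"
  if [ ! -f "$new_zim" ]; then
    echo "Skipping purge: replacement was not downloaded ($new_zim missing)."
    return 1
  fi
  if [ "$dry_run" = "1" ]; then
    echo "*** Simulated ***  Would purge $old_zim"
  else
    rm -- "$old_zim"
  fi
}
```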

jojo2357 commented 1 year ago

Yea, the branch switching is a nuisance. If you clone my repo, it is easier to test, but you either have to commit and push to origin, or comment out that block if you make edits. The easiest fix is to just add a `false &&` to the first `if`.
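For illustration, the `false &&` trick looks like this (the real self-reset block in kiwix-zim.sh differs; this is just the shape of the idea):

```shell
# Prefixing a condition with "false &&" makes the branch unreachable,
# so the script can't reset itself to main while you test a PR branch.
# The block body here is illustrative, not the script's actual code.
switched=0
if false && [ "$(git branch --show-current 2>/dev/null)" != "main" ]; then
  git checkout main   # never reached: "false &&" short-circuits
  switched=1
fi
```

Because `&&` short-circuits, the `git` commands are never even invoked, so this is safe to leave in while testing.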

As for the simulation lying, I believe I fixed it: if the ZIM does not exist on disk but is skipped due to size constraints, the simulation now warns with extra detail.

Did you notice that it fetches updates remarkably faster ;)?

jojo2357 commented 1 year ago

Okay, I will update the readme tonight, so don't merge until I get a chance to revise it.

I'm glad to hear of the success; no checksums, though? That definitely could use a test. Can your BitTorrent client intentionally corrupt a download?

Jaifroid commented 1 year ago

I didn't test the checksum option. Will have a go and report back.

Jaifroid commented 1 year ago

Just ran a test on this, and checksum function works fine!

Your comment about BitTorrent corrupting files leads me to wonder whether you intended the checksum function to also test files that are not downloaded because they already exist on disk? Currently, of course, it only tests files it has downloaded.

Do you intend that to be an option? It could be a useful option to check the integrity of a whole ZIM library against expected checksums as provided by the download server. However, it should probably be a separate issue/PR, and definitely a separate commandline option given the amount of time it could take.

jojo2357 commented 1 year ago

Right now I only intend it as a way to verify the integrity of files downloaded during the current run. Everything in this PR was easy to hook into the current code flow, and verifying all files doesn't fit nicely into that flow, as I see it.

If I/we ever address #7, then some redesign will be necessary and would probably make verifying the entire library feasible.

One consideration, though: if you are constantly using this script to update, with the checksum option enabled, then after a while you will know all your files were downloaded validly (because each has been updated at least once). What use case would this extra feature serve then?

Jaifroid commented 1 year ago

I agree: definitely a separate issue and more complex to implement. NB I'm not advocating for this particularly: it was only prompted by your mention of BitTorrent.

The use case would be that the user may get downloads of larger files from other sources and applications. This is particularly the case in low-bandwidth situations (we recently had an email from a user in Cuba, where Kiwix archives are part of "el paquete", a bundle of offline data that Cubans swap on USB sticks). Such low-bandwidth situations are prevalent throughout much of the developing world outside of large metropolitan centres.

On the other hand, if a user is security-conscious, they probably know how to verify a file for themselves... And this tool is only likely to be used by enthusiasts and expert users.

DocDrydenn commented 1 year ago

Awesome work, guys. Thanks for the help. Just been swamped lately, so I apologize for my delays.

jojo2357 commented 1 year ago

> Okay, I will update the readme tonight, so don't merge until I get a chance to revise it.

Oh well, an extra commit will be needed.

As for the other things: it might actually be easy to generate a `library.sha256` and then just run `sha256sum`. Doing it file by file may be tricky, but in the current flow it could actually be quite easy 🤔. I wouldn't delete files that fail, just show the sum output.

Maybe. I'm not looking at it right now.
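The `library.sha256` idea could be sketched roughly like this. It assumes the server publishes a `<file>.zim.sha256` next to each ZIM in the standard `HASH  FILENAME` format; the URL layout, helper names, and paths are assumptions, not the PR's code.

```shell
# Rewrite a server checksum line ("HASH  FILENAME") so that the
# filename points at the local copy, letting sha256sum -c find it.
rewrite_sum_line() {
  local line="$1" local_path="$2"
  printf '%s\n' "$line" | awk -v p="$local_path" '{print $1 "  " p}'
}

# Build library.sha256 for every ZIM in a directory (needs network).
# base_url and the per-file .sha256 convention are assumptions.
build_library_manifest() {
  local lib_dir="$1" base_url="$2" out="$3"
  : > "$out"
  local zim name
  for zim in "$lib_dir"/*.zim; do
    name=$(basename "$zim")
    rewrite_sum_line "$(curl -fsSL "$base_url/$name.sha256")" "$zim" >> "$out"
  done
}

# Usage sketch: failing files are reported by sha256sum, not deleted.
#   build_library_manifest /mnt/w https://download.kiwix.org/zim/wikipedia library.sha256
#   sha256sum -c library.sha256
```

`sha256sum -c` does all the per-file comparison and reporting, which is what makes the library-wide variant cheap to bolt on.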

Jaifroid commented 1 year ago

> Awesome work, guys. Thanks for the help. Just been swamped lately, so I apologize for my delays.

Thanks for your great script! It would be good to find a way to promote it a bit more. Would you like me to post about it on the Kiwix Mastodon account and/or add a new post about the updates on the Kiwix Reddit (or you might prefer to do the latter)?

DocDrydenn commented 1 year ago

I posted it to a couple of subs on Reddit a while back but got very little "buzz"... I just figured it wasn't something others wanted or needed. That said, feel free to promote it however you wish. I wrote it to serve a need and saw no reason not to share it with others.

DocDrydenn commented 1 year ago

Oh, and sorry about jumping the gun on the merge... I forgot we were waiting on the Readme updates.