jojo2357 / kiwix-zim-updater

A script to check `download.kiwix.org` for updates to your local ZIM library.
GNU General Public License v2.0

Torrent feature #35

Closed: evrial closed this 1 month ago

evrial commented 8 months ago

A torrent feature would be dope, so as not to hammer the single server and to be more responsible. I think placing .torrent files in the target location would be enough to keep this simple.

DocDrydenn commented 8 months ago

Agreed.

We did discuss this before and should probably think about it again since a lot of these ZIMs are pretty big in size... torrents are probably a better way to download these.

@jojo2357 Most, if not all, torrent clients utilize a "watch" folder of some kind. Providing the script that path to download the '.torrent' file to would work to initiate the download. One downside to this would be that it probably wouldn't be feasible for this script to monitor the download of the actual torrent.

Another issue would be dealing with the completed download of the torrent. The moving of the downloaded ZIM would probably have to be put on the end user to deal with... i.e. most torrent clients allow scripts to be run when a download completes.
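
The watch-folder idea above can be sketched roughly as follows. This is a minimal illustration, not code from kzu: the function name `fetch_torrent` and its two arguments are hypothetical, and it assumes the torrent URL is the ZIM URL with `.torrent` appended and that the client only reacts to files ending in `.torrent`.

```shell
# Hypothetical sketch: drop a .torrent into a torrent client's watch folder.
# fetch_torrent and its arguments are illustrative, not part of kzu.
fetch_torrent() {
    zim_url=$1      # direct URL of the ZIM file
    watch_dir=$2    # folder the torrent client monitors
    name=${zim_url##*/}.torrent
    tmp="$watch_dir/.$name.part"

    # Download under a temporary name first so the client never sees a
    # half-written .torrent, then rename it into place.
    if wget -q -O "$tmp" "$zim_url.torrent"; then
        mv "$tmp" "$watch_dir/$name"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

From here on, the torrent client owns the download; the script has no view of its progress, which is exactly the limitation described above.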

jojo2357 commented 8 months ago

It would be quite easy to just place .torrent files in the zim dir, but then you lose a lot of the functionality of the script itself.

I just don't know how to do torrents automatically. There are so many clients, so the only thing that I could do would be to add the .torrent to a watch folder. But then how does the script run from there? It needs to know when the dl is complete in order to purge and/or calculate the checksums.

Perhaps a torrent option (as above) would work well in tandem with the min and max size options? So you can dl the small ones via https and then rerun and get the larger ones via torrent?

If you can demonstrate how to download a file via torrent from the command line that is mostly client-agnostic, then I would like to proceed that route. Otherwise it will be a semi-automated inserting of torrent files into a designated folder.

evrial commented 8 months ago

I guess people who use torrents are smart enough, so just download the torrent files and exit.

metametapod commented 3 months ago

Another option is IPFS, e.g.:

https://github.com/kiwix/ipfs-portal

https://github.com/ipfs/distributed-wikipedia-mirror

jojo2357 commented 3 months ago

Not sure what you mean. The wiki data is 3 years out of date https://en.wikipedia-on-ipfs.org/wiki/#distribution-footer

Jaifroid commented 3 months ago

Most torrent downloaders have SHA-based verification built in, so I think it would be enough to download the torrent file, as an option, say, for archives larger than X GB (2 GB?). The mirrorbrain software makes it very easy to get a torrent file (or a magnet link, but the former is most useful).

Probably easiest is that if the script is in "torrent" mode, it would never delete the original ZIM file, and would only download the latest torrent file for the update.

However, there's quite a lot of extra logic required to do this, so it might not be straightforward, and could end up getting messy.

evrial commented 3 months ago

> However, there's quite a lot of extra logic required to do this, so it might not be straightforward, and could end up getting messy.

Why? You go to https://download.kiwix.org/zim/wikipedia/ select any url and add .torrent to url and done.
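
As a small illustration of the rule just described (append `.torrent` to the ZIM URL), a helper could look like this. The name `torrent_url` is hypothetical and only shows the string manipulation; it does not download anything.

```shell
# Hypothetical helper: derive the .torrent URL for a ZIM download URL
# by appending ".torrent", as described in the comment above.
torrent_url() {
    case "$1" in
        *.torrent) printf '%s\n' "$1" ;;           # already a torrent URL
        *)         printf '%s.torrent\n' "$1" ;;
    esac
}

# Example with a URL that appears in this thread:
# torrent_url "https://download.kiwix.org/zim/wikipedia/Wikipedia_en_computer_maxi_2024-05.zim"
```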

metametapod commented 3 months ago

> Not sure what you mean. The wiki data is 3 years out of date https://en.wikipedia-on-ipfs.org/wiki/#distribution-footer

Nevermind, sorry about that. Doesn't look like there's a recent dataset.

This isn't client-agnostic but it may be more straightforward to implement support for aria2c, which allows downloading from both protocols at once:

aria2c "https://download.kiwix.org/zim/wikipedia/Wikipedia_en_computer_maxi_2024-05.zim.torrent" "https://download.kiwix.org/zim/wikipedia/Wikipedia_en_computer_maxi_2024-05.zim" --follow-torrent=true

jojo2357 commented 3 months ago

> Why? You go to https://download.kiwix.org/zim/wikipedia/ select any url and add .torrent to url and done.

I think what was implied is the logic to make that fit into the existing script. You are right to say that getting the torrent is very easy, but as previously mentioned, the idea is that the torrenting would happen in the user's preferred software and not thru kzu.

Some features afforded by using a torrent just do not make sense for a CLI app, like pausing and resuming a download on-demand, or

@Jaifroid is saying to add a new flag, which would need to interact with the -x flag, and then also need to integrate with all of the other options since we are using a totally different pathway.

I would be interested in feedback on using aria2c as @metametapod mentioned. I am not in any rush to stop using wget as it is a tried and true utility, but feedback is always appreciated.

evrial commented 3 months ago

A semi-automatic flag feature would be sufficient. Simply download the .torrent instead of the .zim. Why look for a more complex solution?

Jaifroid commented 3 months ago

@jojo2357 Thanks for the clarification, and that's indeed what I meant. @evrial that's also what I meant -- simply download .torrent instead of ZIM if a flag is set (and probably for ZIMs larger than a specified size).
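
The "torrent only above a certain size" idea can be sketched as a simple decision function. This is an assumption-laden illustration: `choose_url`, its arguments, and the idea of passing the size in bytes are hypothetical, and the thread does not say that kzu applies such a threshold.

```shell
# Hypothetical sketch: pick the .torrent for large archives and the
# direct download otherwise. Not taken from kzu.
choose_url() {
    zim_url=$1           # direct URL of the ZIM file
    size_bytes=$2        # size of the remote ZIM, in bytes
    threshold_bytes=$3   # at or above this size, prefer the torrent
    if [ "$size_bytes" -ge "$threshold_bytes" ]; then
        printf '%s.torrent\n' "$zim_url"
    else
        printf '%s\n' "$zim_url"
    fi
}

# Example: a 2 GiB threshold is 2147483648 bytes.
# choose_url "$url" "$size" 2147483648
```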

jojo2357 commented 1 month ago

@Jaifroid @evrial @metametapod @DocDrydenn

I pushed changes to the future branch. Please provide feedback and do some testing.

Essentially the -t flag will download a .torrent which will be up to you to manage. I have not tried edge cases, but largely it should work with the other opts like min and max size.

jojo2357 commented 1 month ago

Can I please get some feedback here? Does the script even work in practice with torrenting?

Jaifroid commented 1 month ago

@jojo2357 Oops, sorry, I'll take a look later today. Thanks for taking this idea forward.

jojo2357 commented 1 month ago

Perhaps in torrent mode, kzu can grab all the available .torrent files for verification (and seeding), similar to what it does to verify the library in normal mode. Thoughts?

Jaifroid commented 1 month ago

> Perhaps in torrent mode, kzu can grab all the available .torrent files for verification (and seeding), similar to what it does to verify the library in normal mode. Thoughts?

I'm not sure I understand? I'm sorry, ran out of time yesterday, but will try out asap.

jojo2357 commented 1 month ago

When presented with a library untouched by kzu, in normal mode, if you specify verify library, it will download the sha256 files for the existing files, regardless of whether they need updating.

Should that behavior be replicated in torrent mode, i.e. download all the .torrents even for existing ZIMs for the purpose of verifying and perhaps seeding?

Jaifroid commented 1 month ago

@jojo2357 I've finally had a chance to try out the torrent feature on my own ZIM library. It's great! I can now download the torrent files very quickly and then decide which ones I want to update simply by launching the file. It's great for a use-case where you have masses of files and don't necessarily want to keep everything updated to the latest automatically, but want more control. Also it means the script can do its job very quickly, and then you can leave it to your torrent software to do leisurely/background auto-resumable updates of very large ZIM files (if that's what you want).

So, I think it's a great solution for certain use-cases. It's probably not the main use case for kiwix-zim-updater, but it is certainly one I find valuable.

Just note a bug was exposed with dealing with archives with a "08" month in the filename. See screenshot. The bug doesn't appear to have affected anything in practice.

image

Jaifroid commented 1 month ago

> When presented with a library untouched by kzu, in normal mode, if you specify verify library, it will download the sha256 files for the existing files, regardless of whether they need updating.
>
> Should that behavior be replicated in torrent mode, i.e. download all the .torrents even for existing ZIMs for the purpose of verifying and perhaps seeding?

Personally I don't think we need to download SHA256 for torrents, as torrents have their own SHA and torrent software will validate what it has downloaded against the declared SHA unless the user specifically turns off this behaviour (for example, in qBittorrent). Since we are getting the .torrent from Kiwix, we can trust the declared SHA.

jojo2357 commented 1 month ago

so instead of downloading the sha, should old .zims have their .torrents downloaded for that verification and/or seeding in the same functionality?

jojo2357 commented 1 month ago

@Jaifroid can you specify your args and commit for that screenshot?

jojo2357 commented 1 month ago

Never mind, it's our good friend octal here to save the day
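
The "octal" remark most likely refers to shell arithmetic treating numbers with a leading zero as octal, which makes `08` and `09` invalid. The thread does not show where the script tripped or how it was fixed, so the following is only a sketch of one common workaround; `month_to_int` is a hypothetical name.

```shell
# Hypothetical sketch of the "08" pitfall. In shell arithmetic a leading
# zero means octal, so $((08 + 1)) is an error ("8" is not an octal digit).
# Stripping the leading zero before doing arithmetic avoids this.
month_to_int() {
    m=${1#0}        # drop one leading zero: "08" -> "8", "12" stays "12"
    m=${m:-0}       # a bare "0" would otherwise become empty
    printf '%s\n' "$(($m + 0))"
}

# In bash specifically, $((10#$1)) forces base 10 and has the same effect.
```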

Jaifroid commented 1 month ago

> so instead of downloading the sha, should old .zims have their .torrents downloaded for that verification and/or seeding in the same functionality?

Hmm. That's a more complicated one to decide. But for me, personally, it would be confusing to have a whole bunch of torrent files suddenly appearing in my ZIM directory and not knowing which ones are for new files I don't currently have. I'm only interested in running a torrent for the purposes of updating my very large files (ones that are quite useless to download via a script due to size and long-running of the download). For that I need to know that the torrents I've received are for updates. Flooding the directory with other torrent files would make the feature less useful IMHO.

jojo2357 commented 1 month ago

I suppose. Also, some torrent clients rename the .torrent to .torrent.added, and I cba to deal with that.

Jaifroid commented 1 month ago

> I suppose. Also, some torrent clients rename the .torrent to .torrent.added, and I cba to deal with that.

That might be software dependent. I haven't seen it. But indeed, we don't want to overcomplicate the -t function! I think the KISS principle applies...

jojo2357 commented 1 month ago

Last thing, would a mixed mode (used with the size params, ie small over http, large over torrent) be of use?

Jaifroid commented 1 month ago

> @Jaifroid can you specify your args and commit for that screenshot?

For the record, I ran ./kiwix-zim-updater.sh -t -p -d /mnt/w/. In retrospect, -p is redundant (it appears) in the -t context.

Jaifroid commented 1 month ago

> Last thing, would a mixed mode (used with the size params, ie small over http, large over torrent) be of use?

Unless it's easy to do, it might end up overcomplicating things. User who wants torrents can run with -t and then probably will select only the largest files anyway (time/benefit analysis of using .torrent over simply downloading). They can always then run the script in normal mode with size modifiers as a separate operation to mop-up updates not done over torrent (and if the torrent has already updated a file, running the script in normal mode won't then touch it).

jojo2357 commented 1 month ago

Right, running it twice is much easier and more explicit than adding extra complexity...

Do you have any reservations about the latest update? Can you confirm it actually works in a real situation, e.g. does rerunning kzu do nothing even with downloading?

Any other reservations? Did I get the README right?

Jaifroid commented 1 month ago

> Do you have any reservations about the latest update? Can you confirm it actually works in a real situation, e.g. does rerunning kzu do nothing even with downloading?

So I just tested running ./kiwix-zim-updater.sh -p /mnt/w/ (without -t) on the same directory which has the .torrent files in, and it recognizes that torrent files have been downloaded, and therefore doesn't offer to download the corresponding ZIM file conventionally. I think that's a good feature. If I'm selecting manually which of the torrent files to run in my torrent software, I can then delete those I don't want to use torrent software for, and run the script again without -t, and will get all the smaller files updated via simple download.
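
The behaviour described here (a ZIM is not offered for direct download while its .torrent sits in the directory) could be checked along these lines. How kzu actually detects this is not shown in the thread; the sketch assumes the torrent is stored next to the ZIMs as `<zim name>.torrent`, and `needs_direct_download` is a hypothetical name.

```shell
# Hypothetical sketch: decide whether a ZIM still needs a conventional
# download. Assumes torrents are saved as "<zim name>.torrent" in the
# same directory; this naming is an assumption, not confirmed by kzu.
needs_direct_download() {
    zim_dir=$1
    zim_name=$2
    # Nothing to do if the new ZIM is already present, or if its
    # .torrent is waiting to be handled by a torrent client.
    [ ! -e "$zim_dir/$zim_name" ] && [ ! -e "$zim_dir/$zim_name.torrent" ]
}
```

Under that assumption, deleting an unwanted .torrent makes the corresponding ZIM eligible for a normal download again on the next run, which matches the workflow described above.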

I ran the updater on my real-world ZIM archive folder. I launched one torrent file I wanted to update (a 4.5GB file) and it just finished updating. In sum, with the tests I've done, I don't have any reservations about the feature. It seems like a really useful addition, and is not over-complicated.

I'll review the README in the PR.