vericast / conda-mirror

Mirror upstream conda channels
BSD 3-Clause "New" or "Revised" License

Question about Incremental Syncing #60

Open chenghuzi opened 7 years ago

chenghuzi commented 7 years ago

Thanks! This tool is awesome! I'd like to keep my mirror up to date with the latest conda packages. Is it possible to check what has changed and download only the updated packages, rather than all of them?

pelson commented 7 years ago

cc: @pp-mo

ericdill commented 7 years ago

Hi @Tigeraus --

conda-mirror does not currently support what you are asking for. Such functionality would not be particularly challenging to implement. I'd be happy to guide you through the process of contributing such a feature if you are interested.

If you are not interested in contributing such a feature that's ok too, but I cannot make any guarantees about when I would have time to add this myself.

chenghuzi commented 7 years ago

@ericdill Actually, I don't have much time for this right now, but I'm interested and will work on it when I'm free. One way to implement this, I think, is to compute the md5 of every file in the repo, store those values somewhere, download only the files whose md5 values have changed, and then update the stored md5 set.
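The md5-manifest idea above can be sketched roughly as follows. This is only an illustration, not conda-mirror code: the helper names (`file_md5`, `load_manifest`, `packages_to_download`, `save_manifest`) and the JSON manifest file are assumptions made for the example, and `upstream_md5s` stands for whatever `{filename: md5}` mapping the upstream channel provides.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path):
    """Load the stored {filename: md5} mapping, or an empty one if none exists yet."""
    path = Path(manifest_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def packages_to_download(upstream_md5s, manifest):
    """Return filenames that are new or whose upstream md5 differs from the stored one."""
    return sorted(
        name for name, md5 in upstream_md5s.items()
        if manifest.get(name) != md5
    )


def save_manifest(manifest_path, manifest):
    """Persist the {filename: md5} mapping (hypothetical storage location)."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

After a successful download, the caller would update the manifest entry for each fetched file (for instance with `file_md5`) and call `save_manifest`, so that the next run only sees genuinely new changes.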

ericdill commented 7 years ago

@Tigeraus that's a good idea. My hesitation here is that repodata.json is the location where the md5 values are stored for local packages. I'd like to avoid inventing another place to store the md5 information from downloaded packages. Let me go check my conda-mirror logs and see if any package has actually ever failed the md5 check. I have not checked those logs in a while. I'll report back later today with what I find.
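Since repodata.json already records an md5 per package, one possible approach is to compare the upstream repodata.json against the local copy instead of keeping a separate store. The sketch below assumes both files have been parsed into dicts with a top-level `"packages"` mapping of filename to metadata containing an `"md5"` field; `diff_repodata` is a hypothetical helper, not part of conda-mirror.

```python
def diff_repodata(upstream, local):
    """Compare two parsed repodata.json dicts.

    Returns (to_download, to_remove):
      to_download: filenames that are new upstream or whose md5 changed
      to_remove:   filenames present locally but no longer listed upstream
    """
    upstream_pkgs = upstream.get("packages", {})
    local_pkgs = local.get("packages", {})

    to_download = sorted(
        name for name, meta in upstream_pkgs.items()
        if name not in local_pkgs
        or local_pkgs[name].get("md5") != meta.get("md5")
    )
    to_remove = sorted(name for name in local_pkgs if name not in upstream_pkgs)
    return to_download, to_remove
```

With this shape, the local repodata.json stays the single source of md5 information, and it would be rewritten only after the downloads succeed.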

pp-mo commented 7 years ago

That's interesting. I ran a small test of mirroring and then changing the config file for selective updating. A couple of things emerged:

Firstly, if a package is not in the (current) filter, conda-mirror removes the existing downloads of it, just as if it had been deleted in the upstream channel. That obviously needs adjusting for partial / incremental updates to work.

Secondly, even when you add only one package that was not previously there, all of the existing downloads are validated again. So when I re-added "pytest" to my mirror of the linux-64bit anaconda channel, it took just seconds to download the additional binaries, but then about 3 hours to re-validate everything ...
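The two adjustments described above could look roughly like the following planning step. It is a sketch under assumptions, not how conda-mirror currently behaves: `plan_incremental_update` is a hypothetical function, and the three inputs are plain collections of package filenames (what upstream lists, what is already on disk, and what the current filter selects).

```python
def plan_incremental_update(upstream_names, local_names, wanted_names):
    """Plan an incremental sync without touching packages outside the filter.

    - download: wanted packages that exist upstream but are not on disk yet
    - validate: only the newly downloaded packages, not the whole mirror
    - remove:   local packages that have disappeared upstream; packages that
                are merely outside the current filter are kept
    """
    upstream = set(upstream_names)
    local = set(local_names)
    wanted = set(wanted_names)

    download = (wanted & upstream) - local
    remove = local - upstream
    return {
        "download": sorted(download),
        "validate": sorted(download),
        "remove": sorted(remove),
    }
```

Limiting validation to the new files trades the full-mirror integrity check for speed, so an occasional full validation run would probably still be wanted.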

All of that is practical stuff, but I think there is also a serious question about the meaning and validity of incremental updates: the existing behaviour ensures that the result is just an up-to-date replica of (maybe part of) the upstream channel -- so the meaning of the result is totally unambiguous. That is no longer true if you allow partial updates ... For example, if I mirror anaconda, that is a kind of "curated" resource -- so I can expect that the package versions there are mutually compatible. But a partial update may break that -- and I can't see any automatic way of avoiding those problems, except doing a full update.