pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

Incremental Synchronization Issue with Bandersnatch #1663

Open lxyeternal opened 7 months ago

lxyeternal commented 7 months ago

I am currently using bandersnatch to mirror PyPI and have run into a question about incremental synchronization. I want my bandersnatch mirror to sync only new packages added to pypi.org, and never to delete packages from the local mirror when they are removed from pypi.org. In short: incremental backups only, with no deletions.

How do I configure bandersnatch.conf to achieve this?

cooperlees commented 7 months ago

You're in luck. bandersnatch does not delete unless you run a bandersnatch verify. So you get that by default.
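For reference, here is a minimal sketch of a bandersnatch.conf for a plain full mirror, which per the above will not delete packages unless `bandersnatch verify` is run. The directory path and worker count are placeholders, not recommendations:

```ini
; Minimal [mirror] section sketch -- values below are illustrative.
[mirror]
directory = /srv/pypi          ; where the mirror is written (placeholder path)
master = https://pypi.org      ; upstream to mirror from
json = true                    ; also mirror JSON metadata
timeout = 10                   ; per-request timeout in seconds
workers = 3                    ; parallel download workers
stop-on-error = false          ; keep going past individual package failures
```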

We do not have a feature to take only new packages created/added on PyPI today, but I am not sure that's what you mean. I would take a PR to do so, but I don't know the cleanest way. I guess pull down the full mirror list via the XML-RPC call we do, save all the package names, and use that as your starting point. Then, from there, compare against the original list and maybe turn the difference into an allow list?
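A rough sketch of that idea, assuming the public PyPI XML-RPC endpoint and its `list_packages()` method (which is rate-limited and may be further restricted in the future), with the baseline snapshot stored as a plain text file of package names, one per line:

```python
"""Sketch of the snapshot-and-diff approach described above: save the
package list once, then treat anything not in that snapshot as "new".
Endpoint and method are PyPI's public XML-RPC API; the baseline file
format is an assumption for illustration."""
from pathlib import Path
from xmlrpc.client import ServerProxy


def fetch_all_package_names() -> set[str]:
    # list_packages() returns every project name PyPI knows about.
    client = ServerProxy("https://pypi.org/pypi")
    return set(client.list_packages())


def load_baseline(path: Path) -> set[str]:
    # One package name per line; a missing file means no baseline yet.
    if not path.exists():
        return set()
    return set(path.read_text().split())


def new_packages(current: set[str], baseline: set[str]) -> set[str]:
    # Packages present on PyPI now that were absent from the snapshot.
    return current - baseline
```

The resulting set could then feed an allowlist filter plugin, as suggested above.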

This would need to be some sort of filter plugin to be accepted.

lxyeternal commented 7 months ago

Thank you very much. I only want to mirror all packages from pypi.org. My goal is to build a comprehensive dataset of the Python package registry for research.

allamiro commented 1 month ago

@lxyeternal and @cooperlees,

Another approach to consider is using a local SQLite database to track package metadata. During each sync, compare PyPI's current metadata against the database to identify new or updated packages, download only those, and update the database without deleting any local packages. This approach simplifies incremental synchronization and ensures no historical data is lost. Let me know what you both think.
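To make the idea concrete, here is a minimal sketch of such a tracker using Python's built-in sqlite3. The table name `tracked_packages` and the use of PyPI serials as the change marker are assumptions for illustration, not anything bandersnatch implements:

```python
"""Sketch of a SQLite tracker for incremental sync: a package is
(re)downloaded only if it is new or its serial has advanced, and rows
are never deleted, so removed packages stay on the local mirror."""
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tracked_packages (
               name   TEXT PRIMARY KEY,
               serial INTEGER NOT NULL   -- last serial synced for this package
           )"""
    )


def needs_sync(conn: sqlite3.Connection, name: str, serial: int) -> bool:
    # A package needs syncing if it is new or upstream reports a newer serial.
    row = conn.execute(
        "SELECT serial FROM tracked_packages WHERE name = ?", (name,)
    ).fetchone()
    return row is None or serial > row[0]


def record_sync(conn: sqlite3.Connection, name: str, serial: int) -> None:
    # Upsert only -- rows are never deleted, preserving historical packages.
    conn.execute(
        "INSERT INTO tracked_packages (name, serial) VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET serial = excluded.serial",
        (name, serial),
    )
```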

cooperlees commented 1 month ago

I'd need more information here on the implementation and the goals, with this being off by default, as most use cases would not benefit from this addition.

Also, how would you detect bad data from failed runs (crashes, etc.) and re-sync the SQLite database if that happened? This opens up a new data store to keep clean and up to date. State is hard.

allamiro commented 1 month ago

> I'd need more information here on the implementation and the goals, with this being off by default, as most use cases would not benefit from this addition.
>
> Also, how would you detect bad data from failed runs (crashes, etc.) and re-sync the SQLite database if that happened? This opens up a new data store to keep clean and up to date. State is hard.

I appreciate the feedback and acknowledge the valid concerns about the implementation and goals of incremental synchronization with Bandersnatch. I should clarify that I misspoke earlier about dirsync: after reviewing its documentation, it appears it may not be suitable for our needs. There are other Python libraries that could be leveraged instead, or we could develop a custom script. In particular, pyrsync offers functionality for incremental file synchronization that might be more appropriate for our use case. I will continue researching to find the best solution and make sure these concerns are addressed.