Unexpected error occurs and "generating global index page..." takes too much time

r00t1900 commented 2 years ago

case

The console output logs:

...
2022-04-07 08:08:54,922 INFO: yolkfolk no longer exists on PyPI (package.py:65)
2022-04-07 08:08:55,083 INFO: yuijfish no longer exists on PyPI (package.py:65)
2022-04-07 08:08:55,401 INFO: yuij-xiaoxiaolog no longer exists on PyPI (package.py:65)
2022-04-07 08:08:55,566 INFO: zju-hitcarder-xuhao no longer exists on PyPI (package.py:65)
2022-04-07 08:08:55,567 INFO: Generating global index page. (mirror.py:483)
2022-04-07 09:31:02,978 INFO: New mirror serial: 13353580 (mirror.py:507)
2022-04-07 09:31:03,218 INFO: 0 packages had changes (mirror.py:1043)
2022-04-07 09:31:03,218 INFO: Writing diff file to mirrored-files (mirror.py:1053)

From the logs, we can see that bandersnatch spent roughly 82 minutes (08:08:55 to 09:31:02) generating the global index page even though 0 packages had changes. This often happens when a bandersnatch run hits an error like:

2022-04-07 00:00:37,509 ERROR: Error syncing package: pl-nightly@13343947 (mirror.py:363) 
...

After this error happens, bandersnatch goes straight to generating the global index page and then finishes. However, I have to rerun bandersnatch, which generates the global index page once more (I don't know why), before the todo file is removed and things return to a normal state.
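
The time spent on the global index step can be checked by subtracting the two timestamps in the log above. A minimal Python sketch, assuming only the timestamp format shown in those log lines (the helper name `elapsed_minutes` is made up for illustration):

```python
from datetime import datetime

# Timestamp format at the start of each log line, e.g. "2022-04-07 08:08:55,567".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_minutes(start_line: str, end_line: str) -> float:
    """Return the minutes elapsed between two log lines."""
    # The timestamp occupies the first 23 characters of each line.
    start = datetime.strptime(start_line[:23], LOG_TIME_FORMAT)
    end = datetime.strptime(end_line[:23], LOG_TIME_FORMAT)
    return (end - start).total_seconds() / 60


index_start = "2022-04-07 08:08:55,567 INFO: Generating global index page. (mirror.py:483)"
index_end = "2022-04-07 09:31:02,978 INFO: New mirror serial: 13353580 (mirror.py:507)"

print(f"{elapsed_minutes(index_start, index_end):.1f} minutes")  # prints "82.1 minutes"
```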

questions

So here I have some questions:

1. I have set stop-on-error=False, so why does bandersnatch still appear to stop when it logs "ERROR: Error syncing package"?
2. I have set download-mirror, but recently bandersnatch often prints hints like "conducting to next uri" and then downloads from "https://files.pythonhosted.org", which is much slower. Why does this happen?
3. In my previous issue, one of the developers told me to add "generating_global_index=True" to avoid running "generating global index page" every time. However, it did not work, because I don't know exactly where to add this parameter.
4. I have now reached about 9.0T of data, and I have figured out why my previous download was only 8.61T: bandersnatch hit errors and stopped. Because of some network errors, bandersnatch went into a weird loop and jumped straight to "generating global index page", which made me think it had reached the end. However, this was just a false end.

r00t1900 commented 2 years ago

One more thing:

If I would like to ignore prerelease files, what should I do? I've noticed that there is a prerelease plugin. Does this plugin enable or disable prerelease downloads? What I need is to block prerelease downloads. Could someone explain? Thanks.

cooperlees commented 2 years ago

You're an inquisitive one ...

> I have set stop-on-error=False, so why does bandersnatch still appear to stop when it logs "ERROR: Error syncing package"?

When we error, we still log it. Maybe that's the confusion here.

> I have set download-mirror, but recently bandersnatch often prints hints like "conducting to next uri" and then downloads from "https://files.pythonhosted.org/", which is much slower. Why does this happen?

If the mirror you set does not have the file, I'm pretty sure we fall back. I'd have to read the code to be 100%.

Code will always answer your questions here ...
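
A rough illustration of the fall-back idea described above. This is not bandersnatch's actual implementation; it only sketches "try the configured download mirror first, then move on to the next source when the file is missing". Every name here (`fetch_with_fallback`, `fake_fetch`, the mirror URL) is hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def fetch_with_fallback(
    path: str,
    base_urls: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Tuple[str, bytes]:
    """Try each base URL in order and return the first one that has the file."""
    for base in base_urls:
        url = f"{base.rstrip('/')}/{path.lstrip('/')}"
        data = fetch(url)
        if data is not None:
            return url, data
        # File missing on this source: fall through to the next one.
    raise FileNotFoundError(path)


UPSTREAM = "https://files.pythonhosted.org"
MIRROR = "https://mirror.example.invalid"  # made-up download-mirror value


def fake_fetch(url: str) -> Optional[bytes]:
    """Hypothetical stand-in for an HTTP client: only the upstream has the file."""
    return b"wheel-bytes" if url.startswith(UPSTREAM) else None


url, data = fetch_with_fallback("packages/example.whl", [MIRROR, UPSTREAM], fake_fetch)
print(url)  # prints "https://files.pythonhosted.org/packages/example.whl"
```

With this ordering, a file that is missing from the mirror costs one failed request before the slower upstream download starts, which would match the behaviour reported in the question.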

> In my previous issue, one of the developers told me to add "generating_global_index=True" to avoid running "generating global index page" every time. However, it did not work, because I don't know exactly where to add this parameter.

I meant we'd have to change the code to support that option, as in, do a PR. That option does not exist today.

> I have now reached about 9.0T of data, and I have figured out why my previous download was only 8.61T: bandersnatch hit errors and stopped. Because of some network errors, bandersnatch went into a weird loop and jumped straight to "generating global index page", which made me think it had reached the end. However, this was just a false end.

Yes, bandersnatch is designed to be eventually consistent. We don't ever expect a perfect run every time. This is the internet after all.

> If I would like to ignore prerelease files, what should I do? I've noticed that there is a prerelease plugin. Does this plugin enable or disable prerelease downloads? What I need is to block prerelease downloads. Could someone explain? Thanks.

https://bandersnatch.readthedocs.io/en/latest/filtering_configuration.html#prerelease-filtering

It just reads metadata and does not download the pre-release versions of a package.
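
As a sketch of what enabling the filter might look like: the plugin is listed in the `[plugins]` section of the bandersnatch config. The snippet below uses Python's `configparser` only to show and parse such a fragment. The plugin name `prerelease_release` is an assumption based on the documentation page linked above, so check it there before relying on it.

```python
import configparser

# Assumed fragment of bandersnatch.conf; verify the plugin name against the linked docs.
CONFIG_TEXT = """
[plugins]
enabled =
    prerelease_release
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# A multi-line value comes back as a single string; split it into plugin names.
enabled_plugins = parser.get("plugins", "enabled").split()
print(enabled_plugins)  # prints "['prerelease_release']"
```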

Please feel free to submit any updates to documentation if you'd like to help make it more understandable.

pypa / bandersnatch

Unexpected error occurs and "generating global index page..." takes too much time #1108
