pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

bandersnatch mirror completeness #1622

Open ktry opened 9 months ago

ktry commented 9 months ago

Thanks so much for providing a means to mirror the PyPI repository!

After our latest run of bandersnatch mirror followed by bandersnatch verify --delete --json-update, our mirror is 13.3 TB in size. It was 17.7 TB before we ran the verify --delete operation. We found that some packages were not being updated even after many runs of bandersnatch mirror. One such package was poetry. We got it to update with bandersnatch sync poetry before we ran the verify --delete operation.

We are running bandersnatch 6.3.0 on Python 3.8.5. The latest verify operation took 17 days to complete and finished with a zero exit code. Our mirror appears incomplete compared to the stats reported on pypi.org. How can we assess the completeness of our mirror?

On our local mirror, web/simple/index.html has 371694 entries, web/simple has 372444 directories, and web/json has 357233 directories. The bandersnatch log reports that 1,049,164 files were fetched. https://pypi.org reports 498,484 projects and https://pypi.org/stats reports a total size of 18.2 TB.
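For anyone wanting to reproduce counts like these, here is a minimal sketch. It assumes that each project appears as one `<a href=...>` link in web/simple/index.html and that one child entry under web/simple or web/json corresponds to one project. The function names are made up for illustration and are not part of bandersnatch.

```python
from html.parser import HTMLParser
from pathlib import Path


class AnchorCounter(HTMLParser):
    """Count <a href=...> tags; assumed to be one per project in the index."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def count_index_links(index_html: Path) -> int:
    """Stream the index page line by line so a very large file is not loaded at once."""
    parser = AnchorCounter()
    with index_html.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            parser.feed(line)
    parser.close()
    return parser.count


def count_entries(directory: Path) -> int:
    """Count immediate children of a directory (files, directories, symlinks)."""
    return sum(1 for _ in directory.iterdir())


def coverage(local_projects: int, upstream_projects: int) -> float:
    """Percentage of upstream projects that are present locally."""
    return 100.0 * local_projects / upstream_projects
```

With the numbers above, `coverage(371694, 498484)` comes out at roughly 75%. The upstream count includes projects that may have no release files at all, so this is only a rough indicator.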

/etc/bandersnatch.conf:

[mirror]
directory = /mirror/sites/PyPI
json = true
release-files = true
cleanup = true
master = https://pypi.org
timeout = 20
global-timeout = 1800
workers = 3
hash-index = false
simple-format = ALL
stop-on-error = false
storage-backend = filesystem
verifiers = 3
compare-method = hash

cooperlees commented 9 months ago

Howdy.

Sorry to hear about your troubles. You've taken the brute-force approach to fixing your errors! But this is dedication (17 days of verify ...). I haven't run a verify since PyPI was around 1 TB and have wondered if it's even sane to do anymore.

I think step one is to see what error(s) you're hitting and work through them. Let's change the stop on error config option and do runs reporting what actual errors you're hitting.

stop-on-error = true

Did your verify get any errors too? I can't remember, but I think it respects stop-on-error too.

To get a report on completeness, we could add a report subcommand that goes through all the JSON metadata and looks for what is missing. It could also sync newer metadata from pypi.org as it walks the filesystem ... Would accept that PR.
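A rough sketch of what such a report could look like, written as a standalone script rather than a real bandersnatch subcommand. It assumes the filesystem layout where web/json/<project> holds the project's JSON metadata and each release file is stored under web/ at the same path as in its download URL (packages/...). The function names are hypothetical, and it only checks that files exist with the expected size; it does not verify hashes or fetch newer metadata from pypi.org.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple
from urllib.parse import urlparse


def expected_files(metadata: dict) -> Iterator[Tuple[str, int]]:
    """Yield (relative path, size) for every release file listed in the JSON."""
    for files in metadata.get("releases", {}).values():
        for entry in files:
            # URL path looks like /packages/aa/bb/<hash>/<filename>
            rel = urlparse(entry["url"]).path.lstrip("/")
            size = entry.get("size")
            yield rel, size if isinstance(size, int) else -1


def missing_files(web_root: Path) -> Iterator[Tuple[str, str]]:
    """Yield (project, relative path) for files that are absent or have the wrong size."""
    for json_path in sorted((web_root / "json").iterdir()):
        try:
            metadata = json.loads(json_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Unreadable or invalid metadata is reported rather than skipped.
            yield json_path.name, "<unreadable JSON metadata>"
            continue
        for rel, size in expected_files(metadata):
            local = web_root / rel
            if not local.is_file():
                yield json_path.name, rel
            elif size >= 0 and local.stat().st_size != size:
                yield json_path.name, rel
```

Iterating over `missing_files(Path("/mirror/sites/PyPI/web"))` would list the gaps relative to the local metadata only. Projects whose JSON was never fetched would not show up, so this complements a comparison against the upstream project list rather than replacing it.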

ktry commented 9 months ago

Here are the bandersnatch operations that we have run lately:

# Bandersnatch Fri Nov 10 12:27:16 MST 2023
# 2023-11-10_12:27:16 bandersnatch mirror
# Bandersnatch Sun Nov 12 09:13:05 MST 2023
# 2023-11-12_09:13:05 bandersnatch mirror --force-check
# Bandersnatch Sat Nov 18 08:49:20 MST 2023
# 2023-11-18_08:49:20 bandersnatch verify --delete --json-update

The verify --delete --json-update log has 2296109 lines and 7313 ERROR: lines. 7283 are of the form:

2023-12-01 07:54:37,648 ERROR: /mirror/sites/PyPI/web/json/normcl.new does not exist - Did not get new JSON metadata (verify.py:68)

The remaining 30 are of the form:

2023-11-26 18:13:02,713 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/21/c8/2b875df3750668fbd334c7d6904955d8f0bbfce23603ab6bc6ee88d9e084/fsleyes-1.4.3-py2.py3-none-any.whl (verify.py:175)

or of the form

2023-11-26 20:34:46,553 ERROR: Error syncing package: pytango (verify.py:38)
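A tally like the one above can be produced with a short script. This is a sketch that assumes the three message formats quoted here are the only ones of interest; anything else is counted as "other". The labels and function names are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

# One pattern per ERROR message form quoted in this comment.
PATTERNS = {
    "json_not_updated": re.compile(
        r"ERROR: \S+ does not exist - Did not get new JSON metadata"
    ),
    "download_failed": re.compile(
        r"ERROR: Continuing to next file after error downloading: (\S+)"
    ),
    "package_sync_failed": re.compile(r"ERROR: Error syncing package: (\S+)"),
}


def summarize_errors(lines: Iterable[str]) -> Counter:
    """Count ERROR lines per category; unmatched ERROR lines go to 'other'."""
    counts: Counter = Counter()
    for line in lines:
        if " ERROR: " not in line:
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts


def failed_packages(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated package names from 'Error syncing package' lines."""
    names = set()
    for line in lines:
        match = PATTERNS["package_sync_failed"].search(line)
        if match:
            names.add(match.group(1))
    return sorted(names)
```

Both functions accept any iterable of lines, so an open log file can be passed directly. The names returned by `failed_packages` could then be retried one at a time with bandersnatch sync, as was done for poetry.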

Here are snippets of all of the reported errors that resulted in tracebacks during the verify op. Is this helpful?

# 2023-11-18_08:49:20 bandersnatch verify --delete --json-update

2023-11-18 08:49:21,344 INFO: Starting verify for /mirror/sites/PyPI with 3 workers (verify.py:252)
2023-11-18 08:52:07,361 INFO: Parsing shuanpdf (verify.py:125)
2023-11-18 08:52:07,363 INFO: Fetching https://pypi.org/pypi/shuanpdf/json (master.py:149)

/SNIP/

2023-11-26 18:09:56,588 INFO: Fetching https://files.pythonhosted.org/packages/09/06/896687cc1c5098dc5bc6beaaf679a5f7564cb2afc2523f8c06d61e9b874f/fsleyes-1.4.1-py2.py3-none-any.whl (master.py:149) 2023-11-26 18:10:27,905 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/09/06/896687cc1c5098dc5bc6beaaf679a5f7564cb2afc2523f8c06d61e9b874f/fsleyes-1.4.1-py2.py3-none-any.whl (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 18:10:27,996 INFO: Fetching https://files.pythonhosted.org/packages/2e/7e/7cd5ab387eb7f532eff87dd71f6abd71b87ecfeac582b809496eb495bcf3/fsleyes-1.4.1.tar.gz (master.py:149) 2023-11-26 18:11:11,702 INFO: Fetching https://files.pythonhosted.org/packages/58/fc/828b23c7361f4c935391f58d5f77635c70559637023e573e680fd8599b23/fsleyes-1.4.2-py2.py3-none-any.whl (master.py:149) 2023-11-26 18:11:48,651 INFO: Fetching https://files.pythonhosted.org/packages/71/4a/fe3856ee78f61924044bdc9058bb5b6652ea82af90c46aa32c482227e0ae/fsleyes-1.4.2.tar.gz (master.py:149) 2023-11-26 18:12:23,426 INFO: Fetching https://files.pythonhosted.org/packages/21/c8/2b875df3750668fbd334c7d6904955d8f0bbfce23603ab6bc6ee88d9e084/fsleyes-1.4.3-py2.py3-none-any.whl (master.py:149) 2023-11-26 18:13:02,713 ERROR: Continuing to next file after error downloading: 
https://files.pythonhosted.org/packages/21/c8/2b875df3750668fbd334c7d6904955d8f0bbfce23603ab6bc6ee88d9e084/fsleyes-1.4.3-py2.py3-none-any.whl (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 18:13:02,767 INFO: Fetching https://files.pythonhosted.org/packages/a9/c0/d3a78eb0dd781d64d9af706f466c42f507a6bec4069bca3e3e32f6bb2ae6/fsleyes-1.4.3.tar.gz (master.py:149) 2023-11-26 18:13:19,152 INFO: Fetching https://files.pythonhosted.org/packages/41/8b/f419746e60721f37d263247c06e8417a72c1650bb35a41d7c1d1beb5c819/fsleyes-1.4.4-py2.py3-none-any.whl (master.py:149)

/SNIP/

2023-11-26 18:43:32,980 INFO: Fetching https://files.pythonhosted.org/packages/e5/e1/254288af765910269ec6f9ea39e222c3d67de84617f79b1e63c4ba6a75c1/MeUtils-2023.11.20.13.42.41-py3-none-any.whl (master.py:149) 2023-11-26 18:44:30,557 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/e5/e1/254288af765910269ec6f9ea39e222c3d67de84617f79b1e63c4ba6a75c1/MeUtils-2023.11.20.13.42.41-py3-none-any.whl (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 18:44:30,566 INFO: Fetching https://files.pythonhosted.org/packages/6a/cc/9895b13fe2203934567a3c010a12cbb96181be4421f77a2162f2ea2529ba/MeUtils-2023.11.20.13.42.41.tar.gz (master.py:149) 2023-11-26 18:45:04,176 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/6a/cc/9895b13fe2203934567a3c010a12cbb96181be4421f77a2162f2ea2529ba/MeUtils-2023.11.20.13.42.41.tar.gz (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File 
"/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 18:45:04,230 INFO: Fetching https://files.pythonhosted.org/packages/f8/d6/6b68ca80f9c9b51b474063acbe86f2fa9146e606d620a3c76e392eb6f7eb/MeUtils-2023.11.20.13.43.23-py3-none-any.whl (master.py:149) 2023-11-26 18:45:21,930 INFO: Fetching https://files.pythonhosted.org/packages/c7/f0/433c3bb165d2e0a39bfa2b5c446de67fd696e32299f3a96b1b5352b5fcba/MeUtils-2023.11.20.13.43.23.tar.gz (master.py:149) 2023-11-26 18:45:48,489 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/c7/f0/433c3bb165d2e0a39bfa2b5c446de67fd696e32299f3a96b1b5352b5fcba/MeUtils-2023.11.20.13.43.23.tar.gz (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 18:45:48,550 INFO: Fetching https://files.pythonhosted.org/packages/d2/40/0d3b2636e4057a599b548aa0ec510e0c78650389a348036b7833490a8611/MeUtils-2023.11.20.13.50.9-py3-none-any.whl (master.py:149) 2023-11-26 18:45:49,792 INFO: Fetching https://files.pythonhosted.org/packages/32/af/579db493ffa5c4df0a9333f76d4a71f153bebead7bdef47ec28e935f2e13/MeUtils-2023.11.20.13.50.9.tar.gz (master.py:149)

/SNIP/

2023-11-26 19:29:47,436 INFO: Fetching https://files.pythonhosted.org/packages/e8/e0/6b7668c4a41e2d129514321ad1343e99347771a6278085fd2e4ee4b5ff81/deepforest-1.2.2-py3-none-any.whl (master.py:149) 2023-11-26 19:30:07,580 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/e8/e0/6b7668c4a41e2d129514321ad1343e99347771a6278085fd2e4ee4b5ff81/deepforest-1.2.2-py3-none-any.whl (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch async with self.session.get(url) as response: File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in aenter self._resp = await self._coro File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request await resp.start(conn) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start message, payload = await self._protocol.read() # type: ignore File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read await self._waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 19:30:07,640 INFO: Fetching https://files.pythonhosted.org/packages/17/18/c8969eab432faa19508877fdfbf2ab2852d02bcc5b5d7c4203b81586ab26/deepforest-1.2.2.tar.gz (master.py:149) 2023-11-26 19:30:28,679 INFO: Fetching https://files.pythonhosted.org/packages/ed/9e/e007b234e72a83f3f15233c77d5c9311d3181c567ecf5e3ef7dba95d85e4/deepforest-1.2.3-py3-none-any.whl (master.py:149) 2023-11-26 19:30:44,727 INFO: Fetching https://files.pythonhosted.org/packages/c9/b7/15138ed10b1480e20e85e1947ce6d7b217e250c67a64449419bd4039e8b7/deepforest-1.2.3.tar.gz (master.py:149) 2023-11-26 19:31:20,503 ERROR: Continuing to next file after error 
downloading: https://files.pythonhosted.org/packages/c9/b7/15138ed10b1480e20e85e1947ce6d7b217e250c67a64449419bd4039e8b7/deepforest-1.2.3.tar.gz (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 19:31:20,569 INFO: Fetching https://files.pythonhosted.org/packages/7f/3f/12427d5153e4f9b7321f175713fcc6268f7493d3cae92f2febb26f45a4c3/deepforest-1.2.4-py3-none-any.whl (master.py:149) 2023-11-26 19:31:50,354 INFO: Fetching https://files.pythonhosted.org/packages/ed/3d/0092384e54dd868c48f56d3eed1bbab1675df5598ca1a66f183156dca7c5/deepforest-1.2.4.tar.gz (master.py:149)

/SNIP/

2023-11-26 19:57:14,817 INFO: Fetching https://files.pythonhosted.org/packages/2f/f4/97bd5e9d29f404b1ebbf33877b90a20f42a33554e2aa277922432395b397/unitem-1.2.6-py2.py3-none-any.whl (master.py:149)
2023-11-26 19:57:35,810 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/2f/f4/97bd5e9d29f404b1ebbf33877b90a20f42a33554e2aa277922432395b397/unitem-1.2.6-py2.py3-none-any.whl (verify.py:175)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
    await master.url_fetch(jpkg["url"], pkg_file, executor)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
    chunk = await response.content.read(chunk_size)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
    await self._wait("read")
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
    await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 19:57:35,884 INFO: Fetching https://files.pythonhosted.org/packages/3b/47/047f6ce12947e57cded1cd3579ea3fa8b2b15d06e753fc02a6522598db88/unitem-1.2.6-py3.8.egg (master.py:149)
2023-11-26 19:57:50,047 INFO: Fetching https://files.pythonhosted.org/packages/c7/a2/f4881a76703671bace3524f84d64d65fa0766fc16d207fb778ad99e5b3ed/unitem-1.2.6.tar.gz (master.py:149)
2023-11-26 19:57:53,309 ERROR: Error syncing package: unitem (verify.py:38)
NoneType: None
2023-11-26 19:57:53,309 INFO: Finished validating unitem (verify.py:198)

/SNIP/

File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch async with self.session.get(url) as response: File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in aenter self._resp = await self._coro File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request await resp.start(conn) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start message, payload = await self._protocol.read() # type: ignore File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read await self._waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 20:24:16,515 INFO: Fetching https://files.pythonhosted.org/packages/72/8a/2c078705d8da1c91724345912d77a6615318cb44eb387e0ff59dfe13f7f0/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (master.py:149) 2023-11-26 20:24:39,432 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/72/8a/2c078705d8da1c91724345912d77a6615318cb44eb387e0ff59dfe13f7f0/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 20:24:39,492 INFO: Fetching 
https://files.pythonhosted.org/packages/ef/ff/ddfd7213c79601f41a8635ae3af75336c7299ca94ba4553b187149b312f6/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_i686.manylinux2014_i686.whl (master.py:149) 2023-11-26 20:25:37,691 INFO: Fetching https://files.pythonhosted.org/packages/56/58/79abb1870d26bd78ae017fe81e46a659bcd63aeb3e190603553a0d25f77e/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (master.py:149) 2023-11-26 20:26:03,133 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/56/58/79abb1870d26bd78ae017fe81e46a659bcd63aeb3e190603553a0d25f77e/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 20:26:03,211 INFO: Fetching https://files.pythonhosted.org/packages/a5/3e/d98fb4b02f0c05d777b0ed4c664934757bb06fb5a7f25034c12843e4ce6b/pytango-9.4.1rc1-cp36-cp36m-win32.whl (master.py:149) 2023-11-26 20:26:05,801 INFO: Fetching https://files.pythonhosted.org/packages/30/fc/b830a9d2e4b6a03889180a81df133c52e34dd289e78805b1be9b7f5fe483/pytango-9.4.1rc1-cp36-cp36m-win_amd64.whl (master.py:149) 2023-11-26 20:26:07,940 INFO: Fetching https://files.pythonhosted.org/packages/51/f5/8b56ac422444dd2a27ade4799fd3aeb9c2fef2307c8f7dafadc87b54fc2f/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (master.py:149) 
2023-11-26 20:26:30,764 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/51/f5/8b56ac422444dd2a27ade4799fd3aeb9c2fef2307c8f7dafadc87b54fc2f/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 20:26:30,855 INFO: Fetching https://files.pythonhosted.org/packages/6f/68/a7166d9406c90d1a707e3bf15671faba0683e807d3910194f9d57a9e688c/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl (master.py:149) 2023-11-26 20:26:58,134 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/6f/68/a7166d9406c90d1a707e3bf15671faba0683e807d3910194f9d57a9e688c/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl (verify.py:175) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify await master.url_fetch(jpkg["url"], pkg_file, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch chunk = await response.content.read(chunk_size) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read await self._wait("read") File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait await waiter 
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-26 20:26:58,207 INFO: Fetching https://files.pythonhosted.org/packages/e0/f5/fdc1a5fa1c9ea204316d39dd6e7051a7553ea6be4d4d9d2d1029d0c0880f/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (master.py:149) 2023-11-26 20:27:37,781 INFO: Fetching https://files.pythonhosted.org/packages/71/9a/26b822f72747aedb03216181626e8eb66ff358b91d6235c0a6159496cf65/pytango-9.4.1rc1-cp37-cp37m-win32.whl (master.py:149)

/SNIP/

self._resp = await self._coro

File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request await resp.start(conn) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start message, payload = await self._protocol.read() # type: ignore File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read await self._waiter aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket 2023-11-30 13:37:22,869 INFO: Finished validating micro-py (verify.py:198) 2023-11-30 13:37:22,870 INFO: Parsing aiohttp-dynamic (verify.py:125) 2023-11-30 13:37:22,870 INFO: Fetching https://pypi.org/pypi/aiohttp-dynamic/json (master.py:149) 2023-11-30 13:37:26,384 ERROR: Error syncing package: aiohttp-dynamic (verify.py:38) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 969, in _wrap_create_connection return await self._loop.create_connection(*args, **kwargs) # type: ignore # noqa File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection transport, protocol = await self._create_connection_transport( File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1080, in _create_connection_transport await waiter File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received data = self._sock.recv(self.max_size) ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify await get_latest_json(master, json_full_path, executor, args.delete) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json await master.url_fetch(url, new_json_path, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch async with self.session.get(url) as response: File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in aenter self._resp = await self._coro File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request conn = await self._connector.connect( File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect proto = await self._create_connection(req, traces, timeout) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _createconnection , proto = await self._create_proxy_connection(req, traces, timeout) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection transport, proto = await self._wrap_create_connection( File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection raise client_error(req.connection_key, exc) from exc aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer] 2023-11-30 13:37:27,073 INFO: Finished validating aiohttp-dynamic (verify.py:198) 2023-11-30 13:37:27,073 INFO: Parsing threadactive (verify.py:125) 2023-11-30 13:37:27,073 INFO: Fetching https://pypi.org/pypi/threadactive/json (master.py:149) 2023-11-30 13:37:30,587 ERROR: Error syncing package: threadactive (verify.py:38) Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 969, in 
_wrap_create_connection return await self._loop.create_connection(*args, **kwargs) # type: ignore # noqa File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection transport, protocol = await self._create_connection_transport( File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1080, in _create_connection_transport await waiter File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received data = self._sock.recv(self.max_size) ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify await get_latest_json(master, json_full_path, executor, args.delete) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json await master.url_fetch(url, new_json_path, executor) File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch async with self.session.get(url) as response: File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in aenter self._resp = await self._coro File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request conn = await self._connector.connect( File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect proto = await self._create_connection(req, traces, timeout) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _createconnection , proto = await self._create_proxy_connection(req, traces, timeout) File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection transport, proto = await self._wrap_create_connection( File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection raise client_error(req.connection_key, exc) from exc aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer] 2023-11-30 13:37:31,075 INFO: Finished validating threadactive (verify.py:198) await waiter File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received data = self._sock.recv(self.max_size) ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
    await get_latest_json(master, json_full_path, executor, args.delete)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
    await master.url_fetch(url, new_json_path, executor)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
    async with self.session.get(url) as response:
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request
    conn = await self._connector.connect(
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _create_connection
    _, proto = await self._create_proxy_connection(req, traces, timeout)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection
    transport, proto = await self._wrap_create_connection(
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer]
2023-11-30 13:37:50,617 INFO: Finished validating templateapp (verify.py:198)
2023-11-30 13:37:50,617 INFO: Parsing pyemailtracker (verify.py:125)
2023-11-30 13:37:50,617 INFO: Fetching https://pypi.org/pypi/pyemailtracker/json (master.py:149)
2023-11-30 13:37:52,875 INFO: Finished validating pyemailtracker (verify.py:198)
2023-11-30 13:37:52,876 INFO: Parsing hitomi (verify.py:125)
2023-11-30 13:37:52,876 INFO: Fetching https://pypi.org/pypi/hitomi/json (master.py:149)
2023-11-30 13:37:54,113 INFO: Finished validating hitomi (verify.py:198)
2023-11-30 13:37:54,114 INFO: Parsing power-profiler (verify.py:125)
2023-11-30 13:37:54,114 INFO: Fetching https://pypi.org/pypi/power-profiler/json (master.py:149)
2023-11-30 13:37:54,737 INFO: Finished validating power-profiler (verify.py:198)
2023-11-30 13:37:54,738 INFO: Parsing requests-lb (verify.py:125)
2023-11-30 13:37:54,738 INFO: Fetching https://pypi.org/pypi/requests-lb/json (master.py:149)
2023-11-30 13:37:55,317 INFO: Finished validating requests-lb (verify.py:198)
2023-11-30 13:37:55,318 INFO: Parsing overlap (verify.py:125)
2023-11-30 13:37:55,318 INFO: Fetching https://pypi.org/pypi/overlap/json (master.py:149)
2023-11-30 13:37:55,403 ERROR: Error syncing package: overlap (verify.py:38)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
    await get_latest_json(master, json_full_path, executor, args.delete)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
    await master.url_fetch(url, new_json_path, executor)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
    async with self.session.get(url) as response:
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
    message, payload = await self._protocol.read()  # type: ignore
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
2023-11-30 13:37:59,042 INFO: Finished validating overlap (verify.py:198)

/SNIP/

2023-11-30 13:38:14,897 INFO: Fetching https://pypi.org/pypi/datafilter/json (master.py:149)
2023-11-30 13:38:15,031 ERROR: Error syncing package: datafilter (verify.py:38)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
    await get_latest_json(master, json_full_path, executor, args.delete)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
    await master.url_fetch(url, new_json_path, executor)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
    async with self.session.get(url) as response:
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
    message, payload = await self._protocol.read()  # type: ignore
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
2023-11-30 13:38:15,583 INFO: Finished validating datafilter (verify.py:198)
2023-11-30 13:38:15,583 INFO: Parsing monthly-returns-heatmap (verify.py:125)

/SNIP/

2023-11-30 13:38:28,729 ERROR: Error syncing package: setuptools-cython (verify.py:38)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 969, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)  # type: ignore  # noqa
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
    transport, protocol = await self._create_connection_transport(
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1080, in _create_connection_transport
    await waiter
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
    await get_latest_json(master, json_full_path, executor, args.delete)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
    await master.url_fetch(url, new_json_path, executor)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
    async with self.session.get(url) as response:
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request
    conn = await self._connector.connect(
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _create_connection
    _, proto = await self._create_proxy_connection(req, traces, timeout)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection
    transport, proto = await self._wrap_create_connection(
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer]
2023-11-30 13:38:28,921 INFO: Finished validating setuptools-cython (verify.py:198)
2023-11-30 13:38:28,921 INFO: Parsing oog (verify.py:125)
2023-11-30 13:38:28,922 INFO: Fetching https://pypi.org/pypi/oog/json (master.py:149)
2023-11-30 13:38:32,436 ERROR: Error syncing package: oog (verify.py:38)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 969, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)  # type: ignore  # noqa
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
    transport, protocol = await self._create_connection_transport(
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1080, in _create_connection_transport
    await waiter
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
    await get_latest_json(master, json_full_path, executor, args.delete)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
    await master.url_fetch(url, new_json_path, executor)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
    async with self.session.get(url) as response:
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request
    conn = await self._connector.connect(
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _create_connection
    _, proto = await self._create_proxy_connection(req, traces, timeout)
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection
    transport, proto = await self._wrap_create_connection(
  File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer]
2023-11-30 13:38:32,674 INFO: Finished validating oog (verify.py:198)
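To see which errors dominate a long verify log without reading every traceback, the exception class names can be tallied. This is a minimal sketch, assuming the log text has been saved to a file; `tally_exceptions` and the regular expression are illustrative names and are not part of bandersnatch.

```python
import re
from collections import Counter

# Matches an optionally dotted exception class name followed by a colon, e.g.
# "aiohttp.client_exceptions.ClientOSError:" or "ConnectionResetError:".
# The "ERROR:" log level and "Error syncing package" do not match, because the
# class name must start with a capital letter and end in Error/Exception.
EXC_RE = re.compile(r"\b((?:[A-Za-z_]\w*\.)*[A-Z]\w*(?:Error|Exception)):")


def tally_exceptions(log_text):
    """Count how often each exception class name appears in the log text."""
    return Counter(EXC_RE.findall(log_text))
```

For example, `tally_exceptions(open("verify.log").read()).most_common()` lists the most frequent exception first (`verify.log` is a placeholder path). Chained tracebacks contribute one count per exception in the chain.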

cooperlees commented 9 months ago

So the verify seems to be getting a lot of connection errors and timeouts. What kind of internet connection are you running bandersnatch on?

It would be nice to go slower and reduce these timeouts and errors, I think, before we worry about your consistency ...

Have you tried 2 or 1 workers to see if you get fewer timeouts?

workers = 2

Maybe the default timeout of 10 seconds isn't enough either? This all depends on the connection you're on, but it should be

timeout = 10

If you could try and sync with that + enable stop on error (as suggested above) and do a run I'd be interested to see what you hit. Please also run with --debug if you can. That might help show us something to work off. That would look something like:

bandersnatch --debug mirror

ktry commented 9 months ago

I started a bandersnatch mirror job last night at 8 pm. It finished at 1 pm today with a zero exit status. The mirror grew by 59.9G. There were 2314 Fetching metadata lines in the 15788 line logfile and no ERROR lines. There are 11142 Downloading lines in the log.

Our package listing increased by 974 to a total of 372688. The last-modified date is 20231205T03:43:56.

Our internet connection throttles down to 48 Mbps after initial bursts of 200+Mbps.

Since there were no errors or timeouts, why did it complete with only 372688 total packages present on our mirror?

Will --debug mirror be helpful when there are no timeouts or errors?

First Fetching: 2023-12-04 20:43:57,002 INFO: Fetching metadata for package: apimakesens-python (serial 13609424) (package.py:58)

First Downloading: 2023-12-04 20:44:08,869 INFO: Downloading: https://files.pythonhosted.org/packages/c8/85/959e0ff82501b637e6e1541d5c7600d0eb2b79986184955582a149fcfb5c/prettyPlot-0.0.10-py3-none-any.whl (mirror.py:875)

Last Fetching: 2023-12-05 12:55:28,507 INFO: Fetching metadata for package: zytlib (serial 13609589) (package.py:58)

Last Downloading (and last lines in logfile): https://files.pythonhosted.org/packages/8a/e0/f3ef24673dc17b52112bb9bc7384839b2ddc35e82ff18bd765ea53c54eff/zxkane.cdk-construct-simple-nat-0.2.628.tar.gz
2023-12-05 12:56:35,836 INFO: Storing index page(s): zxkane-cdk-construct-simple-nat - in /mirror/sites/PyPI/web/simple/zxkane-cdk-construct-simple-nat (mirror.py:698)
2023-12-05 12:57:18,028 INFO: Storing index page(s): zuul - in /mirror/sites/PyPI/web/simple/zuul (mirror.py:698)
2023-12-05 12:57:18,156 INFO: Generating global index page. (simple.py:260)
2023-12-05 13:01:03,482 INFO: New mirror serial: 13646864 (mirror.py:472)
2023-12-05 13:01:03,640 INFO: 1919 packages had changes (mirror.py:990)
2023-12-05 13:01:03,859 INFO: Writing diff file to mirrored-files (mirror.py:1000)

ktry commented 9 months ago

I've run bandersnatch mirror several times without error, but it only seems to fetch a few hundred projects on each run. I now have 374680 out of the 500,508 projects listed on pypi.org. I just started a new run and the todo file only had 7276 entries. Since I'm not getting errors or timeouts at this point, what can I do to address the consistency? Thanks!

cooperlees commented 9 months ago

Sadly, the only options now are very expensive. They are:

ktry commented 9 months ago

Thanks for that clarification! I'll do the force-check and if I start getting errors or timeouts, I'll start the debug process you outlined above.

ktry commented 8 months ago

One thing that I noticed is that web/simple/index.html is not updated as packages are synced with bandersnatch mirror --force-check. If bandersnatch doesn't finish gracefully, then web/simple/index.html could be out of sync.

Here are some statistics with bandersnatch mirror --force-check running for six days:

# grep -c href web/simple/index.html
375642
# find web/simple -maxdepth 1 -type d -newer web/simple/index.html | wc -l
271275
# awk '/ERROR/ {e++} /Fetching/ {f++} /Downloading/ {d++} /Storing/ {s++} END { printf("ERROR=%d, Fetching=%d, Storing=%d, Download=%d\n", e, f, s, d) }' bandersnatch.out
ERROR=2, Fetching=296984, Storing=287573, Download=1250239
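
The grep and find checks above can also be done from Python, which is handy if the numbers need to go into a report. This is a rough sketch, not bandersnatch code; `index_staleness` is a made-up name, and it assumes that every project lives in its own directory directly under web/simple (as with hash-index = false).

```python
import os
from pathlib import Path


def index_staleness(simple_dir):
    """Return (lines mentioning href in index.html, project dirs newer than it)."""
    simple = Path(simple_dir)
    index = simple / "index.html"

    # Same idea as `grep -c href`: count the lines that contain "href".
    with index.open(encoding="utf-8", errors="replace") as fh:
        href_lines = sum(1 for line in fh if "href" in line)

    # Same idea as `find -maxdepth 1 -type d -newer index.html`, except that
    # the top-level directory itself is never counted here.
    index_mtime = index.stat().st_mtime
    newer_dirs = 0
    with os.scandir(simple) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime > index_mtime:
                newer_dirs += 1
    return href_lines, newer_dirs
```

Calling `index_staleness("web/simple")` from the mirror's root directory returns both counts as a tuple. A large second number during a long run means index.html is lagging behind what has already been synced.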

I have high hopes that if bandersnatch finishes gracefully, that web/simple/index.html will have a lot more hrefs. And if not, I can write a tool to regenerate it.
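
Such a regeneration tool could be quite small. The sketch below is only an assumption about what would be good enough: it is not bandersnatch's own index generator or template, it covers only the HTML page, and it assumes every directory directly under web/simple is a project named by its normalized name. `build_simple_index` and `write_simple_index` are hypothetical names.

```python
import html
import os
import tempfile
from pathlib import Path


def build_simple_index(simple_dir):
    """Build an HTML index with one anchor per project directory."""
    with os.scandir(simple_dir) as entries:
        names = sorted(e.name for e in entries if e.is_dir())
    anchors = "\n".join(
        f'    <a href="{html.escape(n, quote=True)}/">{html.escape(n)}</a><br/>'
        for n in names
    )
    return (
        "<!DOCTYPE html>\n<html>\n  <head>\n    <title>Simple Index</title>\n"
        "  </head>\n  <body>\n" + anchors + "\n  </body>\n</html>\n"
    )


def write_simple_index(simple_dir):
    """Write index.html via a temporary file so readers never see a partial page."""
    simple = Path(simple_dir)
    page = build_simple_index(simple)
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=simple, prefix=".index.", delete=False
    ) as tmp:
        tmp.write(page)
    # Temporary files are created with mode 0600; make the page world-readable
    # (an assumption about how the mirror is served) before swapping it in.
    os.chmod(tmp.name, 0o644)
    os.replace(tmp.name, simple / "index.html")
```

Writing to a temporary file in the same directory and then calling `os.replace` keeps the swap atomic on the same filesystem, so clients reading the mirror during regeneration see either the old page or the new one.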

cooperlees commented 8 months ago

Yeah, sadly, index.html is generated at the end of the run. Since the mirror is getting so big these days, I'd happily take a PR to periodically write out the global index.html during a run ... But it would have to be enabled by a config var with the default off I feel.

ktry commented 8 months ago

The bandersnatch mirror --force-check just finished and things are looking pretty good. The todo file has 14812 entries after bandersnatch finished and web/simple/index.html has 501461 hrefs.

Here are the stats from the todo and logfile. I'll try doing a normal bandersnatch mirror to see if it picks up any more packages.

TODO=14812 ERROR=4, Fetching=501089, Storing=486277, Download=2059037

The final log entries are:

2023-12-25 18:18:05,162 INFO: Downloading: https://files.pythonhosted.org/packages/94/22/c2ad4e731c3795db8acca6ea4c03d969477a97f05d2dd12ef50de59571aa/zzq_string_sum-0.4.0.tar.gz (mirror.py:875)
2023-12-25 18:18:05,227 INFO: Storing index page(s): zzq-string-sum - in /mirror/sites/PyPI/web/simple/zzq-string-sum (mirror.py:698)
2023-12-25 18:18:05,317 INFO: Generating global index page. (simple.py:260)
2023-12-25 18:28:15,083 INFO: 486277 packages had changes (mirror.py:990)
2023-12-25 18:29:00,593 INFO: Writing diff file to mirrored-files (mirror.py:1000)

The two additional errors are filename too long errors:

2023-12-25 11:04:54,546 INFO: Downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:875)
2023-12-25 11:04:54,615 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:686)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 662, in sync_release_files
    downloaded_file = await self.download_file(
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 892, in download_file
    with self.storage_backend.rewrite(path, "wb") as f:
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
    with tempfile.NamedTemporaryFile(
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 541, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mirror/sites/PyPI/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.203asete'
2023-12-25 11:04:54,691 INFO: Downloading: https://files.pythonhosted.org/packages/57/79/21b676698665e561d5320dad7e6d94685b429ee0179671284a9cf3cd42c4/usearch-0.22.0-cp39-cp39-manylinux_2_28_x86_64.whl (mirror.py:875)
2023-12-25 11:04:54,701 INFO: Downloading: https://files.pythonhosted.org/packages/cc/50/82753aa766ef30414fce227894e0495ac93ee4f1f3f44a2c7e9c88c79c55/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593.tar.gz (mirror.py:875)
2023-12-25 11:04:54,778 ERROR: Error syncing package: uselesscapitalquiz@14521754 (mirror.py:377)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 130, in package_syncer
    await self.process_package(package)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 337, in process_package
    await self.sync_release_files(package)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 693, in sync_release_files
    raise deferred_exception  # raise the exception after trying all files
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 662, in sync_release_files
    downloaded_file = await self.download_file(
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 892, in download_file
    with self.storage_backend.rewrite(path, "wb") as f:
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
    with tempfile.NamedTemporaryFile(
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 541, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mirror/sites/PyPI/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.203asete'

cooperlees commented 8 months ago

Ahh, the long name problem. We've discussed it in https://github.com/pypa/bandersnatch/issues/1228 and I feel we should maybe soft error (report that we're skipping this package due to file system limitations, then skip it). But I also get that this is not explicit and evil. Maybe it should be a config option that the owner(s) of this bandersnatch instance can choose. As stated elsewhere, I'd accept this PR.

Ideally we need PyPI to not allow package names this long.
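
Until something like that lands, a mirror owner could pre-check release file names against the filesystem limit. This is a sketch under stated assumptions: the 10 character overhead is read off the temporary name in the traceback above (a leading dot plus a dot and eight random characters), 255 bytes is a common limit but not universal, and `name_max` and `too_long_for_mirror` are hypothetical helpers, not bandersnatch functions.

```python
import os
from urllib.parse import urlsplit

# Assumption taken from the traceback: the temporary file is named
# "." + filename + "." + 8 random characters, i.e. 10 extra characters.
TEMP_OVERHEAD = 10


def name_max(directory, default=255):
    """Ask the filesystem for its per-component name limit, with a fallback."""
    try:
        return os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, OSError, ValueError):
        # os.pathconf is Unix-only and may not know this name everywhere.
        return default


def too_long_for_mirror(url, limit=255, overhead=TEMP_OVERHEAD):
    """True if the file name in url, plus temp-file overhead, exceeds limit bytes."""
    filename = os.path.basename(urlsplit(url).path)
    return len(filename.encode("utf-8")) + overhead > limit
```

Passing `limit=name_max("/mirror/sites/PyPI/web/packages")` uses the real limit of the mirror's filesystem instead of the 255 byte default.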

ktry commented 8 months ago

Another run of bandersnatch mirror has some filename too long errors. So I added the blocklist_project plugin to filter out uselesscapitalquiz as described in comment-9 issue1100 and now bandersnatch mirror completed and there is no todo file. Here are the stats:

TODO=0 ERROR=0, Fetching=14800, Storing=0, Download=0
grep -c href web/simple/index.html
501469
Repo Size = 17.3T

That's pretty close to the 503,186 projects reported on pypi.org. I'm happy.
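
For anyone who wants to know which projects make up the remaining gap, the two index pages can be compared by name rather than by count. This is a rough sketch using only the standard library. It assumes the upstream page (for example https://pypi.org/simple/) has already been downloaded into a string, and `project_names` and `missing_projects` are illustrative names.

```python
from html.parser import HTMLParser


class _AnchorNames(HTMLParser):
    """Collect the last path component of every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Handles both "name/" and "/simple/name/" style links.
            self.names.add(href.strip("/").rsplit("/", 1)[-1])


def project_names(index_html):
    """Return the set of project names linked from a simple index page."""
    parser = _AnchorNames()
    parser.feed(index_html)
    return parser.names


def missing_projects(local_html, upstream_html):
    """Projects linked upstream that have no link in the local index."""
    return sorted(project_names(upstream_html) - project_names(local_html))
```

Some difference is expected even on a healthy mirror: blocklisted projects such as uselesscapitalquiz will show up as missing, as will anything published upstream since the last sync.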