pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

Enhancement: Filter packages with very long versions #1228

Open craigatfetch opened 1 year ago

craigatfetch commented 1 year ago

There is currently a package, "uselesscapitalquiz", that has a very long version string. It is so long that syncing the package fails on Ubuntu 20.04 because the resulting package file name is too long for the OS, and the mirror fails.

Request: add a filtering option to ignore packages whose version strings are so long that the OS gives an error when syncing.
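A rough sketch of what such a length check could look like. The function names and the 10-byte overhead are assumptions for illustration, not part of bandersnatch; the overhead is taken from the temp file name in the traceback later in this thread, which adds a leading "." and a 9-character suffix to the release file name.

```python
import os


def lookup_name_max(directory: str, default: int = 255) -> int:
    """Ask the OS for the longest file name allowed in `directory`.

    Falls back to `default` where os.pathconf is unavailable or fails.
    """
    try:
        return os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, OSError, ValueError):
        return default


def exceeds_name_max(filename: str, name_max: int = 255, overhead: int = 10) -> bool:
    """Return True if `filename` plus temp-file overhead will not fit.

    `overhead` is a hypothetical allowance for the extra characters a
    temporary file name adds around the real one.
    """
    return len(os.fsencode(filename)) + overhead > name_max
```

A filter built on this would skip any release file for which exceeds_name_max() returns True, using the limit reported for the mirror directory.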

cooperlees commented 1 year ago

What exactly do you want here? I guess you want us to automatically catch the exception (an OSError, I guess) when this occurs and just print an error saying we're ignoring package X because its name/version is too long for the storage?

For now, to avoid the error, you could just add it to the blocklist: https://bandersnatch.readthedocs.io/en/latest/filtering_configuration.html#allowlist-blocklist-filtering-settings

craigatfetch commented 1 year ago

Sure, automatically catching and ignoring that error would work.
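For illustration, catching and ignoring that error might look roughly like the following. This is a standalone sketch, not bandersnatch's actual code: try_write is a hypothetical helper, and the only assumption about the failure is that it surfaces as an OSError whose errno is ENAMETOOLONG (the "[Errno 36] File name too long" seen on Linux).

```python
import errno
import logging
import os

logger = logging.getLogger(__name__)


def try_write(directory: str, filename: str, data: bytes) -> bool:
    """Write `data` to directory/filename.

    Returns False (and logs an error) if the OS rejects the name as too
    long; any other OSError is re-raised unchanged.
    """
    path = os.path.join(directory, filename)
    try:
        with open(path, "wb") as f:
            f.write(data)
    except OSError as exc:
        if exc.errno == errno.ENAMETOOLONG:
            # Truncate the name in the log line so the message stays readable.
            logger.error(
                "Ignoring %s...: file name too long for the storage", filename[:60]
            )
            return False
        raise
    return True
```

Checking the errno rather than catching every OSError keeps genuine storage problems, such as a full disk or a permissions error, visible.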

The underlying problem is that PyPI doesn’t stop people from creating packages with ridiculously long versions.

Thanks!

Craig

cooperlees commented 1 year ago

I would accept a PR doing this with appropriate unit testing showing the behavior.

Agreed, PyPI should be stricter there. Have you searched for or opened an issue there? i.e. https://github.com/pypi/warehouse/issues

craigatfetch commented 1 year ago

I’ll work on that PR.

And I’ll submit a PyPI issue if it doesn’t already exist.

Regards, Craig

forky2 commented 1 year ago

I've just run into this problem too. One big problem with it from a bandersnatch point of view is that a new mirror will never get past serial 0, and all subsequent executions of bandersnatch mirror will loop over the same set of todo packages to try (and fail) to reach the serial X that was current at the time of the first execution.

Of course, that's the correct behaviour because bandersnatch has failed to get one of the todo packages due to an unhandled exception and I don't think any of us expected this to happen:

2022-10-26 08:41:43,150 INFO: Downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:875)
2022-10-26 08:41:43,653 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:686)
Traceback (most recent call last):
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 662, in sync_release_files
    downloaded_file = await self.download_file(
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 892, in download_file
    with self.storage_backend.rewrite(path, "wb") as f:
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/bandersnatch/src/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
    with tempfile.NamedTemporaryFile(
  File "/usr/lib/python3.8/tempfile.py", line 679, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.8/tempfile.py", line 389, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mnt/mirrors/pypi/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.7lqjgi8o'
2022-10-26 08:41:43,658 ERROR: Error syncing package: uselesscapitalquiz@14521754 (mirror.py:377)
Traceback (most recent call last):
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 130, in package_syncer
    await self.process_package(package)
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 337, in process_package
    await self.sync_release_files(package)
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 693, in sync_release_files
    raise deferred_exception  # raise the exception after trying all files
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 662, in sync_release_files
    downloaded_file = await self.download_file(
  File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 892, in download_file
    with self.storage_backend.rewrite(path, "wb") as f:
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/bandersnatch/src/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
    with tempfile.NamedTemporaryFile(
  File "/usr/lib/python3.8/tempfile.py", line 679, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.8/tempfile.py", line 389, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mnt/mirrors/pypi/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.7lqjgi8o'

I'm not sure how best to deal with this:

cooperlees commented 1 year ago

Thanks for sharing your experience.

Here are my thoughts on each of your suggestions:

forky2 commented 1 year ago

I'm really not annoyed about occasionally having to waste some time fixing problems with this. The amount of time I've saved with this project is immense, and I really like the project.

My suggestion of a problem_packages file was not for the user to create an exception list (no, that would just be like the blocklist); rather I was suggesting that the tool could record the fact that the package had issues.
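To make the idea concrete, here is one way a tool could record problem packages as it goes. Everything here is hypothetical: the record_problem and read_problems helpers and the JSON-lines layout are only an illustration, not an existing bandersnatch feature.

```python
import json
from pathlib import Path


def record_problem(problem_file: Path, package: str, serial: int, error: str) -> None:
    """Append one JSON line describing a package that failed to sync."""
    entry = {"package": package, "serial": serial, "error": error}
    with problem_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def read_problems(problem_file: Path) -> list:
    """Return all recorded problems, oldest first (empty if no file yet)."""
    if not problem_file.exists():
        return []
    with problem_file.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

After a run, the file could be inspected directly instead of searching the full log for the packages that failed.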

I don't know how other people use bandersnatch, but I find tracking problems quite difficult. The logs are very verbose! If I run bandersnatch mirror in a screen terminal, then any errors that may have been encountered are long lost above the maximum scrollback. If I have an issue I've got to rerun it and pipe the logs somewhere and grep -v INFO to get rid of all the noise, and just hope that the issue appears.
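As a sketch of an alternative to rerunning and grepping, a program that logs through Python's standard logging module can attach a second handler that keeps only warnings and errors in their own file. The add_error_log helper and the file name are made up for this example, and it assumes the tool in question uses the logging module.

```python
import logging


def add_error_log(logger: logging.Logger, path: str) -> logging.Handler:
    """Attach a file handler that records only WARNING and above."""
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setLevel(logging.WARNING)  # INFO noise is filtered out here
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return handler
```

With this in place, the normal verbose output is unchanged, while a file such as errors.log holds just the lines that would otherwise scroll out of the terminal.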