pypa / bandersnatch

A PyPI mirror client according to PEP 381 http://www.python.org/dev/peps/pep-0381/
Academic Free License v3.0

Exception with UTF-16 encoded requirements.txt file #1386

Closed francescocaponio closed 1 year ago

francescocaponio commented 1 year ago

I get the following exception:

2023-02-25 22:29:51,784 INFO: considering /requirements/*********-requirements.txt (allowlist_name.py:114)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/main.py", line 231, in <module>
    exit(main())
         ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/main.py", line 227, in main
    return asyncio.run(async_main(args, config))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/main.py", line 192, in async_main
    return await bandersnatch.mirror.mirror(config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/mirror.py", line 963, in mirror
    mirror = BandersnatchMirror(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/mirror.py", line 203, in __init__
    super().__init__(master=master, workers=workers)
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/mirror.py", line 42, in __init__
    self.filters = LoadedFilters(load_all=True)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/filter.py", line 151, in __init__
    self._load_filters(self.ENTRYPOINT_GROUPS)
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/filter.py", line 180, in _load_filters
    plugin_instance = plugin_class()
                      ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bandersnatch/filter.py", line 55, in __init__
    self.initialize_plugin()
  File "/usr/local/lib/python3.11/site-packages/bandersnatch_filter_plugins/allowlist_name.py", line 30, in initialize_plugin
    self.allowlist_package_names = self._determine_unfiltered_package_names()
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bandersnatch_filter_plugins/allowlist_name.py", line 157, in _determine_unfiltered_package_names
    filtered_requirements |= _parse_package_lines(req_fh.readlines())
                                                  ^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Opening the file in several text editors showed normal text content; the problem only became visible when I opened it in a hex editor: [image: hex editor view of the file]

The file was created on an Ubuntu machine with the pip freeze command, the same one that usually generates our other requirements files in UTF-8. I don't really know why it was generated in UTF-16 that time.

Edit: I remembered it as Ubuntu, but after some searching I found that this error happens with pip freeze > requirements.txt in PowerShell, so the commit must have been made on Windows.
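For reference, the failure can be reproduced without PowerShell. The sketch below assumes the redirect produced UTF-16 little-endian with a byte order mark (FF FE), which is the default for the `>` operator in older Windows PowerShell versions and matches the byte 0xff at position 0 in the traceback. The package line is a made-up placeholder, not the content of the real file.

```python
import codecs

# Hypothetical file content; the real requirements file is not shown in the report.
text = "requests==2.28.2\n"

# Assumption: the shell wrote UTF-16 little-endian preceded by a BOM (FF FE).
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Reading the bytes as UTF-8 fails on the very first byte of the BOM.
try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the matching codec works; "utf-16" consumes the BOM.
decoded = data.decode("utf-16")
```

Here `error` holds the same kind of `UnicodeDecodeError` as in the traceback, while `decoded` is the original text again.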

Would it be possible to at least not stop processing, either by skipping this file or by trying a different encoding?
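A minimal sketch of both options, assuming the file is read as bytes first: look for a byte order mark, decode accordingly, and return `None` so the caller can skip the file when nothing works. The names `BOMS` and `read_requirements_text` are hypothetical and are not part of bandersnatch.

```python
import codecs
from typing import Optional

# The UTF-32 marks are checked before the UTF-16 ones because the UTF-32 LE
# BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def read_requirements_text(data: bytes) -> Optional[str]:
    """Decode requirements file bytes, or return None if the file should be skipped."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            try:
                # Strip the BOM and decode the rest with the matching codec.
                return data[len(bom):].decode(encoding)
            except UnicodeDecodeError:
                return None
    try:
        # No BOM found: fall back to UTF-8, the documented default.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

A caller could log a warning and continue with the next requirements file whenever `None` comes back, so a single badly encoded file does not abort the whole sync.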

My fear is that, since it runs unattended in the background, a file like this could stop the package sync process and create a mess for the CI/CD pipelines that rely on this local package cache.

Edit: I will make a PR that tries to fix this, or at least skips the problematic file. I opened the issue mostly to understand why nobody had hit this encoding before, and then I discovered it is common when using the PowerShell redirect operator.

cooperlees commented 1 year ago

Hi there,

So I looked up what the standard is, and the pip docs state the following:

Requirements files are utf-8 encoding by default and also support PEP 263 style comments to change the encoding (i.e. # -*- coding: <encoding name> -*-).
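As an illustration of the PEP 263 part, the sketch below looks for a coding comment on the first two lines, using the pattern given in PEP 263, and falls back to UTF-8 otherwise. The function name `decode_with_coding_comment` is hypothetical; this is neither pip's nor bandersnatch's actual code, and an unknown encoding name would raise `LookupError`.

```python
import re

# Pattern from PEP 263, applied to bytes: a comment line containing
# "coding:" or "coding=" followed by the encoding name.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def decode_with_coding_comment(data: bytes) -> str:
    """Decode bytes using a PEP 263 style comment if present, else UTF-8."""
    # PEP 263 only considers the first or second line of the file.
    for line in data.split(b"\n")[:2]:
        match = CODING_RE.match(line)
        if match:
            encoding = match.group(1).decode("ascii")
            return data.decode(encoding)
    return data.decode("utf-8")
```

For example, a file whose first line is `# -*- coding: latin-1 -*-` would be decoded as Latin-1, while a file without such a comment is treated as UTF-8.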

I feel we should expand to support the full PEP if we want to be complete. There are libraries to help with this; I would recommend something like charset-normalizer.
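A rough sketch of what the library route could look like, assuming charset-normalizer is installed and using its `from_bytes(...).best()` entry point. The wrapper `detect_and_decode` is a hypothetical name; when the library is missing, the sketch only tries UTF-8.

```python
from typing import Optional

try:
    from charset_normalizer import from_bytes
except ImportError:
    # Library not installed: the sketch then falls back to plain UTF-8.
    from_bytes = None


def detect_and_decode(data: bytes) -> Optional[str]:
    """Guess the encoding of data and return the text, or None if undecodable."""
    if from_bytes is not None:
        best = from_bytes(data).best()
        if best is not None:
            # str() on the best match yields the decoded text.
            return str(best)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

Detection is a heuristic, so an explicit BOM or coding comment is still the more reliable signal when one is present.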

cooperlees commented 1 year ago

A fix has been merged, and I will try to release it in the next few days.

I would like to see if we can remove the added encodeing.py file and use a library we can keep up to date.

If not, just state why here, but charset-normalizer feels like a maintained version of this. pip tries not to use dependencies (or vendors them), as it is the tool that installs the dependencies in most cases.

francescocaponio commented 1 year ago

I understand. I was also unsure how to proceed; I feel like I am stepping into someone else's home by submitting PRs. That's why I asked for your confirmation over there.

I wanted to find a solution that makes it more reliable, but I am not sure whether this is "future safe".