egg_info comand fails parsing manifest template with Python 3 and SVN working dir

ghost commented 10 years ago

Originally reported by: SpotlightKid (Bitbucket: SpotlightKid, GitHub: SpotlightKid)

I have a project (python-rtmidi), which uses setuptools in the setup.py file and I build in a SVN checkout. I have a MANIFEST.in template to include examples and gererated files in the source distribution. python setup.py sdist fails at the egg_info command step with this exception:

reading manifest template 'MANIFEST.in'
Traceback (most recent call last):
  File "setup.py", line 151, in <module>
    **setup_opts
  File "/usr/lib64/python3.3/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib64/python3.3/distutils/dist.py", line 929, in run_commands
    self.run_command(cmd)
  File "/usr/lib64/python3.3/distutils/dist.py", line 948, in run_command
    cmd_obj.run()
  File "/home/chris/lib/virtualenvs/rtmidi33/lib/python3.3/site-packages/setuptools/command/sdist.py", line 86, in run
    self.run_command('egg_info')
  File "/usr/lib64/python3.3/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib64/python3.3/distutils/dist.py", line 948, in run_command
    cmd_obj.run()
  File "/home/chris/lib/virtualenvs/rtmidi33/lib/python3.3/site-packages/setuptools/command/egg_info.py", line 188, in run
    self.find_sources()
  File "/home/chris/lib/virtualenvs/rtmidi33/lib/python3.3/site-packages/setuptools/command/egg_info.py", line 231, in find_sources
    mm.run()
  File "/home/chris/lib/virtualenvs/rtmidi33/lib/python3.3/site-packages/setuptools/command/egg_info.py", line 301, in run
    self.read_template() 
  File "/usr/lib64/python3.3/distutils/command/sdist.py", line 308, in read_template
    self.filelist.process_template_line(line)
  File "/usr/lib64/python3.3/distutils/filelist.py", line 124, in process_template_line
    if not self.exclude_pattern(pattern, anchor=1):
  File "/usr/lib64/python3.3/distutils/filelist.py", line 236, in exclude_pattern
    if pattern_re.search(self.files[i]):
TypeError: can't use a string pattern on a bytes-like object

After quite some time spent tracking this down, I noticed that the file list generated by setuptools.commands.egg_info.find_sources() contains paths as bytes objects, which makes the exclude_pattern method (see traceback) choke. These entries in the filelist are produced by the setuptools.svn_utils.svn_finder function, which calls fsencode on every path it yields. fsencode encodes the unicode (Python 2) ressp. str (Python 3) path entries with the default encoding, turning them into str (Python 2) resp. bytes (Python 3) objects. I'm not exactly sure why it does that, but the result is clearly wrong for Python 3 and cuases the above error.

Changing:

    yield fsencode(path)

to

    yield path

on line 426 (and yield fsencode(sub_path) to yield sub_path a few lines below in setuptools.svn_utils.svn_finder fixed the issue for me. Probably breaks things on other systems though, so probably fsencode/fsdecode should be fixed somehow.

Bitbucket: https://bitbucket.org/pypa/setuptools/issue/99

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):

Update changelog. Fixes #99.

ghost commented 10 years ago

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):

https://bitbucket.org/pypa/setuptools/pull-request/23/99-fixes/diff

ghost commented 10 years ago

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):

I'm sorry, if I made you think I was handle waving your issue away.

What had happened was when refactoring this code to use SVN commands I had gotten it in my head that the interface for setuptools.file_finders returned encoded strings. I was pretty sure I had tested this and had matching test cases.

So why would path names be store as byte objects at all? Good question, it seems pretty stupid but that was the assumption I was working on. I wasn't going to change an interface.

Is this intentional or just an oversight? Returning bytes was intentional, but the intention was wrong. Was it intentional to break distutils? No

Under that assumption: path.encode() is a bug and returning bytes and str in py3 and py2 respectively would have been the correct types for encoded strings.

I don't see any down side to the workaround outside of some corner cases on certain platforms.

Answering Jason: It should work in python 3 if one does not have any (de)composable code points regardless of platform. Possibly not on platforms with certain unicode normalization rules for decomposable code points. Worse case is we might have to add something like unicodedata.normalize('NFC', path). Regarding python 2 support, Those regular expressions created by distutil.filelist.translate_pattern use "" string syntax and not u"" (relying on 2to3 to fix it up). So on python 2, we will still have to transcode from utf-8 to the default string encoding. The purpose of fsencode was to encode a filename in a format for the filesystem. In a metaphysical scene, it was to make sure everything was in the expected internal encoding and not utf8 or even mismatched normalized unicode. That is all my fault. We can probably remove them because that isn't what we want to do anyway. Do I recommend another solution? In the short-term for SpotlightKid, No. I don't. In the long-term I think there will be some encoding issues under python 2 and one unicode issue which we'll have to account for in the final fix. Does this sound sane?

I can fix this and adjust the test cases, and add a test for this this evening. (In a few hours).

ghost commented 10 years ago

Original comment by SpotlightKid (Bitbucket: SpotlightKid, GitHub: SpotlightKid):

fsencode spits out bytes with unicode input on Python 3, yes. The problem is, that the regular expressions in the methods to parse the manifest compare these to regexes, which are strings, which causes the exception. Python 2 is not the problem.

My analysis of the code path is that svn_finder feeds Python 3 strings to fsencode and these result from parsing the output of svn info -R --xml.

My question is, why should path names be stored as bytes objects in Python 3 at all? Is this intentional or just an oversight?

Apart from that that code in fsencode seems fishy to me:

def fsencode(path):
    "Path must be unicode or in file system encoding already"
    encoding = sys.getfilesystemencoding()

    if isinstance(path, unicode):
        path = path.encode()
    elif not isinstance(path, bytes):
        raise TypeError('%s is not a string or byte type'
                        % type(path).__name__)

     ...

It treats Python 3 strings and Python 2 unicode objects the same (compat sets unicode = str) and calls encode on them, but this yields bytes in one case and str in the other. Also, calling encode without argument, uses the current default string encoding (sys.getdefaultencoding()), which may not be the same as the current file system encoding (sys.getfilesystemencoding()).

ghost commented 10 years ago

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):

Ok looks like the stuff parsed from the old per directory .svn is indeed directly fed in. Python3 should just do the right thing. Except possibly with filesystems that have a different normal form like macs. But this might only be an issue when parsing the xml output from the svn commands. Python 2 might still pose a problem. FileList I think contain strings. not unicode. So then we would have to worry possibly about encoding since coming from SVN we would have utf-8 strings. The regular expressions should accept these, but could turn up some various file name mismatches.. I suppose I ought to think about a test case for this. So fsencode as I have done without context is probably the wrong thing to do at some point, but I would guess not doing anything isn't right either.

ghost commented 10 years ago

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):

fsencode/decode work as anticipated that is

fsencode => unicode ==> file system encoded bytes
fsdecode => filesystem encoded bytes ==> unicode

The old svn and the branched off plugin had to worry about encoding and decoding at some point. (would probably have to change tests in anycase because they do test for this) so I'll go refresh my memory. If that is the only place where it is being used, then I would wonder where are we not encoding. some of the filesystem functions depending how they are called will return byte strings. So I'll go double check.

Since this is an egg_info issue it my be the case that the iteration entry point is suppose to be encoded, but egg_info output decoded.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):

@philip_thiem, Can you comment on the proposed workaround? What is the purpose of performing fsencode? Can you anticipate the downsides to it? Can you recommend another solution?

pypa / setuptools

egg_info comand fails parsing manifest template with Python 3 and SVN working dir #99