Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):
https://bitbucket.org/pypa/setuptools/pull-request/23/99-fixes/diff
Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):
I'm sorry if I made you think I was hand-waving your issue away.
What happened was that, when refactoring this code to use SVN commands, I had gotten it into my head that the interface for setuptools.file_finders returned encoded strings. I was pretty sure I had tested this and had matching test cases.
So why would path names be stored as bytes objects at all? Good question. It seems pretty stupid, but that was the assumption I was working on. I wasn't going to change an interface.
Is this intentional or just an oversight? Returning bytes was intentional, but the intention was wrong. Was it intentional to break distutils? No.
Under that assumption: path.encode() is a bug and returning bytes and str in py3 and py2 respectively would have been the correct types for encoded strings.
I don't see any downside to the workaround, outside of some corner cases on certain platforms.
Answering Jason: It should work in Python 3, regardless of platform, as long as the paths contain no (de)composable code points. It may not work on platforms with certain Unicode normalization rules for decomposable code points. Worst case, we might have to add something like unicodedata.normalize('NFC', path).

Regarding Python 2 support: the regular expressions created by distutils.filelist.translate_pattern use "" string syntax and not u"" (relying on 2to3 to fix it up). So on Python 2, we will still have to transcode from UTF-8 to the default string encoding.

The purpose of fsencode was to encode a filename in a format for the filesystem. In a metaphysical sense, it was to make sure everything was in the expected internal encoding and not UTF-8 or even mismatched normalized Unicode. That is all my fault. We can probably remove the fsencode calls because that isn't what we want to do anyway.

Do I recommend another solution? In the short term, for SpotlightKid: no, I don't. In the long term, I think there will be some encoding issues under Python 2 and one Unicode issue which we'll have to account for in the final fix. Does this sound sane?
I can fix this, adjust the test cases, and add a test for this this evening (in a few hours).
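The normalization concern raised above can be illustrated with a small sketch. It assumes Python 3, and the file name is made up for the example. It only shows that a composed and a decomposed spelling of the same name compare unequal until both are brought to one normal form; it says nothing about what a given filesystem or svn actually returns.

```python
import unicodedata

# Two spellings of the same visible file name (hypothetical example name).
composed = "caf\u00e9.txt"      # single code point U+00E9
decomposed = "cafe\u0301.txt"   # "e" followed by combining acute U+0301

# Plain string comparison treats them as different paths.
same_before = composed == decomposed

# After normalizing to NFC, the decomposed form matches the composed one.
same_after = unicodedata.normalize("NFC", decomposed) == composed
```

If a finder compared svn output against names read from disk, normalizing both sides to the same form before comparing would avoid this kind of mismatch.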
Original comment by SpotlightKid (Bitbucket: SpotlightKid, GitHub: SpotlightKid):
fsencode spits out bytes with unicode input on Python 3, yes. The problem is that the methods that parse the manifest match these bytes against regular expressions compiled from strings, which causes the exception. Python 2 is not the problem.
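A minimal sketch of that failure mode, assuming Python 3. The pattern below is invented for illustration and is not the one distutils builds from MANIFEST.in; the point is only that a str pattern cannot be applied to a bytes path.

```python
import re

# A str pattern standing in for the ones built from manifest templates.
pattern = re.compile(r".*\.py\Z")


def matches(path):
    """Return True/False for a match, or the error text on a type mismatch."""
    try:
        return pattern.search(path) is not None
    except TypeError as exc:
        # Python 3 refuses to mix str patterns with bytes subjects.
        return "TypeError: %s" % exc


text_result = matches("setup.py")     # str path against str pattern
bytes_result = matches(b"setup.py")   # bytes path against str pattern
```

Here text_result is a normal boolean, while bytes_result carries the TypeError text, which mirrors what happens when a bytes entry in the file list reaches the pattern matching code.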
My analysis of the code path is that svn_finder feeds Python 3 strings to fsencode and these result from parsing the output of svn info -R --xml.
My question is, why should path names be stored as bytes objects in Python 3 at all? Is this intentional or just an oversight?
Apart from that, the code in fsencode seems fishy to me:
def fsencode(path):
    "Path must be unicode or in file system encoding already"
    encoding = sys.getfilesystemencoding()
    if isinstance(path, unicode):
        path = path.encode()
    elif not isinstance(path, bytes):
        raise TypeError('%s is not a string or byte type'
                        % type(path).__name__)
    ...
It treats Python 3 strings and Python 2 unicode objects the same (compat sets unicode = str) and calls encode on them, but this yields bytes in one case and str in the other. Also, calling encode without an argument uses the current default string encoding (sys.getdefaultencoding()), which may not be the same as the current file system encoding (sys.getfilesystemencoding()).
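As a rough sketch of the second point, here is a hypothetical variant that passes the file system encoding explicitly instead of relying on the default. It assumes Python 3 only, the name fsencode_sketch is invented for this example, and it is not the fix that was eventually applied. Python 3 also ships os.fsencode in the standard library, which does a similar job.

```python
import sys


def fsencode_sketch(path):
    """Hypothetical helper: encode text paths with the file system encoding."""
    if isinstance(path, str):
        # Pass the encoding explicitly rather than calling encode() bare.
        return path.encode(sys.getfilesystemencoding())
    if isinstance(path, bytes):
        # Assumed to be in the file system encoding already.
        return path
    raise TypeError('%s is not a string or byte type'
                    % type(path).__name__)


encoded = fsencode_sketch("docs/readme.txt")
unchanged = fsencode_sketch(b"docs/readme.txt")
```

Note that this still returns bytes, so it addresses only the encoding mismatch, not the str versus bytes problem with the manifest patterns.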
Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):
OK, it looks like the stuff parsed from the old per-directory .svn is indeed fed in directly. Python 3 should just do the right thing, except possibly with filesystems that have a different normal form, like Macs. But this might only be an issue when parsing the XML output from the svn commands.

Python 2 might still pose a problem. FileList, I think, contains strings, not unicode. So we would possibly have to worry about encoding, since coming from SVN we would have UTF-8 strings. The regular expressions should accept these, but could turn up various file name mismatches. I suppose I ought to think about a test case for this.

So fsencode as I have done it, without context, is probably the wrong thing to do at some point, but I would guess not doing anything isn't right either.
Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):
fsencode/decode work as anticipated that is
The old svn and the branched-off plugin had to worry about encoding and decoding at some point (I would probably have to change tests in any case, because they do test for this), so I'll go refresh my memory. If that is the only place where it is being used, then I would wonder where we are not encoding. Some of the filesystem functions, depending on how they are called, will return byte strings. So I'll go double-check.
Since this is an egg_info issue, it may be the case that the iteration entry point is supposed to be encoded, but the egg_info output decoded.
Originally reported by: SpotlightKid (Bitbucket: SpotlightKid, GitHub: SpotlightKid)
I have a project (python-rtmidi), which uses setuptools in the setup.py file and I build in a SVN checkout. I have a MANIFEST.in template to include examples and generated files in the source distribution. python setup.py sdist fails at the egg_info command step with this exception:

After quite some time spent tracking this down, I noticed that the file list generated by setuptools.commands.egg_info.find_sources() contains paths as bytes objects, which makes the exclude_pattern method (see traceback) choke. These entries in the filelist are produced by the setuptools.svn_utils.svn_finder function, which calls fsencode on every path it yields. fsencode encodes the unicode (Python 2) resp. str (Python 3) path entries with the default encoding, turning them into str (Python 2) resp. bytes (Python 3) objects. I'm not exactly sure why it does that, but the result is clearly wrong for Python 3 and causes the above error.

Changing:
to

on line 426 (and yield fsencode(sub_path) to yield sub_path a few lines below) in setuptools.svn_utils.svn_finder fixed the issue for me. This probably breaks things on other systems though, so fsencode/fsdecode should probably be fixed somehow.
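For illustration, the idea behind this workaround (yield text paths instead of encoded ones) can be sketched as below. This assumes Python 3, the generator name as_text_paths is hypothetical, and it is not the code in setuptools.svn_utils. It uses os.fsdecode from the standard library, which accepts either str or bytes and returns str.

```python
import os


def as_text_paths(paths):
    """Hypothetical helper: yield every path as str, decoding bytes if needed."""
    for path in paths:
        # os.fsdecode leaves str untouched and decodes bytes using the
        # file system encoding.
        yield os.fsdecode(path)


# Mixed input, as a finder might produce if some code paths return bytes.
text_paths = list(as_text_paths([b"examples/demo.py", "src/module.c"]))
```

With every entry as str, the file list can be matched against the str patterns built from MANIFEST.in on Python 3.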