jaraco / path

Object-oriented file system path manipulation
MIT License
1.1k stars 147 forks source link

Crash on walk with errors set to warn #73

Closed gazpachoking closed 9 years ago

gazpachoking commented 10 years ago

Seems there can be an issue creating the warning text when using the walk method in warn mode. I suspect this might only happen on python 2.

  File "/usr/lib64/python2.7/site-packages/flexget/plugins/filter/exists.py", line 44, in on_task_filter
    for p in folder.walk(errors='warn'):
  File "/usr/lib64/python2.7/site-packages/path.py", line 576, in walk
    TreeWalkWarning)
  File "/usr/lib64/python2.7/warnings.py", line 29, in _show_warning
    file.write(formatwarning(message, category, filename, lineno, line))
  File "/usr/lib64/python2.7/warnings.py", line 38, in formatwarning
    s =  "%s:%s: %s: %s\n" % (filename, lineno, category.__name__, message)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 46-60: ordinal not in range(128)
gazpachoking commented 10 years ago

Downstream bug report, for reference. http://flexget.com/ticket/2704

jaraco commented 10 years ago

I don't think there's anything path.py can do to address this issue. It's quite possible the issue is a limitation of the warnings module.

At the very least, we'll need an example how to recreate the error.

gazpachoking commented 10 years ago

It happens whenever path.py tries to warn about a file/directory that is incorrectly encoded, which essentially turns 'warn' mode into 'strict' mode, albeit with a more obtuse error. I'll do some experiments and see if I can dig deeper into exactly what the issue is.

gazpachoking commented 10 years ago

So, the issue is that the warning message path.py creates is a unicode object which includes the invalid directory name. When warnings tries to create a message (which on python 2 is a str) it tries to decode the unicode warning message, which is of course not possible with the ascii codec.

gazpachoking commented 10 years ago

I'm thinking to fix this on python 2, rather than using child in the warning message, we can use child.encode('unicode_escape') or something of that nature to give the warnings module the bytes it desires. This of course won't be that useful to the user, since the name was encoded wrong and we didn't know how to interpret it, but it at least won't crash.

gazpachoking commented 10 years ago

Or maybe there is a way to round trip bytes->unicode->bytes properly on python 2 with the replacement surrogateescape. I suspect the answer is in here somewhere. https://github.com/PythonCharmers/python-future/blob/master/future/utils/surrogateescape.py

gazpachoking commented 10 years ago

I think the nicest solution would be if we could have a (python 2) __str__ method, which encoded using the filesystem encoding, and supported turning the surrogates into the proper source bytes. That way warnings module, and any modules dealing with the actual path (like here) would work.

Hmm, now that I think about it, maybe we wouldn't even need to warn anymore, and it would just handle the invalid encoded paths.

gazpachoking commented 10 years ago

I did some testing with the encodefilename and decodefilename from future.utils.surrogateescape on python 2 and it seemed to work well (tested with utf-8 paths and filesystemencoding forced to ascii.). It would be super cool if path.py transparently encoded the filename back to bytes when calling the system functions whenever on python 2 and os.path.supports_unicode_filenames was false. If there is interest in this I can give a go at a pull request.

jaraco commented 10 years ago

What do you think of the latest patch, which allows the caller of path.walk to supply an arbitrary callable for the error handler? In that case, create an error handler that takes a unicode 'msg' and encode it suitably before passing it to warnings.warn.

gazpachoking commented 10 years ago

Here is my quick test to make paths with invalid encodings still work on python 2. I haven't given it too much testing, but at first glance it seems to work. I imagine you don't want to add any deps, so the stuff from future could be pulled in directly. Also, it would probably be nice to make _native_fs_string something public, probably with a better name (I suck at naming.)

https://github.com/gazpachoking/path.py/compare/py2_invalid_encoding

gazpachoking commented 10 years ago

What do you think of the latest patch, which allows the caller of path.walk to supply an arbitrary callable for the error handler? In that case, create an error handler that takes a unicode 'msg' and encode it suitably before passing it to warnings.warn.

Oh, didn't see this till I finished up my test branch. That would work fine for me too, in that case I'd probably just use logging in a custom error handler rather than warnings. I think it would be cooler though if it just supported invalid encodings like python 3 does (it does doesn't it? haven't tested, but I imagine surrogateescape handler can go back to original bytstring when encoded there), which is what my branch is doing.

gazpachoking commented 10 years ago

I'll do some testing on python 3 to see the behavior there, but it seems to me that if the 'warn' mode stays in, it's still a bug that it causes an exception on python 2, even if there is an alternative to handle the warning yourself with a custom handler.

gazpachoking commented 10 years ago

I haven't looked in to how to implement, but it would also be cool also to have some tests run with LC_ALL environment variable set to something like en_US.ascii to test that any warning/exceptions are done properly for invalid encoded paths.

gazpachoking commented 10 years ago

Yeah, it seems like on python 3 invalid encoded paths are handled, both with os.path functions and with path.py. It'd be nice to bring python 2 into parity when using path.py. The 'warn' mode (and other error modes) of walk would then become superfluous.

gazpachoking commented 10 years ago

I separated out the idea of improving invalid filename support on python 2 to #73, since that is really a separate idea than this bug.

gazpachoking commented 9 years ago

Just to clarify what the bug is here, on python 2, warnings.warn expects a byte string for the message.

jaraco commented 9 years ago

I think we're looking at a wontfix resolution on this, but I'm happy to consider another fix if appropriate.