Test for unicode path - Githubissues

stsewd commented 6 years ago

Test to expose fix for #27

This test was a little hard to figure out, since I wasn't able to use the name in the py file, so I had to add an actual file (I took the file from https://github.com/rtfd/readthedocs.org/issues/3732#issuecomment-370650285) and I had to make it py2 and py3 compatible. Please let me know if there is a better way.

stsewd commented 6 years ago

I forgot to make it py3 compatible

stsewd commented 6 years ago

> You can also look at some of the tests that probably already mock out the find_one calls. You shouldn't need to depend on a local repo file for the test, though this isn't a huge problem.

I tried to use apply_fs to reproduce the bug, but I wasn't able to, even when copying the file name from this test or manually creating a file with the same name (copied from the one that fails).

I think this is probably due to a bad encoding or a corrupted file from the user. Even on my OS the file isn't shown correctly.

[image: bad_encode]

agjohnson commented 6 years ago

So you probably aren't able to reproduce this easily as your encoding is not ascii. Here is what the servers give us back:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

That is bad, and in fact, I'm not sure why that's happening, as our locale is set:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
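
For comparison, the encodings Python consults can be printed side by side. This is only a sketch, not a diagnosis of these servers: on Python 2, sys.getdefaultencoding() normally reports 'ascii' whatever the locale says, while the locale is reflected in sys.getfilesystemencoding() and locale.getpreferredencoding(). The helper name report_encodings is made up for this example.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python uses in different contexts."""
    return {
        # Implicit str <-> unicode conversions
        # (normally 'ascii' on Python 2, 'utf-8' on Python 3).
        "default": sys.getdefaultencoding(),
        # Used to convert file names between text and bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Derived from the locale settings (LANG, LC_ALL, ...).
        "preferred": locale.getpreferredencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(report_encodings().items()):
        print("{}: {}".format(name, value))
```

If "default" is 'ascii' while the other two are UTF-8, the locale is being picked up and the 'ascii' value is just Python 2's built-in default.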

What you're testing might not be the right thing to test. Also, you are having trouble creating a file or path with this codepoint because your encoding is likely 'utf-8' already. With the encoding ascii, I have no problem doing this with py.path and apply_fs. If you try the following on a box with default utf8 encoding, you'll get a file with a codepoint that isn't 0xf0:

In [1]: import os

In [2]: os.mkdir('/tmp/foo/')

In [3]: os.mkdir('/tmp/foo/f\xf0\xf0')

In [4]: ls /tmp/foo
f??/

In [5]: os.listdir('/tmp/foo')
Out[5]: ['f\xf0\xf0']
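
On Python 3 the same raw bytes can be put into a name by passing a bytes path, which sidesteps the text encoding entirely. The following is a rough sketch under a few assumptions: a POSIX file system that accepts arbitrary bytes in names, and a temporary directory instead of /tmp/foo. The names RAW_NAME and make_raw_dir are invented for the example.

```python
import os
import shutil
import tempfile

# Two bare 0xf0 bytes: not valid UTF-8 on their own.
RAW_NAME = b"f\xf0\xf0"

# The text 'f\xf0\xf0' (two U+00F0 characters) encodes to different bytes,
# which is why a UTF-8 box ends up with a name that does not contain 0xf0.
TEXT_NAME_AS_UTF8 = "f\xf0\xf0".encode("utf-8")


def make_raw_dir(raw_name=RAW_NAME):
    """Create a directory named with raw bytes and list its parent.

    Returns (bytes_listing, text_listing), or None if the file system
    refuses the name.
    """
    base = tempfile.mkdtemp()
    try:
        try:
            os.mkdir(os.path.join(os.fsencode(base), raw_name))
        except OSError:
            return None
        # A bytes argument makes os.listdir return undecoded bytes names.
        as_bytes = os.listdir(os.fsencode(base))
        # A str argument makes it decode names with the filesystem encoding.
        as_text = os.listdir(base)
        return as_bytes, as_text
    finally:
        shutil.rmtree(base, ignore_errors=True)
```

Listing with a bytes path gives back the exact bytes on disk, so it is the safer way to check what name was actually created.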

To make this more confusing, however, even with this I'm not able to reproduce the problem with find_all:

In [18]: path
Out[18]: local('/tmp/foo')

In [19]: path.listdir()
Out[19]: [local('/tmp/foo/f\xf0\xf0'), local('/tmp/foo/bo\xf0')]

In [20]: list(find_all('/tmp/foo', ['readthedocs.yml']))
Out[20]: []
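
For reference, one way to walk a tree containing such names without ever decoding them is to keep every path as bytes. This is a hypothetical Python 3 helper written for illustration; find_named is not readthedocs-build's find_all and makes no claim about how that function works.

```python
import os


def find_named(root, names):
    """Yield bytes paths under root whose base name is in names.

    Hypothetical helper for illustration only.
    """
    wanted = {os.fsencode(name) for name in names}
    # Walking a bytes root keeps every path as raw bytes, so names that
    # are not valid in the current encoding never need to be decoded.
    for dirpath, _dirnames, filenames in os.walk(os.fsencode(root)):
        for filename in filenames:
            if filename in wanted:
                yield os.path.join(dirpath, filename)
```

Calling list(find_named('/tmp/foo', ['readthedocs.yml'])) returns an empty list when no file with that name exists under the root.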

So, I'm pretty confused at this point.

stsewd commented 6 years ago

> So you probably aren't able to reproduce this easily as your encoding is not ascii. Here is what the servers give us back:

I got the same on my local instance:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

I asked the user for more information (https://github.com/rtfd/readthedocs.org/issues/3732#issuecomment-371110187), and I think that file was probably generated with some weird encoding that only Windows understands (I have seen some Windows files show up with an invalid encoding on my machine a couple of times). Also, now that I remember, a couple of weeks ago I was helping a German friend set up his rtd instance. He was using Windows and faced some similar encoding problems (but in another part of the build).

ericholscher commented 6 years ago

The fix for python encoding being crazy is "use python3". I've tried to fix the server encodings a million times, and it doesn't work. We just need to move to an all UTF-8 world.

ericholscher commented 6 years ago

Looks like a good test to have, so going to merge this.

readthedocs / readthedocs-build

Test for unicode path #45