Patch to add unicode support to trunk

giampaolo commented 10 years ago

From darren.w...@gmail.com on January 10, 2012 17:42:05

Hey guys - we've forked the code and ported the unicode support from the py3k 
branch over to it. I've rebased against trunk today and you can find our code 
here: https://github.com/iwebhosting/pyftpdlib/ Purpose of code changes on this 
branch: To add utf-8 support to trunk/python 2.x When reviewing my code 
changes, please focus on: We have one failing test that I'm not sure what to do 
with - test_oob_abor. 

One of these two tests may be incorrect. The Python 2 version sends 0xf4 0xff 
but the Python 3 version sends 0xc3 0xb4 0xc3 0xb4 - according to 
http://support.microsoft.com/kb/231866 it looks like it *should* be 0xff 0xf4 
(the *reverse* of the python 2 test) - which version should we attempt to get 
the code to pass?

Here's the 3k test: 
https://code.google.com/p/pyftpdlib/source/browse/branches/trunk-3k/test/test_ftpd.py#1563
 Also I've omitted test_wrong_encoding: 
https://code.google.com/p/pyftpdlib/source/browse/branches/trunk-3k/test/test_ftpd.py#2277
 It wasn't raising a UnicodeDecodeError because all of the bytes sent fit into 
utf-8. After the review, I'll merge this branch into: /trunk

Original issue: http://code.google.com/p/pyftpdlib/issues/detail?id=198

giampaolo commented 10 years ago

From g.rodola on January 10, 2012 09:44:02

Oh! This is interesting.
I took a quick look at what you've done.
Starting from py3k branch was a good idea but it shouldn't be imported into 
trunk as-is as it uses some statements which will work on python >= 2.6 only 
and we need to support python 2.4.
I mean this:

return b''.join(buffer)
self.set_terminator(b"\r\n")
...

Basically all the b'' occurrences must be replaced with ''.
Also, any other reference which refers to the python 3.x porting must be 
removed. Example:

str(line, 'utf8')

Other than that and your failing test issue, how's the status of this?
Have you tried it in production? 
With what encodings?

Basically, if you successfully manage to add UTF8 support in your branch and 
enrich the test suite to prove it works and doesn't break the current 
bytes-based implementation, the next step would consist in producing a patch 
from your branch and apply it to main pyftpdlib trunk.

giampaolo commented 10 years ago

From darren.w...@gmail.com on January 10, 2012 09:57:33

Yep, we're using it with a low volume of traffic and its working fine for us 
using utf-8. I wasnt aware of the >=python 2.4 requirement, I'll get that 
sorted and remove those shamelessly copied docstrings :)

There are a couple of tests (MKD and MSLT) but I'll put a few more in there now 
I'm more familiar with how the tests work - thanks for the feedback. We have 
ran through a suite of operations using Cyberduck and Filezilla though, and 
they're fine in those clients at least 
(upload/download/mkdir/list/rename/delete).

giampaolo commented 10 years ago

From darren.w...@gmail.com on January 11, 2012 02:50:42

Ok, I think thats covered what you asked for - tests pass (except for the 
oob_abor) on python 2.4-2.7 (tested on Ubuntu 2.6.38 x64). I've attached a 
patch against trunk - how does that look?

giampaolo commented 10 years ago

From darren.w...@gmail.com on January 11, 2012 02:55:12

Lets try that again :)

Attachment: unicode.patch

giampaolo commented 10 years ago

From g.rodola on January 11, 2012 13:56:54

I'm probably missing something but I extracted the tests from your patch (tests 
only, no changes to ftpserver.py) and they work fine with the current 
bytes-based version.
Tried with filezilla as a client, and I was able to delete:

127.0.0.1:41796 <== DELE ☃ci☃ao
127.0.0.1:41796 ==> 250 File removed.

I'm confused here. Wasn't that supposed to NOT work? =)
And yes, I admit I've never fully understood how to work with unicode in python.

giampaolo commented 10 years ago

From darren.w...@gmail.com on January 11, 2012 14:15:02

Heh, I'm not 100% sure either :)

When we forked the code before Christmas (October or so I think) it didnt work, 
maybe something has changed since then? We only rebased against trunk a few days ago.

giampaolo commented 10 years ago

From darren.w...@gmail.com on January 12, 2012 07:31:18

Ok, I see what's happened here. 
http://docs.python.org/howto/unicode.html#unicode-filenames listdir() will 
return unicode strings *if called* with a unicode string, which is what we were 
doing in our filesystem subclass. If we switch to byte strings, pyftpdlib 
doesn't care and the clients figure out the encoding on their own.

So it's up to you if you want to keep this patch - it will explicitly make sure 
the that we're dealing with bytes. May as well keep the tests in either case :) 
Sorry for the confusion!

giampaolo commented 10 years ago

From g.rodola on January 13, 2012 10:25:21

I had a chat with Josiah Carlson who kindly explained to me how this is 
supposed to work (chat log is in attachment).
Basically what we have to do is this:

- read data (bytes) from the socket and convert that to unicode (line = 
sock.read(1024); line = line.decode('utf8'))
- pass the unicode string the the underlying filesystem functions (e.g. ls = 
os.listdir(u'path')) which will return unicode strings rather than bytes
- before sending the result back to client, encode the unicode string (ls = 
ls.encode('utf8'); sock.sendall(ls))

This is a pretty huge change but I think it's worth the effort as it should 
make the python 3 porting a lot easier.
Being a major change, it's likely we're going to break existing code. As such, 
this should happen in a major release (0.8.0 maybe).

The only thing which is not clear to me is what to do in case the client sends 
a line with a malformed encoding (!= UTF8).
RFC-2640 is not clear about this.

Attachment: chat.txt

giampaolo commented 10 years ago

From g.rodola on January 14, 2012 06:18:38

I added your tests as r958 .

giampaolo commented 10 years ago

From g.rodola on January 14, 2012 09:13:21

Ok, preliminary patch in attachment (applied to r963 ) adds 2.x unicode support.
It seems to work and also passes all tests.
I fixed OOB test failure was with:

line = line.decode('utf8', errors='replace')

Not sure whether it is better to return an error response though.

Status: Started

Attachment: unicode.patch

giampaolo commented 10 years ago

From g.rodola on January 14, 2012 12:20:57

New patch (based on r967 ) including test changes I previously left out by accident.

Attachment: unicode2.patch

giampaolo commented 10 years ago

From darren.w...@gmail.com on January 22, 2012 09:37:28

Sorry for the big delay - busy work week and a new born at home is keeping me 
occupied :)

The patch looks great to me - I've ran it internally and it certainly behaves 
the same as well.

Is the 0.7 release imminent?

giampaolo commented 10 years ago

From g.rodola on January 22, 2012 11:08:51

Yep, I hope to release it next week.
After then, we can start working on this.
We'll first introduce unicode support for python 2.x, stabilize it as much as 
possible, and then start the transition towards python 3.x as I already did 
with psutil: https://code.google.com/p/psutil/issues/detail?id=75 That way both 
python 2.x and 3.x will share the same code base and, most important, the same API.

It's likely that the new version supporting unicode and python 3 will be marked 
as 1.0.0 since that's going to break existent code using a customized 
filesystem class.

Labels: Compatibility

giampaolo commented 10 years ago

From g.rodola on May 22, 2012 05:16:14

Issue 121 has been merged into this issue.

giampaolo commented 10 years ago

From g.rodola on May 22, 2012 05:33:32

Ladies and gents, we made it: we now have full unicode support (RFC-2640) as 
part of the process which ported pyftpdlib to python 3.
Attached is a patch including all the changes which were submitted against the 
py3-unicode branch and which are now merged back into trunk as r1046 .

The change was massive in that we now use unicode instead of bytes pretty much 
everywhere, so it's likely that existent code bases overriding the base 
authorizer or filesystem classes won't work.
I'll write down instructions on how to make the porting later on.

Status: FixedInSVN
Labels: Milestone-1.0.0

Attachment: py3-unicode.patch

giampaolo commented 10 years ago

From g.rodola on February 19, 2013 04:49:26

Releasing 1.0.0 just now. Closing.

Status: Fixed
Labels: Version-0.7.0

giampaolo / pyftpdlib

Patch to add unicode support to trunk #198