pombreda / feedparser

Automatically exported from code.google.com/p/feedparser

Porting to Python 3? #215

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hello, I've been porting feedparser to Python 3.

It passes most of the tests, if BeautifulSoup is turned off.
(BeautifulSoup is another module to be properly ported.)

The source code is here: http://bitbucket.org/puzzlet/feedparser-py3/src

Is there any plan to port feedparser into Python 3?

Thank you.

Original issue reported on code.google.com by puzz...@gmail.com on 30 May 2010 at 11:57

GoogleCodeExporter commented 9 years ago
I think this is a great idea. However, we've only just discussed the idea of 
giving up backwards compatibility with versions older than 2.4:
http://groups.google.com/group/feedparser-dev/browse_thread/thread/f25ea27c41a0b196

How backwards compatible are your changes? Which older versions of Python are 
still supported?

Original comment by adewale on 6 Jun 2010 at 9:49

GoogleCodeExporter commented 9 years ago
Like most Python 3 code, my branch of feedparser isn't compatible with
Python 2.x, mainly because of the change in how strings are represented.

Long story short, str and unicode are renamed to bytes and str, respectively.
You no longer use u"" for unicode; instead you use b"" for bytes. "" still
represents str, but it is now a sequence of characters rather than a stream of
bytes (like unicode was in Python 2).
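
A minimal sketch of the Python 3 behaviour described above. The sample string and variable names are illustrative only and are not taken from the feedparser branch:

```python
# Python 3 semantics: text (str) and binary data (bytes) are distinct types.
text = "caf\u00e9"            # str: a sequence of Unicode characters
data = text.encode("utf-8")   # bytes: the UTF-8 encoded form of that text

print(type(text).__name__, len(text))   # four characters
print(type(data).__name__, len(data))   # five bytes, since the accented e needs two

# Converting between the two always goes through an explicit encoding.
round_trip = data.decode("utf-8")
```

The explicit encode/decode step is the part that has no direct equivalent in typical Python 2 code, where str and unicode are often mixed implicitly.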

The other grammar and library changes are small enough that the code should at
least run on 2.4 if I add a few lines to handle older versions of Python. The
string change is the major problem.

Original comment by puzz...@gmail.com on 7 Jun 2010 at 4:51

GoogleCodeExporter commented 9 years ago
FYI, the Python 3 version of chardet has a separate directory in the repository.[1]
BeautifulSoup[2] and NumPy[3] have their own scripts for use with Python 3.

[1] http://code.google.com/p/chardet/source/browse/#hg/src-python3
[2] http://code.google.com/p/beautifulsoup/source/browse/branches/bs4/to3.sh
[3] http://projects.scipy.org/numpy/changeset?new=7883%40trunk/setup.py&old=7828%40trunk/setup.py#file0

Original comment by puzz...@gmail.com on 7 Jun 2010 at 3:31

GoogleCodeExporter commented 9 years ago
I'll happily accept your patches if you can make them work with Python 2.4 or 
even 2.5. It would be even better if you can find a clean abstraction for the 
differences between versions.
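
One possible shape for such an abstraction is a small compatibility shim that hides the str/bytes split behind two helpers. This is only a sketch: the names PY3, text_type, binary_type, to_bytes, and to_text are hypothetical and are not part of feedparser.

```python
import sys

# Hypothetical compatibility shim; none of these names come from feedparser.
PY3 = sys.version_info[0] >= 3

if PY3:
    text_type = str
    binary_type = bytes
else:
    # The name "unicode" only exists on Python 2, so this branch never runs on 3.
    text_type = unicode  # noqa: F821
    binary_type = str


def to_bytes(value, encoding="utf-8"):
    """Return value as binary data, encoding it if it is text."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it if it is binary data."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value
```

With helpers like these, the rest of the code can call to_bytes() or to_text() at the boundaries (network reads, file writes) instead of branching on the Python version everywhere.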

Original comment by a...@google.com on 20 Jun 2010 at 2:57

GoogleCodeExporter commented 9 years ago
I've modified feedparser so that it runs cleanly in Python 2.4 and up, and can 
also be converted by the 2to3 tool and run in both Python 3.0 and 3.1. It 
passes all of the unit tests across Python 2.4 through 3.1.

https://github.com/kurtmckee/feedparser/tree/py3

I'm not able to compile Python 2.3 on Ubuntu Maverick, so I haven't tested my 
changes on any version of Python older than 2.4, but I avoided changing code as 
much as possible, so hopefully absurdly old versions of Python can continue to 
run the code if necessary. I haven't yet installed chardet and BeautifulSoup to 
test feedparser with those libraries, but I will in the near future.

The only caveat is that, because sgmllib was deprecated in Python 2.6 and is no 
longer included in Python 3, it's necessary to copy sgmllib.py from the Python 
2 standard library (I used the version included in Python 2.7), run it through 
the 2to3 tool, and remove the lines at the top that import, use, and then 
delete warnpy3k (a helper from the warnings module), which also doesn't exist 
in Python 3.
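
To check whether a converted copy of sgmllib.py is in place before running under Python 3, something like the following could be used. It is a sketch only; have_module is a hypothetical helper, not something in feedparser, and it relies on importlib, which is available in Python 2.7 and 3.1 or later.

```python
import importlib


def have_module(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


# On Python 3 this is True only when a converted sgmllib.py is on the path.
sgmllib_available = have_module("sgmllib")
```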

If anyone doesn't want to use git I can provide a patch file that will apply 
cleanly to svn r316. If it's appropriate I can add the sgmllib.py file to the 
git branch with the minor changes noted above.

Please let me know in what ways I can improve the branch so that it can be 
merged into svn trunk!

Original comment by kurtmckee on 26 Nov 2010 at 3:30

GoogleCodeExporter commented 9 years ago
I think we're ready to attempt the Python 3 merge. Can you generate a patch 
that I can apply against HEAD (as of revision 346) so that people can try it 
out?

Original comment by adewale on 22 Dec 2010 at 11:07

GoogleCodeExporter commented 9 years ago
I'm attaching the patch against r346 and the additional files I can think of, 
but I recommend pulling from the git branch I linked to above in case I drop 
the ball and miss a supporting file.

Original comment by kurtmckee on 23 Dec 2010 at 3:12

GoogleCodeExporter commented 9 years ago
As of revision 349 all of the changes for Python 3 support are in.
Note that I'm seeing the following errors when trying to run the tests using 
Python 3.1:

======================================================================
ERROR: test_000225 (__main__.TestCase)
./tests/wellformed/http/headers_foo.xml: capture arbitrary HTTP header
----------------------------------------------------------------------
Traceback (most recent call last):
  File "feedparsertest.py", line 226, in <lambda>
    method(self, evalString, feedparser.parse(xmlfile))
  File "feedparsertest.py", line 143, in failUnlessEval
    if not eval(evalString, env):
  File "<string>", line 1, in <module>
KeyError: 'x-foo'

======================================================================
ERROR: test_000850 (__main__.TestCase)
./tests/illformed/encoding/linenoise.xml: unguessable characters
----------------------------------------------------------------------
Traceback (most recent call last):
  File "feedparsertest.py", line 143, in failUnlessEval
    if not eval(evalString, env):
  File "<string>", line 1
    bozo and entries[0].summary==u'\xe2\u20ac\u2122\xe2\u20ac\x9d\u0160'
                                                                       ^
SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "feedparsertest.py", line 226, in <lambda>
    method(self, evalString, feedparser.parse(xmlfile))
  File "feedparsertest.py", line 151, in failUnlessEval
    if not eval(evalString, env):
  File "<string>", line 1
    bozo and entries[0].summary==u'\xe2\u20ac\u2122\xe2\u20ac\x9d\u0160'
                                                                       ^
SyntaxError: invalid syntax

----------------------------------------------------------------------
Ran 4099 tests in 42.266s

With my configuration 4099 tests are being run. How many are being run with 
your configuration?

Original comment by adewale on 24 Dec 2010 at 2:09

GoogleCodeExporter commented 9 years ago
4099 tests run and pass. I simply forgot to attach modified versions of the two 
tests that are failing. (I wish you could pull directly from the git branches; 
the fact that Subversion makes you deal with patch files is heartbreaking!)

headers_foo.xml has to be modified because Python 2 and Python 3 handle HTTP 
headers differently. Python 2 normalizes all of the keys corresponding to HTTP 
header names to lowercase. Python 3 doesn't. For this reason it's necessary to 
modify the testcase to check for either 'x-foo' (for Python 2) or 'X-Foo' (for 
Python 3, which is also the actual header that was sent).
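
An alternative to checking both spellings is a case-insensitive lookup. The sketch below assumes the headers are available as a plain dict; get_header is a hypothetical helper and is not how the modified testcase works.

```python
def get_header(headers, name, default=None):
    """Look up an HTTP header without regard to the case of its name."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default


# The same header as it would appear with lowercased keys and with the
# original casing preserved.
lowercased = {"x-foo": "bar"}
as_sent = {"X-Foo": "bar"}

value_from_lowercased = get_header(lowercased, "X-Foo")
value_from_as_sent = get_header(as_sent, "x-foo")
```

Both lookups return the same value, so a check written this way does not depend on which Python version produced the header mapping.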

linenoise.xml needs to be modified because it's the only test that doesn't 
include a space between the '=' and the 'u'. Adding a space makes the u'' -> '' 
conversion in feedparsertest.py simpler.
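
To illustrate why the space matters, here is a sketch that assumes the conversion is a plain string replacement keyed on a leading space. That is an assumption about feedparsertest.py, not a quote from it, and both function names are hypothetical. A regex-based variant that does not need the space is shown for comparison.

```python
import re


def naive_strip(expr):
    """Assumed simple approach: only matches u' when a space precedes it."""
    return expr.replace(" u'", " '")


# Matches a u/U prefix directly before a quote, unless it is the tail of an
# identifier such as menu['a'].
_U_PREFIX = re.compile(r"(?<![A-Za-z0-9_])[uU](?=['\"])")


def regex_strip(expr):
    """Drop u prefixes regardless of surrounding whitespace.

    Limitation: a lone "u" at the very end of a quoted string would also
    be removed, because this works on raw text and does not parse Python.
    """
    return _U_PREFIX.sub("", expr)


with_space = "bozo and entries[0].summary == u'abc'"
without_space = "bozo and entries[0].summary==u'abc'"
```

The naive version leaves without_space untouched, which matches the SyntaxError seen under Python 3.1; the regex version handles both but is harder to reason about, so adding the space to the one testcase is the simpler fix.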

Additionally, convert_to_py3.sh needs to be in the root directory (the same 
directory as README-PYTHON3). It fails to convert the files because the paths 
are incorrect where it's located now.

And finally, I noticed you added a line in README-PYTHON3 that we're requiring 
the 2to3 tool from Python 3.1, and I looked into what kind of problem you might 
be seeing. It looks like Python 3.0's 2to3 doesn't include the --no-diffs 
command line option, but Python 2.6 and Python 3.1 do. I recommend removing 
that line from README-PYTHON3 and instead modifying the sample command line in 
README-PYTHON3 as well as the conversion script so that they don't include the 
--no-diffs option. I've made and pushed this change to the git branch at github.

Original comment by kurtmckee on 24 Dec 2010 at 7:20

GoogleCodeExporter commented 9 years ago
As of r354 there's a one-line fix that's needed to get all of the tests passing 
in Python 3 (and with this patch all of my patches will be tested against 
Python 3.0 and 3.1!). Attached is a patch, git branch updated as well.

Original comment by kurtmckee on 3 Jan 2011 at 12:59

GoogleCodeExporter commented 9 years ago
Patch applied in revision 355. Ran the tests against Python 3: Ran 4106 tests 
in 41.667s
I'm marking this as fixed. Great work, Kurt.

Original comment by adewale on 4 Jan 2011 at 3:45