scrapy / scrapely

A pure-python HTML screen-scraping library

Support CJK string annotations; print CJK strings readably in scrapely.tool's output #45

Open xyb opened 10 years ago

xyb commented 10 years ago

scrapely.tool crashes when a CJK string is used as an annotation:

$ python -m scrapely.tool blog.json
scrapely> ta http://blog.douban.com/douban/2013/07/04/2630/
[0] http://blog.douban.com/douban/2013/07/04/2630/
scrapely> t 0 算法工程师如何改进豆瓣电影 TOP250
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 189, in <module>
    main()
  File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 186, in main
    t.cmdloop()
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cmd.py", line 142, in cmdloop
    stop = self.onecmd(line)
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cmd.py", line 221, in onecmd
    return func(arg)
  File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 48, in do_t
    selection = apply_criteria(criteria, tm)
  File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 147, in apply_criteria
    sel = tm.select(func)
  File "scrapely/template.py", line 48, in select
    score = score_func(fragment, htmlpage)
  File "scrapely/template.py", line 95, in func
    if text in fdata:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
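The last frame of the traceback is the `text in fdata` comparison, and byte 0xe7 is the first byte of the UTF-8 encoding of 算. That suggests the annotation arrived as a UTF-8 byte string while the page fragment was unicode, so Python 2 attempted an implicit ASCII decode. Below is a minimal sketch of the usual remedy, normalizing both sides to text before comparing. It is written in Python 3 syntax, the helper names `to_text` and `fragment_contains` are hypothetical, and it is not the code from this PR.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding bytes with the assumed encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def fragment_contains(needle, fragment, encoding="utf-8"):
    """Compare text with text so no implicit decoding is ever needed."""
    return to_text(needle, encoding) in to_text(fragment, encoding)


# The annotation as raw bytes, as it might arrive from a terminal (assumption).
raw_annotation = "算法工程师如何改进豆瓣电影 TOP250".encode("utf-8")
fragment = "<h1>算法工程师如何改进豆瓣电影 TOP250</h1>"

print(raw_annotation[0] == 0xE7)                     # True
print(fragment_contains(raw_annotation, fragment))   # True
```

The encoding is a parameter because terminal input is not guaranteed to be UTF-8; in a real tool it would more likely be taken from the terminal or locale settings.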

I fixed this and also improved the readability of scrapely.tool's output when it includes CJK Unicode characters:

$ python -m scrapely.tool blog.json
scrapely> t 0 算法工程师如何改进豆瓣电影 TOP250
[0] u'<h1>算法工程师如何改进豆瓣电影 TOP250</h1>'
[1] u'<title>豆瓣blog  &raquo; Blog Archive   &raquo; 算法工程师如何改进豆瓣电影 TOP250</title>'
[2] u'<link rel="alternate" type="application/rss+xml" title="豆瓣blog &raquo; 算法工程师如何改进豆瓣电影 TOP250 评论 Feed" href="http://blog.douban.com/douban/2013/07/04/2630/feed/" />'
scrapely> 
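For readers wondering how such readable output could be produced, here is a minimal sketch, written for Python 3 rather than the Python 2.7 used in this report, and not the code from this PR. It starts from an ASCII-only repr (`ascii()`), then turns `\uXXXX` escapes back into characters while leaving an escaped backslash alone. Only the name `readable_repr` comes from this thread; the regular expression and function body are assumptions made for illustration. Characters escaped as `\xNN` or `\UXXXXXXXX` are not handled.

```python
import re

# Either an escaped backslash (kept as-is) or a \uXXXX escape (restored).
_ESCAPE = re.compile(r"(\\\\)|\\u([0-9a-fA-F]{4})")


def readable_repr(text):
    """Return a repr of text in which \\uXXXX escapes are shown as characters."""
    escaped = ascii(text)  # ASCII-only repr, similar to Python 2's repr of unicode

    def restore(match):
        if match.group(1):
            # An escaped backslash: the "u" after it is literal text.
            return match.group(1)
        return chr(int(match.group(2), 16))

    return _ESCAPE.sub(restore, escaped)


print(readable_repr("cjk 中日韩 \\u535a"))
```

Matching the escaped backslash first is what keeps a literal `\u535a` in the input from being turned into a character. In Python 3 the built-in `repr` already leaves printable non-ASCII characters unescaped, so a helper like this mainly matters when starting from an escaped representation.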
xyb commented 10 years ago

A doctest is reasonable. Actually, I tried adding a doctest for this, but it failed:

    >>> u = u'cjk 中日韩 \\u535a'
    >>> u
    u'cjk \u4e2d\u65e5\u97e9 \\u535a'
    >>> repr(u)
    "u'cjk \\u4e2d\\u65e5\\u97e9 \\\\u535a'"
    >>> print repr(u)
    u'cjk \u4e2d\u65e5\u97e9 \\u535a'
    >>> readable_repr(u)
    u"u'cjk \u4e2d\u65e5\u97e9 \\\\u535a'"
    >>> print readable_repr(u)
    u'cjk 中日韩 \\u535a'

This is copied from the Python shell output and could serve as documentation. But if you run it as a doctest, you get this strange result:

**********************************************************************
File "readable_repr.py", line 12, in __main__.readable_repr
Failed example:
    u
Expected:
    u'cjk \u4e2d\u65e5\u97e9 \u535a'
Got:
    u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 \u535a'
**********************************************************************
File "readable_repr.py", line 14, in __main__.readable_repr
Failed example:
    repr(u)
Expected:
    "u'cjk \u4e2d\u65e5\u97e9 \\u535a'"
Got:
    "u'cjk \\xe4\\xb8\\xad\\xe6\\x97\\xa5\\xe9\\x9f\\xa9 \\u535a'"
**********************************************************************
File "readable_repr.py", line 16, in __main__.readable_repr
Failed example:
    print repr(u)
Expected:
    u'cjk \u4e2d\u65e5\u97e9 \u535a'
Got:
    u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 \u535a'
**********************************************************************
File "readable_repr.py", line 18, in __main__.readable_repr
Failed example:
    readable_repr(u)
Expected:
    u"u'cjk \u4e2d\u65e5\u97e9 \\u535a'"
Got:
    u"u'cjk \\xe4\\xb8\\xad\\xe6\\x97\\xa5\\xe9\\x9f\\xa9 \u535a'"
/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py:1531: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if got == want:
/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py:1551: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if got == want:
**********************************************************************
File "readable_repr.py", line 20, in __main__.readable_repr
Failed example:
    print readable_repr(u)
Expected:
    u'cjk 中日韩 \u535a'
Got:
    u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 博'
**********************************************************************
1 items had failures:
   5 of   6 in __main__.readable_repr
***Test Failed*** 5 failures.

kmike commented 10 years ago

In Python 2.x, doctests just can't handle non-ASCII text. There are some bug reports about that in the Python bug tracker, but as I recall they were all closed because the issue is fixed in Python 3.x. In 2.x it won't work.
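As a rough illustration of the Python 3 behaviour mentioned here, the sketch below runs a small doctest containing CJK text through the standard `doctest` machinery. The example text and the names `EXAMPLE` and `cjk_example` are made up for this sketch and are not taken from scrapely.

```python
import doctest

# A doctest session with non-ASCII text, stored as a plain string.
EXAMPLE = """
>>> text = 'cjk 中日韩'
>>> text
'cjk 中日韩'
>>> print(repr(text))
'cjk 中日韩'
"""

parser = doctest.DocTestParser()
test = parser.get_doctest(EXAMPLE, {}, "cjk_example", "<cjk_example>", 0)

runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)  # a TestResults tuple with failed and attempted counts

print(results.failed, results.attempted)
```

Under Python 3 all three examples should pass, since source text and `repr` output are both unicode and printable CJK characters are not escaped.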

pablohoffman commented 10 years ago

Maybe just add a unittest if doctests don't handle non-ascii text in Python 2.x?
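A unit test along these lines might look like the sketch below. It is written for Python 3, and `readable_repr` here is only a stand-in that delegates to the built-in `repr`, not the function from this PR; the test class and method names are likewise hypothetical.

```python
import unittest


def readable_repr(text):
    # Stand-in for illustration: Python 3's repr keeps printable CJK as-is.
    return repr(text)


class ReadableReprTest(unittest.TestCase):
    def test_cjk_characters_stay_readable(self):
        self.assertEqual(readable_repr("cjk 中日韩"), "'cjk 中日韩'")

    def test_literal_backslash_escape_is_preserved(self):
        # The input holds a literal backslash followed by "u535a",
        # which must not be turned into the character it spells out.
        self.assertEqual(readable_repr("\\u535a"), "'\\\\u535a'")
```

Saved in a test module, this can be run with `python -m unittest`. Unlike a doctest, the expected values are ordinary string literals, so they do not depend on how the interpreter echoes non-ASCII output.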

xyb commented 10 years ago

@pablohoffman, @kmike, sorry for the delayed reply. I have added unit tests for the readable_repr function and for the best_match text encoding correction (already moved to scrapely.tool).

mattdbr commented 9 years ago

Any updates?

kmike commented 9 years ago

@akkatracker, if you use the latest scrapely master with Python 3, it should print all characters correctly. Fixing it for Python 2.x could be ugly.

Unicode input issues are fixed by #46, both for Python 2.x and 3.x.

The issue from the PR description should be fixed in scrapely master if you use Python 3.x. This PR provides some nice unit tests, fixes similar to #56, and an unfinished attempt to fix unicode output for Python 2.x; that's why it is not closed.