scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License

Error in Slybot tests #500

Closed leorocher closed 7 years ago

leorocher commented 8 years ago

Hi,

I have an encoding error when running some of the Slybot tests. The error seems to be linked to the use of the scrapely library.

Here, for example, is what I get for test_spider.SpiderTest:

Traceback (most recent call last):
  File "/opt/amass-trainer-0.0.1-SNAPSHOT/slybot/slybot/tests/test_spider.py", line 317, in test_variants
    spider = self.smanager.create(name)
  File "/opt/amass-trainer/TRAINER_ENV/lib/python2.7/site-packages/slybot/spidermanager.py", line 61, in create
    **args)
  File "/opt/amass-trainer/TRAINER_ENV/lib/python2.7/site-packages/slybot/spider.py", line 41, in __init__
    settings, spec, item_schemas, all_extractors)
  File "/opt/amass-trainer/TRAINER_ENV/lib/python2.7/site-packages/slybot/spider.py", line 201, in _configure_plugins
    instance.setup_bot(settings, spec, schemas, extractors)
  File "/opt/amass-trainer/TRAINER_ENV/lib/python2.7/site-packages/slybot/plugins/scrapely_annotations/annotations.py", line 41, in setup_bot
    ), key=lambda x: x[0])
  File "/opt/amass-trainer/TRAINER_ENV/lib/python2.7/site-packages/slybot/plugins/scrapely_annotations/annotations.py", line 40, in <genexpr>
    for t in templates if t.get('page_type', 'item') == 'item'
  File "/opt/amass-trainer/TRAINER_ENV/lib/python2.7/site-packages/scrapely/htmlpage.py", line 49, in dict_to_page
    return HtmlPage(url, headers, body, page_id, encoding)
  File "/opt/amass-trainer/TRAINER_ENV/lib/python2.7/site-packages/scrapely/htmlpage.py", line 78, in __init__
    assert isinstance(body, unicode), "unicode expected, got: %s" % type(body).__name__
AssertionError: unicode expected, got: str

I am using python 2.7, Portia 16.06.1, slybot 0.13.0b18 and scrapely 0.12.0. Any clues?

Note that this error happens not only in the tests but also at run time, when running the scrapely extraction on some web pages.
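The assertion fires because a raw byte string reaches scrapely's HtmlPage, which requires decoded text. A minimal defensive sketch of the idea (shown in Python 3 terms, where `unicode` is `str`; `ensure_text` is a hypothetical helper, not part of slybot or scrapely):

```python
def ensure_text(body, encoding="utf-8"):
    """Decode raw bytes to text so parsers that assert text input
    (like scrapely's HtmlPage) don't fail on byte strings.

    `encoding` should come from the response headers when available;
    utf-8 here is only a fallback assumption.
    """
    if isinstance(body, bytes):
        return body.decode(encoding)
    return body


# Usage sketch: decode before handing the body to HtmlPage, e.g.
# page = HtmlPage(url, headers, ensure_text(raw_body), page_id, encoding)
```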

ruairif commented 8 years ago

Hi @leorocher, is this still happening for you? The slybot tests run fine on the CI server.

leorocher commented 8 years ago

I tried again today after upgrading to slybot 0.13.0b19, and I still have the same issue.

FYI, here is what I have installed:

pip freeze

Pillow==3.2.0
PyDispatcher==2.0.5
PyYAML==3.11
Scrapy==1.1.0
Twisted==16.1.1
adblockparser==0.5
autobahn==0.10.4
bcrypt==2.0.0
cffi==1.5.2
characteristic==14.3.0
chardet==2.3.0
cryptography==1.3.1
cssselect==0.9.1
dateparser==0.2.0
dulwich==0.9.7
enum34==1.1.2
funcparserlib==0.3.6
functools32==3.2.3.post2
idna==2.1
ipaddress==1.0.16
jdatetime==1.7.4
jsonschema==2.4.0
loginform==1.0
lupa==1.3
lxml==3.4.1
monotonic==0.3
mysql-connector-python==1.2.3
ndg-httpsclient==0.4.0
numpy==1.11.0
page-finder==0.1.1
parse==1.6.6
parsel==1.0.2
psutil==4.1.0
psycopg2==2.6.1
pyOpenSSL==16.0.0
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycparser==2.14
pysqlite==2.8.2
python-dateutil==2.4.2
pytz==2016.3
qt5reactor==0.3
queuelib==1.4.2
re2==0.2.23
regex==2016.4.3
requests==2.7.0
scrapely==0.12.0
scrapyjs==0.1.1
service-identity==14.0.0
six==1.10.0
slybot==0.13.0b19
slyd==0.0.0
splash==2.1
txaio==2.2.2
umalqurra==0.2
w3lib==1.14.2
wsgiref==0.1.2
xvfbwrapper==0.2.8
zope.interface==4.1.3

sagelliv commented 8 years ago

@leorocher: Can you try again with slybot 0.13.0b20?
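When retrying after an upgrade, it can help to confirm which versions are actually active in the environment under test. A quick convenience sketch using `pkg_resources` (part of setuptools; not part of the slybot tooling):

```python
import pkg_resources  # ships with setuptools


def installed_version(dist_name):
    """Return the installed version string for a distribution,
    or None if it is not installed in this environment."""
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return None


# Print the versions relevant to this issue.
for name in ("slybot", "scrapely"):
    print(name, installed_version(name))
```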

samirfor commented 7 years ago

MBS:portia samirfor$ git log -1
commit 0526b932479cd9fe28ce587c676ec27719345d3d
Author: Ruairi Fahy <ruairifahy91@gmail.com>
Date:   Wed Oct 26 14:37:12 2016 +0100

    Release slybot 0.13.0b26

    Fix issue with empty css selectors
    Change IblItem to be a subclass of scrapy.Item
...
MBS:portia samirfor$ docker run -itp 9001:9001 --rm -v $PWD/data:/app/slyd/slyd/data:rw --name portia portia
2016-11-01 02:50:27+0000 [-] Log opened.
2016-11-01 02:50:27.947515 [-] Splash version: 2.2.1
2016-11-01 02:50:27.950249 [-] WARNING: Lua scripting is not available because 'lupa' Python package is not installed
2016-11-01 02:50:27.952117 [-] Qt 5.5.1, PyQt 5.5.1, WebKit 538.1, sip 4.17, Twisted 15.4.0
2016-11-01 02:50:27.954090 [-] Python 2.7.6 (default, Jun 22 2015, 17:58:13) [GCC 4.8.2]
2016-11-01 02:50:27.955168 [-] Open files limit: 1048576
2016-11-01 02:50:27.956392 [-] Can't bump open files limit
2016-11-01 02:50:28.434613 [-] Xvfb is started: ['Xvfb', ':1', '-screen', '0', '1024x768x24']
2016-11-01 02:50:30.906415 [-] /app/slyd/slyd/bot.py:25: scrapy.exceptions.ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
2016-11-01 02:50:30.925776 [-] /app/slyd/slyd/bot.py:32: scrapy.exceptions.ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
2016-11-01 02:50:32.917288 [-] /app/slyd/slyd/bot.py:60: scrapy.exceptions.ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
2016-11-01 02:50:32.939310 [-] Site starting on 9002
2016-11-01 02:50:32.941136 [-] Starting factory <slyd.server.Site instance at 0x7f87c15cb128>
...
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.
2016-11-01 02:50:16.956547 [-] "127.0.0.1" - - [01/Nov/2016:02:50:16 +0000] "GET /proxy?url=https%3A%2F%2Fwww.bilheteriavirtual.com.br%2Fimg%2Fheader-bg.png&tabid=139648766211024&referer=www.bilheteriavirtual.com.br HTTP/1.0" 200 48394 "http://192.168.99.100:9001/proxy?url=https%3A%2F%2Fwww.bilheteriavirtual.com.br%2Fcss%2Fstyle.css&tabid=139648766211024&referer=www.bilheteriavirtual.com.br" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
Traceback (most recent call last):
  File "/app/slyd/slyd/splash/ferry.py", line 165, in sendMessage
    self.protocol.sendMessage(metadata(self.protocol))
  File "/app/slyd/slyd/splash/commands.py", line 100, in metadata
    res.update(extract(socket))
  File "/app/slyd/slyd/splash/commands.py", line 116, in extract
    js_items, js_links = extract_data(url, html, socket.spider, templates)
  File "/app/slyd/slyd/splash/utils.py", line 26, in extract_data
    for value in spider.parse(page(url, html)):
  File "/app/slybot/slybot/spider.py", line 228, in _handle
    for item_or_request in itertools.chain(*generators):
  File "/app/slybot/slybot/plugins/scrapely_annotations/annotations.py", line 121, in handle_html
    htmlpage = htmlpage_from_response(response, _add_tagids=True)
  File "/app/slybot/slybot/utils.py", line 103, in htmlpage_from_response
    encoding=response.encoding)
  File "/usr/local/lib/python2.7/dist-packages/scrapely/htmlpage.py", line 78, in __init__
    assert isinstance(body, unicode), "unicode expected, got: %s" % type(body).__name__
AssertionError: unicode expected, got: str
Aborted

ruairif commented 7 years ago

This should be fixed with the latest scrapely.