scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License
9.28k stars 1.41k forks

portiacrawl doesn't work #647

Closed sp4ghet closed 7 years ago

sp4ghet commented 7 years ago

Running Portia in an Ubuntu container in Docker. (The default Docker image/Dockerfile did not work.)

Python 2.7.6 in a virtualenv

$ pip freeze

adblockparser==0.7
autobahn==0.10.4
cffi==1.9.1
characteristic==14.3.0
chardet==2.3.0
cryptography==1.6
cssselect==1.0.0
dateparser==0.3.5
dulwich==0.9.7
enum34==1.1.6
funcparserlib==0.3.6
idna==2.1
ipaddress==1.0.17
jdatetime==1.8.1
jsonschema==2.4.0
loginform==1.2.0
lxml==3.6.0
monotonic==0.3
mysql-connector-python==1.2.3
ndg-httpsclient==0.4.0
numpy==1.11.2
page-finder==0.1.2
parse==1.6.6
parsel==1.1.0
Pillow==3.4.2
psutil==5.0.0
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycparser==2.17
PyDispatcher==2.0.5
pyOpenSSL==16.2.0
python-dateutil==2.4.2
pytz==2016.10
PyYAML==3.12
qt5reactor==0.3
queuelib==1.4.2
regex==2016.11.21
requests==2.7.0
scrapely==0.12.0
Scrapy==1.1.0
scrapyjs==0.1.1
service-identity==14.0.0
six==1.10.0
-e git+https://github.com/scrapinghub/portia@1b920f1875d3b067b3989e5afb8a3b17c2bf0e71#egg=slybot&subdirectory=slybot
-e git+https://github.com/scrapinghub/portia@1b920f1875d3b067b3989e5afb8a3b17c2bf0e71#egg=slyd&subdirectory=slyd
splash==2.3
Twisted==15.4.0
txaio==2.5.2
umalqurra==0.2
w3lib==1.16.0
xvfbwrapper==0.2.8
zope.interface==4.3.2
$ portiacrawl slyd/data/projects/new_project example.webscraping.com

...
Unhandled error in Deferred:
CRITICAL:twisted:Unhandled error in Deferred:
2016-12-08 11:43:07 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 163, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 167, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/rikuo/portia/portia/slybot/slybot/spidermanager.py", line 55, in __init__
    **kwargs)
  File "/home/rikuo/portia/portia/slybot/slybot/spider.py", line 50, in __init__
    settings, spec, item_schemas, all_extractors)
  File "/home/rikuo/portia/portia/slybot/slybot/spider.py", line 215, in _configure_plugins
    instance.setup_bot(settings, spec, schemas, extractors, self.logger)
  File "/home/rikuo/portia/portia/slybot/slybot/plugins/scrapely_annotations/annotations.py", line 87, in setup_bot
    for page, scrapes, version in group]))
  File "/usr/local/lib/python2.7/dist-packages/scrapely/extraction/__init__.py", line 51, in __init__
    parsed_plus_tdpairs = [(parse_template(self.token_dict, td[0]), td) for td in td_pairs]
  File "/usr/local/lib/python2.7/dist-packages/scrapely/extraction/pageparsing.py", line 28, in parse_template
    parser.feed(template_html)
  File "/usr/local/lib/python2.7/dist-packages/scrapely/extraction/pageparsing.py", line 57, in feed
    self.handle_tag(data, index)
  File "/usr/local/lib/python2.7/dist-packages/scrapely/extraction/pageparsing.py", line 100, in handle_tag
    self._handle_open_tag(html_tag)
  File "/usr/local/lib/python2.7/dist-packages/scrapely/extraction/pageparsing.py", line 192, in _handle_open_tag
    if jannotation.pop('generated', False):
exceptions.TypeError: pop() takes at most 1 argument (2 given)
CRITICAL:twisted:
2016-12-08 11:43:07 [twisted] CRITICAL:

So I guess this could be more of a scrapely, twisted, or scrapy problem; not sure where to start, though. It looks like jannotation is being treated as a list rather than a dict: list.pop() takes at most one positional index argument, while dict.pop() accepts a key and a default, which matches the pop('generated', False) call in the traceback.
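
The TypeError is consistent with that reading: dict.pop() accepts an optional default, but list.pop() takes at most one index. A minimal standalone repro (hypothetical snippet, not actual slybot code):

```python
# What the code expects: a dict annotation.
annotation = {'generated': True}
print(annotation.pop('generated', False))  # dict.pop(key, default) -> True

# What it apparently receives: a list.
annotation = [{'generated': True}]
try:
    # list.pop() takes at most one (index) argument, so two arguments fail.
    annotation.pop('generated', False)
except TypeError as exc:
    print('TypeError:', exc)
```

The same two-argument call succeeds on the dict and raises on the list, reproducing the crash seen in `pageparsing.py`.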

kode-ninja commented 7 years ago

Same here: Vagrant on Ubuntu (using the Vagrantfile from git).

jomlaapps commented 7 years ago

same here

AlexTan-b-z commented 7 years ago

Same here. I find that the latest version of slybot has this problem.

DharmeshPandav commented 7 years ago

Same error on the Scrapinghub platform as well!

brandomr commented 7 years ago

I get this same error trying to use portia2code on a portia project.

ruairif commented 7 years ago

@brandomr Can you update your slybot to the latest version (0.13.0b31)? Also, if you are using portia2code, note that it only works with Portia 2.0 projects.
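
One way to check whether an installed slybot is at least the release mentioned above is to compare PEP 440 versions; a small sketch using setuptools' version parsing (the helper function is hypothetical, not part of slybot):

```python
from pkg_resources import parse_version

def slybot_is_new_enough(installed, required='0.13.0b31'):
    """Return True if the installed slybot version is at least the required one.

    PEP 440 ordering makes beta pre-releases (b30, b31, ...) compare
    numerically and sort before the corresponding final release.
    """
    return parse_version(installed) >= parse_version(required)

print(slybot_is_new_enough('0.13.0b30'))  # False: older beta
print(slybot_is_new_enough('0.13.0b31'))  # True
```

The actual installed version can be read with `pip show slybot` or `pkg_resources.get_distribution('slybot').version`.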

brandomr commented 7 years ago

@ruairif thanks, my slybot is version 0.13.0b31, but I built Portia off the master branch, not nui-develop, so I believe I'm not using Portia 2.0. Is that the case?