scrapy-plugins / scrapy-jsonrpc

Scrapy extension to control spiders using JSON-RPC

Please complete doc and test it #3

Status: Open. Opened by movingheart 8 years ago

movingheart commented 8 years ago

Some suggestions:

  1. Complete the documentation on how to use this extension; please give an example with Scrapy.
  2. This code has some bugs, e.g. https://github.com/movingheart/django_example/blob/master/QQ%E5%9B%BE%E7%89%8720160628005154.png
denity commented 7 years ago

Has this bug been fixed yet?

redapple commented 7 years ago

@denity, if you're referring to:

2017-05-18 11:25:57 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/twisted/protocols/basic.py", line 571, in dataReceived
    why = self.lineReceived(line)
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/twisted/web/http.py", line 1811, in lineReceived
    self.allContentReceived()
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/twisted/web/http.py", line 1906, in allContentReceived
    req.requestReceived(command, path, version)
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/twisted/web/http.py", line 771, in requestReceived
    self.process()
--- <exception caught here> ---
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/twisted/web/server.py", line 190, in process
    self.render(resrc)
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/twisted/web/server.py", line 241, in render
    body = resrc.render(self)
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/scrapy_jsonrpc/txweb.py", line 11, in render
    return self.render_object(r, txrequest)
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/scrapy_jsonrpc/txweb.py", line 14, in render_object
    r = self.json_encoder.encode(obj) + "\n"
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/scrapy_jsonrpc/serialize.py", line 89, in encode
    return super(ScrapyJSONEncoder, self).encode(o)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "/home/paul/.virtualenvs/scrapy-jsonrpc.py2/local/lib/python2.7/site-packages/scrapy_jsonrpc/serialize.py", line 109, in default
    return super(ScrapyJSONEncoder, self).default(o)
  File "/usr/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
exceptions.TypeError: <scrapy.crawler.Crawler object at 0x7f14cac75dd0> is not JSON serializable

when accessing http://localhost:<webserviceport>/crawler, I believe it's not a valid bug.
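The failure is standard json behavior: the stdlib encoder raises TypeError for any object it has no encoding rule for, and ScrapyJSONEncoder.default() defers to that base implementation when it hits something it doesn't recognize, such as the Crawler object. A minimal reproduction (using a stand-in class, not the real scrapy.crawler.Crawler):

```python
import json

class FakeCrawler:
    """Stand-in for scrapy.crawler.Crawler: an object json has no encoding rule for."""
    pass

try:
    json.dumps(FakeCrawler())
except TypeError as exc:
    # Message wording varies by Python version, but it is always a TypeError.
    print(exc)
```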

With Python 2.7, Scrapy 1.3.3, scrapy-jsonrpc, and a simple spider like this:

# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]

    def start_requests(self):
        for i in range(1000):
            yield scrapy.Request('http://httpbin.org/get?q=%d' % i)

    def parse(self, response):
        pass
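(For completeness, the webservice runs with the extension enabled in the project settings; the snippet below follows the extension path and setting names from the project README, and the comments mark what is assumed:)

```python
# settings.py sketch: enable the scrapy-jsonrpc web service extension.
# Extension path and JSONRPC_ENABLED per the project README.
EXTENSIONS = {
    'scrapy_jsonrpc.webservice.WebService': 500,
}
JSONRPC_ENABLED = True
# JSONRPC_HOST / JSONRPC_PORT can pin the bind address and port
# (setting names per the README; defaults not shown here).
```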

I also get that error when accessing the webservice endpoint in my browser.

But this is not the intended way to interact with this RPC extension.

Users should interact with it the same way example-client.py does.
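Concretely, the client wraps each call in a JSON-RPC 2.0 envelope and POSTs it to the target resource. A minimal sketch of that interaction (written for Python 3's urllib rather than the Python 2.7 used in this thread; the helper names are mine, and the endpoint/method values come from the captures below):

```python
import json
from urllib.request import Request, urlopen

def jsonrpc_payload(method, params=None, req_id=1):
    """Build a JSON-RPC 2.0 request envelope as bytes."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params or {},
        "id": req_id,
    }).encode("utf-8")

def jsonrpc_call(url, method, params=None):
    """POST the envelope to a resource such as /crawler/spiders and return 'result'."""
    req = Request(url, data=jsonrpc_payload(method, params),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())["result"]

# e.g. jsonrpc_call("http://localhost:6025/crawler/spiders", "list")
```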

Example usage: (note: the warnings below should be addressed with https://github.com/scrapy-plugins/scrapy-jsonrpc/pull/11)

$ python example-client.py -H localhost -P 6025 list-running
/home/paul/src/scrapy-jsonrpc/scrapy_jsonrpc/serialize.py:8: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import Spider
spider:7f9fe4276890:example

Internally, this does an HTTP GET on /crawler/engine/open_spiders:

GET /crawler/engine/open_spiders HTTP/1.1
Accept-Encoding: identity
Host: localhost:6025
Connection: close
User-Agent: Python-urllib/2.7

HTTP/1.1 200 OK
Content-Length: 32
Access-Control-Allow-Headers:  X-Requested-With
Server: TwistedWeb/17.1.0
Connection: close
Date: Thu, 18 May 2017 09:34:49 GMT
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, PATCH, PUT, DELETE
Content-Type: application/json

["spider:7f9fe4276890:example"]

In other words, the /crawler resource is not usable directly (at least with GET in a browser).

The example client has bugs too, though: stats, for example, are available at /crawler/stats, not /stats.

list-available does a POST on /crawler/spiders:

$ python example-client.py -H localhost -P 6025 list-available
/home/paul/src/scrapy-jsonrpc/scrapy_jsonrpc/serialize.py:8: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import Spider
/home/paul/src/scrapy-jsonrpc/scrapy_jsonrpc/jsonrpc.py:40: ScrapyDeprecationWarning: Call to deprecated function unicode_to_str. Use scrapy.utils.python.to_bytes instead.
  data = unicode_to_str(json.dumps(req))
example

POST /crawler/spiders HTTP/1.1
Accept-Encoding: identity
Content-Length: 59
Host: localhost:6025
Content-Type: application/x-www-form-urlencoded
Connection: close
User-Agent: Python-urllib/2.7

{"params": {}, "jsonrpc": "2.0", "method": "list", "id": 1}

HTTP/1.1 200 OK
Content-Length: 51
Access-Control-Allow-Headers:  X-Requested-With
Server: TwistedWeb/17.1.0
Connection: close
Date: Thu, 18 May 2017 09:37:16 GMT
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, PATCH, PUT, DELETE
Content-Type: application/json

{"jsonrpc": "2.0", "result": ["example"], "id": 1}

get-global-stats does another POST, this time to /crawler/stats:

$ python example-client.py -H localhost -P 6025 get-global-stats
/home/paul/src/scrapy-jsonrpc/scrapy_jsonrpc/serialize.py:8: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import Spider
/home/paul/src/scrapy-jsonrpc/scrapy_jsonrpc/jsonrpc.py:40: ScrapyDeprecationWarning: Call to deprecated function unicode_to_str. Use scrapy.utils.python.to_bytes instead.
  data = unicode_to_str(json.dumps(req))
log_count/DEBUG                          115
scheduler/dequeued                       113
log_count/INFO                           12
downloader/response_count                113
downloader/response_status_count/200     113
log_count/WARNING                        4
scheduler/enqueued/memory                113
downloader/response_bytes                72569
start_time                               2017-05-18 09:32:18
scheduler/dequeued/memory                113
scheduler/enqueued                       113
downloader/request_bytes                 24743
response_received_count                  113
downloader/request_method_count/GET      114
downloader/request_count                 114

POST /crawler/stats HTTP/1.1
Accept-Encoding: identity
Content-Length: 64
Host: localhost:6025
Content-Type: application/x-www-form-urlencoded
Connection: close
User-Agent: Python-urllib/2.7

{"params": {}, "jsonrpc": "2.0", "method": "get_stats", "id": 1}

HTTP/1.1 200 OK
Content-Length: 528
Access-Control-Allow-Headers:  X-Requested-With
Server: TwistedWeb/17.1.0
Connection: close
Date: Thu, 18 May 2017 09:38:54 GMT
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, PATCH, PUT, DELETE
Content-Type: application/json

{"jsonrpc": "2.0", "result": {"log_count/DEBUG": 115, "scheduler/dequeued": 113, "log_count/INFO": 12, "downloader/response_count": 113, "downloader/response_status_count/200": 113, "log_count/WARNING": 4, "scheduler/enqueued/memory": 113, "downloader/response_bytes": 72569, "start_time": "2017-05-18 09:32:18", "scheduler/dequeued/memory": 113, "scheduler/enqueued": 113, "downloader/request_bytes": 24743, "response_received_count": 113, "downloader/request_method_count/GET": 114, "downloader/request_count": 114}, "id": 1}
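The aligned table the client prints is just the decoded "result" object from this response; roughly (using a truncated sample of the body above):

```python
import json

# Truncated sample of the JSON-RPC response body shown above.
response_body = ('{"jsonrpc": "2.0", "result": {"log_count/DEBUG": 115, '
                 '"downloader/request_count": 114}, "id": 1}')

stats = json.loads(response_body)["result"]
for key, value in sorted(stats.items()):
    # Left-align stat names into a 40-character column, like the client output.
    print("%-40s %s" % (key, value))
```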