istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

How to submit a POST request through crawler REST API #231

Closed. YanzhongSu closed this issue 4 years ago.

YanzhongSu commented 4 years ago

The Scrapy Cluster official documentation lists the required and optional properties for submitting a request, but it seems there is no way to specify the request type (GET or POST). Any submitted request is, by default, treated as a GET.

I've gone through the documentation many times and have yet to find an answer, so in the end I came up with a very hacky solution.

  1. First, submit the task as a normal GET request and use attrs to pass the payload:
    curl -X POST http://crawler-api/feed \
    -H 'Content-Type: application/json' \
    -d '{"url": "https://mysite.com", "attrs": {"payload": "areaID=101&listingType=sale&category=land&latitudeLow=18.801600217516675&latitudeHigh=53.21873673612902&longitudeLow=-1.3098910185796058&longitudeHigh=57.576827731420394&offset=0"}, "spiderid": "real_estate", "crawlid": "prod", "appid": "united_kingdom"}'
  2. Inside the spider, I have these few lines of code:
    if response.request.method == "GET":
        req = Request(item["url"], method='POST', body=item['attrs']['payload'], callback=self.parse)
        req.dont_filter = True
        req.meta['appid'] = item['appid']
        req.meta['method'] = 'POST'
        req.meta['body'] = item['attrs']['payload']
        req.meta['attrs'] = item['attrs']
        yield req

    As you can see, I check whether it is a GET request; if it is, I re-submit it as a POST request.

  3. Then I added the following code to the next_request function in distributed_scheduler.py:
    if 'method' in item:
        req.method = item['method']
        req.headers = item['headers']
    if 'body' in item:
        req = req.replace(body=item['body'])

    This specifies the method and updates the body attribute.
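
A side note on why step 3 sets the body via replace() but assigns the method directly: in the Scrapy releases around the time of this issue, Request.method and Request.headers are plain attributes, while Request.body is a read-only property. A minimal sketch:

    from scrapy import Request

    req = Request('https://mysite.com')
    req.method = 'POST'                  # fine: method is a plain attribute
    # req.body = 'offset=0'              # raises AttributeError (read-only property)
    req = req.replace(body='offset=0')   # the supported way to change the body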

Even though this solved my problem, I think there has to be a better way to do it.

madisonb commented 4 years ago

Are you trying to submit the POST request on the very first request out of the spider? If so, you are correct: the distributed scheduler assumes you are trying to make a GET in this method. You are more than welcome to modify anything in that method to fit your needs, just return a request at the end. You covered most of this in step 3.

In step 2, your spider can yield any kind of request you want, since the scheduler uses the Scrapy serializer to store the request in redis. When it is read back out, it calls this method to parse it back into a normal request and move on.
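
As a concrete illustration of that round trip, here is a minimal sketch, assuming a Scrapy version that still ships scrapy.utils.reqser (newer releases deprecate these helpers):

    from scrapy import Request
    from scrapy.utils.reqser import request_to_dict, request_from_dict

    post = Request('https://mysite.com', method='POST', body='offset=0')
    d = request_to_dict(post)          # roughly what gets stored in redis
    restored = request_from_dict(d)    # what the scheduler rebuilds later
    assert restored.method == 'POST' and restored.body == post.body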

This project simply assumes you are trying to crawl a web page straight from the start, but after that you can do whatever you want. I think you could improve your implementation by altering the kafka monitor crawl API to accept the params you require for your POST, then parsing them in the method I linked above to form your initial POST request and letting the spider handle the rest.
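
For example, a rough sketch of what that could look like; the top-level 'method' and 'body' feed fields and the helper name here are hypothetical, not part of the project:

    from scrapy import Request

    def request_from_feed_item(item):
        # Hypothetical helper for next_request(): build the initial request
        # directly from the feed dict instead of assuming a GET.
        req = Request(
            url=item['url'],
            method=item.get('method', 'GET'),   # default stays GET
            body=item.get('body', ''),
            dont_filter=True,
        )
        # Carry the cluster bookkeeping fields along in meta, as the
        # scheduler already does for fields like appid, crawlid, and attrs.
        for key in ('appid', 'crawlid', 'attrs'):
            if key in item:
                req.meta[key] = item[key]
        return req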

YanzhongSu commented 4 years ago

Thank you for the answer. I think I know what to do next.