rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

[QUESTION] Is there a way to use response.meta ? #290

Closed SteliosGiannatos closed 8 months ago

SteliosGiannatos commented 8 months ago

Description

I've been using Scrapy for web scraping tasks and frequently pass additional information through response.meta for further processing within the spider. My current project requires getting data fetched from Redis into Scrapy's response.meta, and I'm seeking advice on how to do this with scrapy-redis.

For instance, I store JSON objects in a Redis list that contain metadata I want to access later:

{
  "url": "https://example.com",
  "original_author": "william shakespeare"
}

When I scrape the website, I may find that this specific book, for example, now lists Agatha Christie as the author. I can then apply some logic to handle the case where the author changed.

Question:

Is there an existing feature in scrapy_redis that facilitates passing information from Redis to response.meta?

Use Case:

My primary use case is to check for updates to information that we expect not to change, but that occasionally does. For example, I can read response.meta['original_author'] and check whether it matches the author on the URL I provided. If the author has changed, I can add the updated name, the previous author's name, the date of the change, etc. to my database. Of course, this works for any information that can change. I use the author name as my example because I am not talking about something like product reviews, where you expect the count to increase over time, but rather something you expect to remain constant that nevertheless changes in some cases.
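The comparison step in this use case could be sketched roughly as below. This is only an illustration: `detect_change` and the fields of the record it returns are hypothetical names, a plain dict stands in for response.meta, and the case-insensitive comparison is an assumption that may not suit every field.

```python
from datetime import datetime, timezone


def detect_change(meta, field, scraped_value, url):
    """Return a change record if the scraped value differs from the
    expected value carried in `meta`, otherwise None."""
    expected = meta.get(field)
    if expected is None:
        return None  # nothing to compare against
    # Assumption: ignore differences in case and surrounding whitespace.
    if expected.strip().casefold() == scraped_value.strip().casefold():
        return None
    return {
        "url": url,
        "field": field,
        "previous": expected,
        "current": scraped_value,
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }


# A plain dict standing in for response.meta inside a spider callback.
meta = {"original_author": "william shakespeare"}
record = detect_change(meta, "original_author", "Agatha Christie", "https://example.com")
```

Inside a spider callback, the same function could presumably be called with response.meta and response.url, and a non-None record written to the database.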

SteliosGiannatos commented 8 months ago

I just saw

        Returns a `Request` instance for data coming from Redis.

        Overriding this function to support the `json` requested `data` that contains
        `url` ,`meta` and other optional parameters. `meta` is a nested json which contains sub-data.

        Along with:
        After accessing the data, sending the FormRequest with `url`, `meta` and addition `formdata`, `method`
        For example:
        {
            "url": "https://exaple.com",
            "meta": {
                'job-id':'123xsd',
                'start-date':'dd/mm/yy'
            },
            "url_cookie_key":"fertxsas",
            "method":"POST"
        }

        If `url` is empty, return []. So you should verify the `url` in the data.
        If `method` is empty, the request object will set method to 'GET', optional.
        If `meta` is empty, the request object will set `meta` to {}, optional.

        This json supported data can be accessed from 'scrapy.spider' through response.
        'request.url', 'request.meta', 'request.cookies', 'request.method'

        Parameters
        ----------
        data : bytes
            Message from redis.

This is in the RedisSpider, under the method "make_request_from_data".

I should've done a bit more digging before posting. Thanks anyway!
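Going by the docstring quoted above, the extra fields would need to sit under a "meta" key in the JSON pushed to Redis, rather than at the top level as in the original example. A minimal sketch of queueing such a payload follows. The helpers `build_payload` and `enqueue` and the key "books:start_urls" are hypothetical; the client is assumed to expose `lpush(name, *values)` as redis-py's `redis.Redis` does, and the key must match whatever list the spider actually reads from.

```python
import json


def build_payload(url, **meta):
    """Serialize a URL plus metadata into the JSON shape described in
    the make_request_from_data docstring: {"url": ..., "meta": {...}}."""
    if not url:
        raise ValueError("url must be non-empty")
    return json.dumps({"url": url, "meta": meta})


def enqueue(client, key, url, **meta):
    """Push one payload onto the Redis list `key` and return it.

    `client` is any object with lpush(name, *values), for example a
    redis.Redis instance from redis-py.
    """
    payload = build_payload(url, **meta)
    client.lpush(key, payload)
    return payload


# Usage against a real server (assumes redis-py is installed and the
# spider reads from the hypothetical key "books:start_urls"):
#
#   import redis
#   enqueue(redis.Redis(), "books:start_urls", "https://example.com",
#           original_author="william shakespeare")
```

With a payload shaped like this, the docstring suggests the spider callback can read response.meta['original_author'] without overriding make_request_from_data.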