scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Centralized Request fingerprints #900

Closed kmike closed 1 year ago

kmike commented 9 years ago

It is very easy to introduce a subtle bug when using a custom duplicates filter that changes how the request fingerprint is calculated.

The problem is that there is no way to override the request fingerprint globally: to make Scrapy always take something extra into account (an HTTP header, a meta option), the user must override the duplicates filter and every cache storage in use.
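A minimal sketch of the failure mode (the dupefilter subclass and the settings line are hypothetical; module paths are the current Scrapy ones):

import hashlib

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class MetaAwareDupeFilter(RFPDupeFilter):
    # Hypothetical custom filter: also vary the fingerprint on a meta option.
    def request_fingerprint(self, request):
        fp = request_fingerprint(request)
        extra = str(request.meta.get('render', False))
        return hashlib.sha1((fp + extra).encode('utf-8')).hexdigest()

# settings.py: DUPEFILTER_CLASS = 'myproject.filters.MetaAwareDupeFilter'
# The subtle bug: HttpCacheMiddleware storages keep calling the default
# request_fingerprint(), so requests this filter tells apart can still
# collide in the HTTP cache.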

Ideas about how to fix it:

  1. Use the duplicates filter's request_fingerprint method in cache storage if this method is available;
  2. create a special Request.meta key that the request_fingerprint function will take into account;
  3. create a special Request.meta key that allows providing a pre-calculated fingerprint;
  4. add a settings.py option to override the request fingerprint function globally.
nramirezuy commented 9 years ago

What are you trying to do? Why do you want to change how the request fingerprint is calculated?

kmike commented 9 years ago

Example 1: requests to Splash in proxy mode. The fingerprint should be different when the proxy is enabled, and it should also depend on request headers (because that's how the proxy behaviour is controlled).

Example 2: anything that makes use of the 'include_headers' argument of the request_fingerprint function.
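For reference, include_headers takes a list of header names to mix into the hash; the header used below is made up:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

req1 = Request('http://example.com', headers={'X-Render': 'html'})
req2 = Request('http://example.com', headers={'X-Render': 'png'})

# Headers are ignored by default, so these collide:
assert request_fingerprint(req1) == request_fingerprint(req2)
# Varying on the header separates them:
assert (request_fingerprint(req1, include_headers=['X-Render']) !=
        request_fingerprint(req2, include_headers=['X-Render']))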

chekunkov commented 9 years ago

Example 3: calculating case-insensitive request fingerprints

nramirezuy commented 9 years ago

@chekunkov Insensitive in the arguments or the path?

>>> from scrapy.utils.request import request_fingerprint
>>> from scrapy.http import Request
>>> request_fingerprint(Request('http://example.com?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM?y=T'))
False
>>> request_fingerprint(Request('http://example.com?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM?Y=t'))
True
>>> request_fingerprint(Request('http://example.com?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM?Y=T'))
False
>>> request_fingerprint(Request('http://example.com/index?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM/Index?Y=t'))
False
>>> request_fingerprint(Request('http://example.com/index?Y=t')) == request_fingerprint(Request('http://EXAMPLE.COM/index?Y=t'))
True
chekunkov commented 9 years ago

@nramirezuy in one project I wanted the fingerprint to be calculated for the URL with a lowercased path and with the query and fragment removed
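A minimal sketch of that variant, using only the standard library (the function name is illustrative):

import hashlib
from urllib.parse import urlsplit, urlunsplit

def lowercase_path_fingerprint(request):
    # Lowercase the path and drop the query and fragment before hashing.
    parts = urlsplit(request.url)
    url = urlunsplit((parts.scheme, parts.netloc, parts.path.lower(), '', ''))
    return hashlib.sha1(url.encode('utf-8')).hexdigest()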

chekunkov commented 9 years ago

I made a PR during that project: https://github.com/scrapy/scrapy/pull/533

chekunkov commented 9 years ago
  4. add a settings.py option to override the request fingerprint function globally.

@kmike +1

nramirezuy commented 9 years ago

@kmike In HCF the fingerprint generation is different; I think we are in trouble already.

kmike commented 9 years ago

@chekunkov I also like (4) most. Probably we will have to undo/change some parts of https://github.com/scrapy/scrapy/pull/597.

@nramirezuy you're right; the HCF middleware in scrapylib should probably use the same fingerprint function as other Scrapy parts. This is not a Scrapy issue though.

chekunkov commented 9 years ago

@kmike @nramirezuy can you explain by example why "we are in trouble already", for both HCF and dupefilter+cache? It isn't clear enough for me.

kmike commented 9 years ago

Option (2) also has benefits: with a special meta key, components can add something to request meta to change fingerprints. For example, if a Splash downloader middleware uses the proxy interface, then it knows for sure the fingerprint must be different; it makes sense to handle that at the middleware level without special effort from the user. An alternative is for a component to provide a compatible request fingerprint function that users can use, but that doesn't work if several components want to change fingerprints.

Using the Splash proxy interface in a middleware is a bad idea because it doesn't handle https and probably never will, but that's another question :)

kmike commented 9 years ago

@chekunkov by default the HCF middleware uses the url as the fingerprint. If you want to vary on something else (e.g. take a header into account) some requests will be dropped by HCF - you put them in the queue but never get them back.

nramirezuy commented 9 years ago

I think each component should be able to handle its own version of the fingerprint by itself. Just make it easy to extend.

kmike commented 9 years ago

for the record: I agree with @nramirezuy and also think option (2) is better.

redapple commented 9 years ago

An overridable global request_fingerprint (4) would be quite useful IMO. And at the same time, a default request_fingerprint that accepts a special meta key to force a fingerprint -- as in (2) with the effect of (3) -- would be easy for users and developers.

kmike commented 9 years ago

I'd like to implement it. So, the questions / thoughts:

  1. The fingerprint is built as a sha1 hash of several strings. I think that's fine (because it makes it easy to add more things into the mix), but we need to escape the hashed parts or insert separators: sha1.update(a); sha1.update(b) is the same as sha1.update(a+b), so the existing implementation looks buggy (a separator-safe sketch appears after this list). It runs

    fp.update(canonicalize_url(request.url))
    fp.update(request.body or '')

    which means that an empty POST request to example.com/POST and a POST request to example.com/ with body 'POST' produce the same fingerprint:

    In [1]: from scrapy.utils.request import request_fingerprint
    In [2]: from scrapy.http import Request
    In [3]: req1 = Request('http://example.com/POST', method='POST')
    In [4]: req2 = Request('http://example.com/', method='POST', body='POST')
    In [5]: request_fingerprint(req1)
    Out[5]: 'fb3d5ed717c89556ceebb1e1aa8dbdda2b96963e'
    In [6]: request_fingerprint(req2)
    Out[6]: 'fb3d5ed717c89556ceebb1e1aa8dbdda2b96963e'
  2. To implement centralized fingerprints we should add a centralized, per-request place to store the strings that the hashing function builds the final hash from.

    An alternative is to provide a way to specify the final fingerprint as a string, but that looks like a harder API to work with - each component would have to take the previous fingerprint into account, a buggy component that just sets the fingerprint could undo what other components did, and it would be slower because each component would likely do its own hashing.

  3. I can see two places where we can store the "fingerprint ingredients": Request meta or a new Request attribute.
  4. There is a question about the default fingerprint components - request.method, canonicalize_url(request.url) and request.body or '': should we always add them implicitly, or should we include them by default and allow users to remove them? I vote for including them but allowing users to change them. Not doing so would limit what can be done; case-insensitive URLs are an example.
  5. If we include default ingredients then users should be able to access them individually. The "ingredients storage" could be a list or a dict, not a set. For a list we would have to guarantee that by default the first item is the request method, the second the URL and the third the body. For a dict, each ingredient gets its own key, e.g. {"url": canonicalize_url(request.url), "method": request.method, "body": request.body or ""}. I like the dict more. When building a fingerprint we could prefix values with the key, i.e. do fp.update("url:http://example.com") - see the sketch after this list. A dict makes it easy to add/remove/change values and to inspect where a value came from. A disadvantage of the dict (and of prefixing the values) is that we won't be able to provide a backwards-compatible request fingerprint function (for the same input it will give a different answer); this could invalidate user caches after a Scrapy update.
  6. The include_headers option looks redundant with dict-based centralized fingerprints, but I'm fine with keeping it. It is not used by Scrapy itself, but user code may use it, and it can be convenient. I'm also fine with deprecating include_headers.
  7. Deciding between Request.meta vs Request.fingerprint is not easy.

    Request.fingerprint is more intrusive, and with such an attribute it could make sense to move request_fingerprint to the Request class itself. It also doesn't play nicely with Request immutability - if it is a mutable dict then Request becomes mutable; if it is immutable then users will have to do some dances with replace (and changing individual fingerprint components will be harder), and it will break request identity in some cases. A mutable attribute could be fine though - Request.meta is mutable.

    Request.meta gives us problems if we want to include default components and allow users to change them - there is no precedent for including anything in meta by default, and user code can forget to copy meta["fingerprint"].
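A sketch of the separator-safe, dict-based hashing from points (1) and (5); the length prefix is one possible way to keep parts unambiguous, not a settled format:

import hashlib

def fingerprint_from_ingredients(ingredients):
    # Hash key-prefixed, length-delimited parts so that, unlike plain
    # concatenation, url='a/POST' + body='' can never collide with
    # url='a/' + body='POST'.
    fp = hashlib.sha1()
    for key in sorted(ingredients):
        value = ingredients[key]
        fp.update(('%s:%d:%s' % (key, len(value), value)).encode('utf-8'))
    return fp.hexdigest()

# Default ingredients would be something like:
# {'method': request.method,
#  'url': canonicalize_url(request.url),
#  'body': request.body or ''}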

There are many decisions to make :) My proposal, to have something to talk about:

kmike commented 9 years ago

/cc @pablohoffman @dangra because the changes could touch Request class

nramirezuy commented 9 years ago

@kmike Few questions:

kmike commented 9 years ago

@nramirezuy

When is the hash calculated?

A good question. I'd say on first access to the Request.fingerprint_hash attribute / function / whatever name it ends up with. Also, to prevent errors, after the first access to fingerprint_hash, Request.fingerprint should become immutable.

Are we going to keep the request_fingerprint function, or will the hash be calculated inside the Request class?

The hash will be calculated inside the Request class, but we'll keep request_fingerprint for backwards compatibility and for the include_headers option.

Will Request.fingerprint allow us to change the hash calculation behavior? How?

It will allow us to change the hash calculation algorithm only by subclassing. This limits what can be done, but I don't see use cases where a global change is needed, and if such use cases come up, it can be fixed later.
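A rough sketch of that design - names and details are guesses from this discussion, not a settled API:

import hashlib
import types

class FingerprintedRequest:  # stand-in for scrapy.http.Request
    def __init__(self, url, method='GET', body=''):
        # Mutable ingredient dict; editable until the hash is first built.
        self.fingerprint = {'method': method, 'url': url, 'body': body}
        self._fingerprint_hash = None

    @property
    def fingerprint_hash(self):
        if self._fingerprint_hash is None:
            fp = hashlib.sha1()
            for key in sorted(self.fingerprint):
                part = '%s:%s;' % (key, self.fingerprint[key])
                fp.update(part.encode('utf-8'))
            self._fingerprint_hash = fp.hexdigest()
            # Freeze the ingredients: edits after first use would be bugs.
            self.fingerprint = types.MappingProxyType(dict(self.fingerprint))
        return self._fingerprint_hash

# Subclasses would change the algorithm by overriding fingerprint_hash.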

nramirezuy commented 9 years ago

I can't see the advantage over the current implementation, besides the new way of calculating (key:value updates). Also, if you don't see use cases where a global change is needed, why implement it?

kmike commented 9 years ago

@nramirezuy The advantage is that you can globally add/remove/change the keys/values to be hashed. You can't globally swap sha1 for md5 - that's what I don't see a use case for.

kmike commented 9 years ago

By the way, if we fix the existing request_fingerprint bug then users will have their caches invalidated anyway, because the function will return a different result for the same input. So I wouldn't worry about the "fingerprint ingredients" dict being backwards incompatible - it has the same problem, but that doesn't matter because we already have it.

nramirezuy commented 9 years ago

We can do the fix in 2 steps:

I think Request().meta['fingerprint'] should be the way of overriding the fingerprint globally - as @redapple said before, by forcing request_fingerprint to return that value if present. I just see a problem in moving from one approach to the other, and adding a bunch of stuff to Request doesn't seem like the best approach.
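That forcing behaviour could be a thin wrapper over the stock function - a sketch, with the meta key name taken from this thread:

from scrapy.utils.request import request_fingerprint as _default_fingerprint

def request_fingerprint(request, include_headers=None):
    # A precomputed fingerprint in meta wins over the default calculation.
    if 'fingerprint' in request.meta:
        return request.meta['fingerprint']
    return _default_fingerprint(request, include_headers=include_headers)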

kmike commented 9 years ago

Add a warning telling people to enable a certain setting: "Add FINGERPRINT_OLD = True in your project settings if you want to keep the current fingerprint calculation approach; in the next version it will change and will affect data caches and resumed runs".

I'm OK with this option, but users can keep old behaviour by using Scrapy 0.24.4 - why is an extra setting needed? Scrapy resumed runs don't use request fingerprints, so they shouldn't be affected.

I think Request().meta['fingerprint'] should be the way of overriding the fingerprint globally - as @redapple said before, by forcing request_fingerprint to return that value if present. I just see a problem in moving from one approach to the other, and adding a bunch of stuff to Request doesn't seem like the best approach.

Downsides of meta["fingerprint"]:

I like that meta["fingerprint"] lets us make fewer changes to Request, but the request fingerprint is an important concept used by many Scrapy components, so I'm also ok with adding it as an attribute to the base Request class.

nramirezuy commented 9 years ago

I think the Dupefilter does; I have to check.

If you choose an attribute but don't copy it with replace, doesn't the dev have to handle it anyway?

kmike commented 9 years ago

Well, you have to calculate the fingerprint in the first spider middleware executed, even before anyone else uses it, if the spider has to do it; the developer has to be aware of this. This method supports things like a number for people that pull from MySQL.

Could you please clarify? I don't quite understand what you mean. I think the request fingerprint hash should be calculated on first access, and that first access will in most cases happen in the Scheduler (via a Dupefilter); middlewares won't access the fingerprint hash.

But there aren't components that modify the fingerprint; they use the default fingerprint or calculate their own.

The main motivation for this ticket is to provide a way to create reusable components that modify the fingerprint. For example, the scrapy-splash middleware must add some values from Request.meta to the fingerprint, and it looks like a good idea to add some values from the hcf_params meta key (HCF middleware) to the fingerprint as well.

It's not more code: you create your fingerprint, you assign it to the meta of a request, and suddenly every component that uses request_fingerprint starts using your custom fingerprint instead.

A reusable component shouldn't just override the previous fingerprint; it should change it in some way, preserving what was added to it by other components or by the user. For example, the user may want to vary requests on a specific cookie value, so the user adds that cookie to the fingerprint. Then e.g. an HCF or scrapy-splash (or whatever) middleware wants the fingerprint to take some meta keys into account - it shouldn't just use the default request method + url + body and add a meta value; it should take into account the changes the user made (i.e. that specific cookie value). And then another middleware wants the fingerprint to use case-insensitive URLs - it should keep the cookie and the meta value, but replace 'url' with e.g. a lower-cased 'url'. Then the first function/method that needs the actual fingerprint as a single hash (likely a dupefilter) combines all these components and calculates the hash value.
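To make the composition concrete, a hypothetical sequence over the ingredient dict (all names illustrative):

# Starting point: the default ingredients.
ingredients = {'method': 'GET', 'url': 'http://example.com/Index', 'body': ''}

# 1. The user varies requests on a specific cookie value:
ingredients['cookie:session-group'] = 'b'

# 2. An HCF- or scrapy-splash-style middleware adds a meta value,
#    preserving the user's cookie entry:
ingredients['meta:hcf_params'] = '{"qs": "books"}'

# 3. A case-insensitive-URL middleware rewrites only the 'url' key:
ingredients['url'] = ingredients['url'].lower()

# Finally the dupefilter hashes the combined dict into a single fingerprint.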

If you choose an attribute but don't copy it with replace, doesn't the dev have to handle it anyway?

Right, it is a problem for both meta and attribute ("But the data in a Request attribute has the same issue with .replace()").

nramirezuy commented 9 years ago

But the only components able to change this fingerprint would be the ones called before the scheduler (the spider and spider middlewares). Also, what happens if two or more components want to modify the url, for example? I think the fingerprint should be calculated by a middleware, or use the default; by default I mean a user-defined or Scrapy calculation function. If a dev wants to create a new fingerprint, they just do it and store it in Request.meta; suddenly every component will use it, because there are no components in Scrapy calculating their own fingerprint - they all use the default function.

michael-yin commented 9 years ago

I think we should put the fingerprint calculation in an isolated downloader middleware that focuses on fingerprinting; after the fingerprint has been calculated by the middleware, save the value in response.meta. If another middleware needs to know the fingerprint of the url, it can just find it in response.meta.

I think this process has many advantages:

We can do it step by step:

  1. We need to write a built-in middleware (we can call it FingerprintMiddleware) which uses the default request_fingerprint to calculate the fingerprint and saves the value in response.meta; the built-in duplicate filter and cache storage then use the value from response.meta.
  2. I think DUPEFILTER_CLASS might no longer be useful, because a middleware now takes care of the fingerprint of the url. Actually, I think the duplicate filter should be concerned with whether the url has been crawled, not with fingerprint calculation.

We can add FINGERPRINT_CLASS in settings.py; it might look like this:

FINGERPRINT_CLASS = {
    'Custom_class_fingerprint': 200,
}

If FINGERPRINT_CLASS is in settings.py, we disable the built-in FingerprintMiddleware; the number in the config is the priority, which controls the order.
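A sketch of such a downloader middleware (note that response.meta in Scrapy is a view of the originating request's meta, so storing the value there makes it visible to later components):

from scrapy.utils.request import request_fingerprint

class FingerprintMiddleware:
    # Hypothetical built-in: compute the fingerprint once, up front.
    def process_request(self, request, spider):
        request.meta.setdefault('fingerprint', request_fingerprint(request))
        return None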

hecris commented 5 years ago

Hello all, my name is Christopher He and I'm a student at Hunter College located in New York. I would like to contribute to scrapy as part of Google Summer of Code 2019. Here is my starting point.

How can I expand this proposal? Can someone provide more examples where customisable fingerprinting would be useful?

Example 1: requests to Splash in proxy mode. The fingerprint should be different when the proxy is enabled, and it should also depend on request headers (because that's how the proxy behaviour is controlled).

scrapy.http.request.Request

Gallaecio commented 5 years ago

Can someone provide more examples where customisable fingerprinting would be useful?

  1. Consider two or more requests the same based on a part of the URL. For example:

    https://example.com/random-text1/id
    https://example.com/random-text2/id
  2. Compare URLs case-insensitively. For example:

    https://example.com/id
    https://example.com/ID
  3. Consider two or more requests the same based on specific query parameters. For example:

    https://example.com/random-text1?id=id&referer=1
    https://example.com/random-text2?id=id&referer=2
  4. Consider two or more requests the same regardless of query parameters. For example:

    https://example.com/id?referer=1
    https://example.com/id?referer=2
  5. Consider two or more requests the same based on a header. For example:

    Request(url1, headers={'X-ID': id})
    Request(url2, headers={'X-ID': id})
  6. Consider two or more requests different based on a header. For example:

    Request(url, headers={'X-ID': id1})
    Request(url, headers={'X-ID': id2})
  7. Consider two or more requests the same based on specific cookies. For example:

    Request(url1, headers={'Cookie': 'id=id, key2=value2'})
    Request(url2, headers={'Cookie': 'id=id, key2=value3'})
  8. Consider two or more requests different based on a cookie. For example:

    Request(url, headers={'Cookie': 'id=id1'})
    Request(url, headers={'Cookie': 'id=id2'})
  9. Consider two or more requests the same based on information added to Request.meta. For example:

    Request(url1, meta={'id': id})
    Request(url2, meta={'id': id})

Just as important as the scenarios where this is useful is the granularity at which fingerprinting can be customized. Developers should be able to customize fingerprinting (see the sketch after this list):

  1. Globally (e.g. through settings.py)

  2. On a specific spider (e.g. through MySpider.custom_settings)

  3. On specific requests (e.g. through a Request.meta key)
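A sketch of those three levels together; the setting and meta key names are hypothetical, not an existing Scrapy API:

import scrapy

# 1. Globally, in settings.py (hypothetical setting name):
# REQUEST_FINGERPRINT_FUNC = 'myproject.fingerprints.my_fingerprint'

class MySpider(scrapy.Spider):
    name = 'example'
    # 2. On a specific spider:
    custom_settings = {
        'REQUEST_FINGERPRINT_FUNC': 'myproject.fingerprints.my_fingerprint',
    }

    def start_requests(self):
        # 3. On a specific request (hypothetical meta key):
        yield scrapy.Request('https://example.com/id?referer=1',
                             meta={'fingerprint': 'precomputed-sha1-hex'})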

hecris commented 5 years ago

Thanks @Gallaecio. I have a better understanding now. Here is my proposal: to make fingerprints as customisable as possible, we can allow the user to define their own 'rules' in settings.py.

CUSTOM_FINGERPRINTING = {
    # Consider two or more requests the same based on a header.
    # For example:
    #     Request(url1, headers={'X-ID': id})
    #     Request(url2, headers={'X-ID': id})
    1: {
        'check': 'headers',
        'key': 'X-ID',
    },

    # Consider two or more requests the same based on specific query parameters.
    # For example:
    #     https://example.com/random-text1?id=id&referer=1
    #     https://example.com/random-text2?id=id&referer=2
    2: {
        'check': 'query_params',
        'key': 'id',
    },
}

It's not perfect, but it's a starting point. What do you think? Also, is there a direct way to contact you, e.g. via email? @Gallaecio

lenzai commented 5 years ago

+1


Gallaecio commented 5 years ago

I like where this is going, however I see a potential issue with the proposed API.

Good APIs should make common scenarios easy and all scenarios possible. I believe the suggested approach makes common scenarios easy to handle, but it may fall short for unexpectedly complex scenarios.

For example:

Fingerprint 1:
https://example.com/random-text1?id=1
https://example.com/random-text1?id=1&category=1

Fingerprint 2:
https://example.com/random-text1?id=1&category=2

As for my email, I’ve just added it to my Github profile.

hecris commented 5 years ago

@Gallaecio could you clarify the examples? Is fingerprint 1 composed of two URLs? When would this be the case, or am I interpreting your examples wrong?

Gallaecio commented 5 years ago

@hecris Both URLs would return the same fingerprint. The reasoning is that the underlying website uses category=1 as the default, so some links include the category while others don't, but they should be considered duplicates for the purposes of duplicate filtering and caching.
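That scenario can be expressed as a plain Python function rather than rules - a sketch using only the standard library, with the defaults table made up for the example:

import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULTS = {'category': '1'}  # site-specific default values (assumed)

def fingerprint_without_defaults(url):
    # Drop query parameters whose value equals the site default.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if DEFAULTS.get(k) != v]
    normalized = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), ''))
    return hashlib.sha1(normalized.encode('utf-8')).hexdigest()

assert (fingerprint_without_defaults('https://example.com/random-text1?id=1') ==
        fingerprint_without_defaults('https://example.com/random-text1?id=1&category=1'))
assert (fingerprint_without_defaults('https://example.com/random-text1?id=1') !=
        fingerprint_without_defaults('https://example.com/random-text1?id=1&category=2'))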

hecris commented 5 years ago

I agree, the initial proposal, while easy to implement, could not handle complex scenarios.

Here is my revised suggestion. It uses the same concept of 'rules', but can handle more complex situations. It also lets the user define default values.

CUSTOM_FINGERPRINTING_ENABLED = True

CUSTOM_FINGERPRINTING = {
  1: {
    'consider_method': True,
    'consider_url': True,
    'query_params': {
      'id': None,
      'category': '1',  # set default value to 1
    },
    'headers': {
      'ignore': True,  # headers are ignored entirely
    },
    'meta': {
      'ignore': False,  # all meta data will be considered
    },
  }
}

Result:

kmike commented 5 years ago

Hey! Currently I'm not sure we should be developing a rule engine for this. For example, instead of

CUSTOM_FINGERPRINTING = {
  1: {
    'consider_method': True,
    'consider_url': True,
    'query_params': {
      'id': None,
      'category': '1',  # set default value to 1
    },
    'headers': {
      'ignore': True,  # headers are ignored entirely
    },
    'meta': {
      'ignore': False,  # all meta data will be considered
    },
  }
}

I'd prefer to have something along these lines:

from functools import partial
from scrapy.utils.request import request_fingerprint

REQUEST_FINGERPRINT_FUNC = partial(
    request_fingerprint, 
    include_headers=True, 
    include_meta={'foo': True}
)

This way one can use any Python function, and instead of using a dict with string constants we use function arguments. This allows one to:

This exact API probably won't work, as I'd like it to handle more cases - it'd be good to have a way to customize it not only in settings.py but also in middlewares and in a spider, per-request. Anyway, you get the idea :)

By the way, https://github.com/scrapy/scrapy/pull/3420 may be relevant, as we started to look at fingerprints and thinking about API.

hecris commented 5 years ago

Thanks for the feedback @kmike. I will rethink my proposal and get back to you guys

lenzai commented 5 years ago

How about using rewrite rules? Rewriting URLs to a canonical version and storing fingerprints only for those.


Gallaecio commented 5 years ago

@lenzai Fingerprints go beyond URLs.