istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License
1.18k stars 323 forks

what's the purpose of item_copy in pipelines.py? #218

Closed. ChengkaiYang2022 closed this issue 5 years ago

ChengkaiYang2022 commented 5 years ago

Hi. In crawler/crawling/pipelines.py, in the "LoggingBeforePipeline" class, there is a variable called "item_copy" in the process_item function. It simply turns the item into a dict and deletes some keys such as "body" and "links"; after that it is only used for logging. So what is the purpose of "item_copy"? I also have another question: if the item is not a RawResponseItem, for example a user-defined Item, the function returns None, so the following pipelines will not receive the item and will do nothing. I'm confused about this function.

madisonb commented 5 years ago
  1. The item copy is just that: it creates a new copy in memory, so when we delete the keys we do not modify the original dictionary. Without the copy, deleting keys inside the function would also delete them from the original dictionary.
    # example where a python function modifies the original dictionary
    $ python
    Python 3.7.3 (default, Mar 27 2019, 09:23:15)
    [Clang 10.0.1 (clang-1001.0.46.3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> d = {'key':'value'}
    >>> def f(d):
    ...   del d['key']
    ...
    >>> f(d)
    >>> d
    {}
    >>>

    vs

    # non-modifying version
    $ python
    Python 3.7.3 (default, Mar 27 2019, 09:23:15)
    [Clang 10.0.1 (clang-1001.0.46.3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> d = {'key':'value2'}
    >>> def f(d):
    ...   d2 = dict(d)
    ...   del d2['key']
    ...   print(d2)
    ...
    >>> d
    {'key': 'value2'}
    >>> f(d)
    {}
    >>> d
    {'key': 'value2'}
    >>>
  2. Correct, you can do whatever item logic you want here, but this project assumes you are utilizing the RawResponseItem class. If you want to make your own modifications for your own items you certainly are welcome to fork this project.
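
To tie point 1 back to the pipeline, here is a minimal sketch of the same copy-then-trim pattern. It is not the project's actual code: `log_item` is a hypothetical stand-in for a process_item method, the item is a plain dict rather than a RawResponseItem, and the "url" field is assumed for illustration ("body" and "links" are the keys mentioned in the question).

```python
import logging

logger = logging.getLogger("pipeline-sketch")  # hypothetical logger name


def log_item(item):
    """Log a trimmed view of `item` and return the original untouched."""
    # dict(item) builds a new dictionary holding the same values (a shallow copy).
    item_copy = dict(item)
    # Drop the bulky fields from the copy only; a missing key is ignored.
    for key in ("body", "links"):
        item_copy.pop(key, None)
    logger.info("Scraped page: %s", item_copy)
    # Return the original so later pipeline stages still see every field.
    return item


item = {"url": "http://example.com", "body": "<html>...</html>", "links": []}
result = log_item(item)
print(sorted(result))  # ['body', 'links', 'url']
```

Note that `dict(item)` is a shallow copy: removing keys from the copy is safe, but mutating a nested value in place (for example appending to `item_copy["links"]`) would still change the original item.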
ChengkaiYang2022 commented 5 years ago

Thanks for the reply! I understand now :)