jim256 / ebay

Change item_dropped to item_scraped to utilize Scrapy built-in performance stats? #17

Closed jim256 closed 4 years ago

jim256 commented 4 years ago

I just ran the latest code and noticed that it reported 0 items scraped, but the ending stats said it "dropped" 6647.

Is that because it's counting them all as "dropped" rather than "scraped"? My understanding from the docs was that "dropped" meant the pipeline had received the item but decided not to process and persist it.

Would you mind clarifying this point?

james-carpenter commented 4 years ago

I use a pipeline called ItemEater to drop items from the pipeline after they have been saved to the database, since we are already accounting for them at that point. If you want the default behavior, just comment out the ItemEaterPipeline line in settings.py. Then you'll get item_scraped_count in the stats.
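As a rough sketch of that settings change (the `ebay.pipelines` module path, the `DatabasePipeline` name, and the priority numbers are assumptions for illustration, not copied from the repo):

```python
# settings.py (sketch; module path, DatabasePipeline name, and priorities are hypothetical)
ITEM_PIPELINES = {
    # Lower numbers run first, so the database stage sees each item before anything else.
    "ebay.pipelines.DatabasePipeline": 300,
    # Commented out: items now reach the end of the pipeline and count as scraped.
    # "ebay.pipelines.ItemEaterPipeline": 900,
}
```

With the eater stage disabled, the end-of-run stats should report item_scraped_count instead of item_dropped_count.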

jim256 commented 4 years ago

I want to be sure that I understand the implications of any changes I make. Are there any other effects that this would cause? Would you mind elaborating on why you chose the ItemEater pipeline?

james-carpenter commented 4 years ago

No, there wouldn't be any negative side effects in production runs. The only real difference (and the main reason I put it in there) is that without the ItemEater, at the DEBUG log level, you will get a full copy of every item dumped into the log. That pipeline stage can be safely removed, especially when running at the INFO log level, where it makes no difference.
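For context, a stage like that can be very small. This is a minimal sketch of the idea, assuming it simply raises Scrapy's DropItem for every item; the actual ItemEaterPipeline in the repo may differ, and the drop message is made up:

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so this sketch still runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class ItemEaterPipeline:
    """Sketch: drop every item, assuming an earlier stage already saved it."""

    def process_item(self, item, spider):
        # Raising DropItem stops further pipeline processing for this item,
        # and Scrapy counts it under item_dropped_count.
        raise DropItem("already saved to the database")
```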

In the future, if you wanted to do something with the items after saving them to the database, you could add another pipeline to take further action if you don't eat them.
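A later stage only has to return the item to keep it moving through the pipeline. Here is a hypothetical example (the class name and the counter are placeholders for whatever follow-up action you need), which would be registered in ITEM_PIPELINES with a higher number than the database stage:

```python
class PostSaveCounterPipeline:
    """Hypothetical follow-up stage that runs after the database stage."""

    def __init__(self):
        self.seen = 0  # number of items that reached this stage

    def process_item(self, item, spider):
        # Placeholder for any further action on an already-saved item.
        self.seen += 1
        # Returning the item (instead of raising DropItem) lets it count as scraped.
        return item
```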