Open atb00ker opened 6 years ago
In many cases it may legitimately happen that parsed != (dropped + stored). Before continuing the discussion, let's understand the terms:
parsed
: URLs that were sent to the parse_article function to finally be yielded.
scraped
: URLs that were received by parse_article and were yielded without errors.
dropped
: URLs that were dropped due to errors or duplicates. P.S. It may happen that dropped > parsed, because a duplicate check applied in the spider itself stops a URL before it ever reaches the parse_article function.
stored
: URLs stored successfully in the database.
Now, regarding the issue: what you need to check for error verification is
parsed = stored (if the duplicate check is applied in the spider too)
parsed = dropped + stored (if only the pipelines handle duplicates)
If the applicable equation fails, it means there are errors.
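The two checks above can be written down as a small function. This is only a sketch: the name `counts_consistent` and the `spider_checks_duplicates` flag are made up for illustration and are not part of the project's code.

```python
def counts_consistent(parsed, dropped, stored, spider_checks_duplicates):
    """Return True when the counters agree with the rule described above.

    Hypothetical helper; all four parameters are plain numbers supplied
    by the caller (for example, read from the spider's stats).
    """
    if spider_checks_duplicates:
        # Duplicates never reach parse_article, so everything parsed
        # is expected to end up stored.
        return parsed == stored
    # Only the pipelines drop duplicates, so every parsed URL is
    # expected to be either dropped or stored.
    return parsed == dropped + stored


# Illustrative numbers only, not taken from a real crawl.
print(counts_consistent(100, 0, 100, spider_checks_duplicates=True))
print(counts_consistent(100, 20, 80, spider_checks_duplicates=False))
print(counts_consistent(100, 20, 70, spider_checks_duplicates=False))
```

With these made-up numbers, the first two calls agree with the rule and the third does not, which is the case that would point to errors.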
> Before continuing the discussion, let's understand the terms
You have misunderstood the terms; refer to: https://github.com/vipulgupta2048/scrape/projects/1#card-6130099
> Now, regarding the issue, What you need to check for error verification is, parsed = stored (if duplicates check is applied on the spider too) parsed = dropped + stored ( if only pipelines handle duplicates) If above fails, means errors.
Maybe I did not explain the issue properly:
requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times callback runs + times errback runs
This equation needs to hold, but it fails, and that is the issue! :)
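To make the equation concrete, here is a minimal, Scrapy-independent sketch of the bookkeeping. The `RequestTally` class and its method names are hypothetical; in a real spider the three counters would be bumped where the request is created, inside the callback, and inside the errback.

```python
from collections import Counter


class RequestTally:
    """Hypothetical helper that counts request outcomes for comparison."""

    def __init__(self):
        self.counts = Counter()

    def sent(self):
        # Call once for every scrapy.Request the spider creates.
        self.counts["sent"] += 1

    def callback_ran(self):
        # Call at the top of the callback (self.parse in the issue).
        self.counts["callback"] += 1

    def errback_ran(self):
        # Call at the top of the errback (self.errorRequestHandler).
        self.counts["errback"] += 1

    def unaccounted(self):
        # Requests that triggered neither the callback nor the errback.
        return (
            self.counts["sent"]
            - self.counts["callback"]
            - self.counts["errback"]
        )


# Simulated run with made-up numbers: 5 sent, 3 callbacks, 1 errback.
tally = RequestTally()
for _ in range(5):
    tally.sent()
for _ in range(3):
    tally.callback_ran()
tally.errback_ran()
print(tally.unaccounted())  # prints 1
```

A nonzero result means the equation fails. One thing worth ruling out, as an assumption rather than a diagnosis of this bug: requests filtered before download, for example by Scrapy's duplicate filter when dont_filter is not set, reach neither the callback nor the errback, so they would show up as unaccounted here.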
> Maybe i did not explain the issue properly; requests send by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times callback runs+ times errback runs This equation needs to be true but fails and that is the issue! :)
Oh. What key is updated when the errback runs? And what key is updated when the pipeline drops an item?
> > Maybe i did not explain the issue properly; requests send by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times callback runs+ times errback runs This equation needs to be true but fails and that is the issue! :)
>
> Oh. What key is updated when errback runs? and what key is updated when the pipeline drops an item?
None of the keys; please read the code before posting questions! :)
The total number of requests sent does not equal received + dropped/failed for some spiders! The bug needs to be addressed to ensure the integrity of the database! The following spiders have this problem: IndiaTv, time(Tech), firstpost(sports), firstpost(hindi). More spiders might have the issue, but these are the ones caught misbehaving so far! @thisisayush