vipulgupta2048 / scrape

Mission - scraping the planet, one website at a time
MIT License
10 stars 13 forks

Lost Links causing failing data integrity! #51

Open atb00ker opened 6 years ago

atb00ker commented 6 years ago

The total number of requests sent does not equal received + dropped/failed for some spiders! This bug needs to be addressed to ensure the integrity of the database! The following spiders have this problem: IndiaTv, time(Tech), firstpost(sports), firstpost(hindi). More spiders might have the issue, but these are the ones caught misbehaving so far! @thisisayush

thisisayush commented 6 years ago

In many cases it may happen that parsed != (dropped + stored). There are reasons for this, but before continuing the discussion, let's understand the terms:

parsed: URLs that were sent to the parse_article function to finally be yielded.

scraped: URLs that were received by parse_article and were yielded without errors.

dropped: URLs that were dropped due to errors or duplicates. P.S. Dropped URLs may exceed parsed URLs, because a duplicate check applied in the spider itself stops a URL before it reaches parse_article.

stored: URLs stored successfully in the database.

thisisayush commented 6 years ago

Now, regarding the issue, what you need to check for error verification is:

parsed = stored (if the duplicate check is applied in the spider too)

parsed = dropped + stored (if only the pipelines handle duplicates)

If the applicable check fails, there are errors.
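The two checks above can be written down as a small helper. This is only a sketch: the function name `counts_are_consistent`, its parameters, and the numbers in the example are hypothetical and not taken from the project's code.

```python
def counts_are_consistent(parsed: int, dropped: int, stored: int,
                          spider_checks_duplicates: bool) -> bool:
    """Return True when the counters satisfy the check described above."""
    if spider_checks_duplicates:
        # Duplicates are stopped before parse_article runs,
        # so every parsed URL is expected to end up stored.
        return parsed == stored
    # Only the pipelines handle duplicates, so every parsed URL
    # is expected to be either dropped or stored.
    return parsed == dropped + stored


# Made-up numbers, for illustration only.
print(counts_are_consistent(parsed=120, dropped=40, stored=120,
                            spider_checks_duplicates=True))
print(counts_are_consistent(parsed=120, dropped=15, stored=100,
                            spider_checks_duplicates=False))
```

In the second call five URLs are unaccounted for, which is the kind of mismatch that would indicate errors.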

atb00ker commented 6 years ago

> Before continuing the discussion, let's understand the terms

You have misunderstood the terms; refer to: https://github.com/vipulgupta2048/scrape/projects/1#card-6130099

> Now, regarding the issue, what you need to check for error verification is: parsed = stored (if the duplicate check is applied in the spider too); parsed = dropped + stored (if only the pipelines handle duplicates). If the applicable check fails, there are errors.

Maybe I did not explain the issue properly. The number of requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) should equal the number of times the callback runs + the number of times the errback runs. This equation needs to hold, but it fails, and that is the issue! :)
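One way to find out which requests go missing is to keep a ledger per URL. The sketch below is plain Python and not part of the project: `RequestLedger` and its methods are hypothetical names, and the example.com URLs are placeholders. In a spider, `record_sent` would be called where the scrapy.Request is created, `record_callback` at the top of self.parse, and `record_errback` in self.errorRequestHandler.

```python
from collections import Counter


class RequestLedger:
    """Counts, per URL, requests sent and handler runs (hypothetical helper)."""

    def __init__(self):
        self.sent = Counter()
        self.callbacks = Counter()
        self.errbacks = Counter()

    def record_sent(self, url: str) -> None:
        self.sent[url] += 1

    def record_callback(self, url: str) -> None:
        self.callbacks[url] += 1

    def record_errback(self, url: str) -> None:
        self.errbacks[url] += 1

    def is_balanced(self) -> bool:
        """True when sent == callback runs + errback runs."""
        answered = sum(self.callbacks.values()) + sum(self.errbacks.values())
        return sum(self.sent.values()) == answered

    def lost(self) -> Counter:
        """URLs sent more often than they were answered by either handler."""
        # Counter subtraction keeps only the positive remainders.
        return self.sent - self.callbacks - self.errbacks


ledger = RequestLedger()
for url in ("https://example.com/a", "https://example.com/b", "https://example.com/c"):
    ledger.record_sent(url)
ledger.record_callback("https://example.com/a")
ledger.record_errback("https://example.com/b")

print(ledger.is_balanced())   # False
print(sorted(ledger.lost()))  # ['https://example.com/c']
```

One cause worth ruling out (an assumption, not confirmed for these spiders): Scrapy's duplicate filter drops a repeated request before it is downloaded, without running the callback or the errback, unless the request is created with dont_filter=True.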

thisisayush commented 6 years ago

> Maybe I did not explain the issue properly. The number of requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) should equal the number of times the callback runs + the number of times the errback runs. This equation needs to hold, but it fails, and that is the issue! :)

Oh. Which key is updated when the errback runs, and which key is updated when the pipeline drops an item?

atb00ker commented 6 years ago

> > Maybe I did not explain the issue properly. The number of requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) should equal the number of times the callback runs + the number of times the errback runs. This equation needs to hold, but it fails, and that is the issue! :)

> Oh. Which key is updated when the errback runs, and which key is updated when the pipeline drops an item?

None of the keys; please read the code before posting questions! :)