trueuoc / spa_housing_crawler

Web Crawler for scraping Spanish housing prices (idealista)

Removed houses are stored as denied #1

Open jbarrerobuch opened 3 years ago

jbarrerobuch commented 3 years ago

When a house's link is saved for scraping and the offer is removed afterwards, the crawler stores it in the denied houses list when it tries to scrape the data. That's because of the yield in line 86 of the houses spider:

yield response.follow(next_page_url[0], callback=self.parse, errback=self.parse_deny)

The removed house offer returns a 403 error, but the spider must understand the difference between denied access and a removed offer. Otherwise the denied list is never emptied and the scrape won't be complete. Also, since the houses spider is called by zones and subzones, if there's one house link removed (not denied), the scrape gets stuck in that zone until it completes the zone, which it can't do while the removed link stays in the denied list.
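To make the bookkeeping concrete, here is a minimal sketch of how a zone could keep removed offers apart from denied ones so that it can actually finish. It is not the spider's code: `ZoneProgress`, its fields and the `looks_removed` flag are hypothetical names, and how a removed offer is recognized is left to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class ZoneProgress:
    """Hypothetical per-zone bookkeeping (not part of the original spider)."""

    pending: set = field(default_factory=set)   # links still to be scraped
    denied: set = field(default_factory=set)    # blocked links, to be retried
    removed: set = field(default_factory=set)   # offers that no longer exist

    def record_success(self, url: str) -> None:
        # A scraped link leaves both the pending set and the retry list.
        self.pending.discard(url)
        self.denied.discard(url)

    def record_failure(self, url: str, status: int, looks_removed: bool) -> None:
        self.pending.discard(url)
        if status == 403 and looks_removed:
            # Removed offer: retrying is pointless, so it never enters `denied`.
            self.denied.discard(url)
            self.removed.add(url)
        else:
            # Genuine denial: keep it so a later pass can retry it.
            self.denied.add(url)

    def is_complete(self) -> bool:
        # Removed links do not count against completion.
        return not self.pending and not self.denied
```

With this split, a zone whose only failures are removed offers reports `is_complete()` as true, instead of waiting forever on links that will always return 403.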

jbarrerobuch commented 3 years ago

I'm trying to propose a fix, but I'm quite new to this. I cannot get the XPath for the text: "Lo sentimos, la dirección que has introducido en tu navegador no corresponde a ninguna página de idealista." ("Sorry, the address you entered in your browser does not correspond to any idealista page.")

https://www.idealista.com/inmueble/94145922/

That's the key to recognizing whether the house link has been removed.
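One possible way to check for that message without knowing the exact element it lives in is to search the page's visible text for a fragment of it. The sketch below uses only the standard library; `REMOVED_MARKER` and `is_removed_offer` are hypothetical names, and it assumes the removed-offer page really contains the sentence quoted above, which has not been verified against the live site.

```python
from html.parser import HTMLParser

# Fragment of the message quoted above. Assumption: it appears on the page
# returned for a removed offer and not on a plain "access denied" page.
REMOVED_MARKER = "no corresponde a ninguna página de idealista"


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def is_removed_offer(html: str) -> bool:
    """Return True if the page text contains the removed-offer message."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Collapse whitespace so line breaks inside the sentence do not matter.
    text = " ".join(" ".join(collector.chunks).split())
    return REMOVED_MARKER in text
```

In a Scrapy errback, the body of the 403 response should be reachable through `failure.value.response` when the failure is an `HttpError`, so something like `is_removed_offer(failure.value.response.text)` could decide whether the link goes to the denied list. With Scrapy selectors, an XPath along the lines of `//*[contains(normalize-space(.), "no corresponde a ninguna página de idealista")]` would presumably match the same message, though that is untested here.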

marcosrullan commented 3 months ago

Hello, Thank you very much for participating and showing us a bug. To be honest, this was a career project and once it was submitted we didn't pay attention to it. Sorry it took me so long to get back to you but I was pleased to read that someone was interested in how it worked.

jbarrerobuch commented 3 months ago

Hey Marcos!

No problem, actually I was using it for learning. I have to say that it worked great for that purpose, and at that time I wasn't able to fix the bug. I'm looking to return to this project and implement the use of a proxy list, because idealista blocks my requests after a few of them.

Anyway, thanks for your answer! Really interesting subject. I'm looking to get offers from all over Spain to analyze the market 😝.
