jeffdeville / sherlock_homes

Issues due to SitePrism's lazy approach #16

Open algodave opened 9 years ago

algodave commented 9 years ago

@jeffdeville @safeforge Let's discuss the following.

SitePrism doesn't actually fetch an element's value until the corresponding method is invoked (a lazy approach). This prevents us from using our scrapers as they're currently defined, that is, with the SitePrism DSL only. In the Pipeline, the Redfin scraper is invoked first and then the Trulia scraper; by the time the Redfin mapper tries to read an element's text (e.g. basic_info.floors.text), it finds the Trulia page in the Capybara session.

What I suggest as a solution is making our scrapers stateful, meaning: let's extract the texts we need from the Capybara elements right after the page is loaded.
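
To make the difference concrete, here is a minimal sketch that does not use Capybara or SitePrism at all. FakeSession, LazyPage and EagerPage are hypothetical stand-ins I made up for illustration: FakeSession plays the role of the shared Capybara session, LazyPage reads on demand like our current scrapers, and EagerPage copies the text right after the page is loaded.

```ruby
# Hypothetical stand-in for the global Capybara session:
# there is only one "current page" at any time.
class FakeSession
  def visit(page)
    @current_page = page
  end

  def text_of(field)
    @current_page.fetch(field)
  end
end

# Lazy: reads from the session only when the method is called.
class LazyPage
  def initialize(session)
    @session = session
  end

  def floors
    @session.text_of(:floors)
  end
end

# Eager ("stateful"): copies the text it needs right after the page loads.
class EagerPage
  attr_reader :floors

  def initialize(session)
    @floors = session.text_of(:floors)
  end
end

session = FakeSession.new
session.visit({ floors: "2 (Redfin)" })
lazy_redfin  = LazyPage.new(session)
eager_redfin = EagerPage.new(session)

# The next scraper navigates away before the mapper runs.
session.visit({ floors: "3 (Trulia)" })

puts lazy_redfin.floors  # reads whatever page is loaded now
puts eager_redfin.floors # still the value captured from the first page
```

In this sketch the lazy object reports the Trulia value while the eager one keeps the Redfin value, which is the behaviour we would want from the real scrapers.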

Looking forward to your feedback!

jeffdeville commented 9 years ago

Wow, I have to confess I hadn't foreseen so many issues when doing this with Capybara. Darn global variables...

Well, we seem to have 2 options:

  1. What @algodave suggests
  2. We could also scrape/map each site in series. So it'd be:
Mapper::Redfin.map(
  SherlockHomes::Scraper::Redfin.find(property_url)
)

Then, instead of step 1 being to scrape everything and step 2 being to map everything, there would be a separate step for each provider:

{
  search_redfin: { success: :search_trulia },
  search_trulia: { success: :search_zillow }
}
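
Roughly, the per-provider flow could look like the sketch below. STEPS, scrape_and_map, run_pipeline and the example URL are hypothetical names used only for illustration, not the project's actual Pipeline API; scrape_and_map stands in for "visit one provider's page and map it before moving on".

```ruby
# Hypothetical step table: each step names the step to run on success.
STEPS = {
  search_redfin: { success: :search_trulia },
  search_trulia: { success: :search_zillow },
  search_zillow: { success: nil }
}.freeze

# Placeholder for scraping and mapping a single provider in one go,
# so the page is read while it is still the current one in the session.
def scrape_and_map(step, property_url)
  { provider: step.to_s.sub("search_", ""), url: property_url }
end

# Walk the step table, finishing each provider before starting the next.
def run_pipeline(first_step, property_url)
  results = []
  step = first_step
  while step
    results << scrape_and_map(step, property_url)
    step = STEPS.fetch(step)[:success]
  end
  results
end

results = run_pipeline(:search_redfin, "https://example.com/property/1")
puts results.map { |r| r[:provider] }.inspect
```

Because each provider is scraped and mapped inside the same step, nothing is read from the session after the next provider's page has been loaded.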

I think option 2 would mean less code, because you don't have to extract the data into another structure, but I'm OK with either approach.

algodave commented 9 years ago

@jeffdeville Option 2 was pretty straightforward; I just shared it.