Issues due to SitePrism's lazy approach

algodave commented 9 years ago

@jeffdeville @safeforge Let's discuss the following.

SitePrism doesn't actually fetch an element's value until the element's method is invoked (lazy approach). This prevents us from using our scrapers as they're currently defined, that is, with the SitePrism DSL only. In the Pipeline, the Redfin scraper is invoked first, then the Trulia scraper; by the time the Redfin mapper tries to read an element's text (e.g. basic_info.floors.text), it finds the Trulia page in the Capybara session.

What I suggest as a solution is making our scrapers stateful, meaning: let's extract the text we need from the Capybara elements right after the page is loaded.

for each SitePrism element, we should have a String instance variable holding its text value
for each SitePrism section, we should have a Hash instance variable holding its elements' values
for each of the two above, we should have an attr_reader to make them available to callers
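
A minimal sketch of the stateful idea, using plain Ruby stand-ins rather than the real SitePrism or Capybara APIs. FakeSession, LazyElement and StatefulScraper are hypothetical names made up for illustration, and the page contents are invented; the point is only that a value copied right after loading survives a later page load, while a lazy read does not.

```ruby
# Stand-in for the single shared Capybara session: it only knows the
# page that was loaded most recently. (Hypothetical, for illustration.)
class FakeSession
  attr_reader :current_page

  def visit(page)
    @current_page = page
  end
end

# Stand-in for a lazy SitePrism-style element: the text is looked up in
# the session when #text is called, not when the object is built.
class LazyElement
  def initialize(session, key)
    @session = session
    @key = key
  end

  def text
    @session.current_page.fetch(@key)
  end
end

# Stateful scraper: copies the values it needs into plain Ruby objects
# right after the page is loaded and exposes them through attr_reader.
class StatefulScraper
  attr_reader :floors, :basic_info

  def initialize(session, page)
    session.visit(page)
    # element -> String instance variable
    @floors = LazyElement.new(session, :floors).text
    # section -> Hash instance variable
    @basic_info = {
      floors: @floors,
      beds: LazyElement.new(session, :beds).text
    }
  end
end

session = FakeSession.new
redfin_page = { floors: "2", beds: "3" } # invented sample data
trulia_page = { floors: "1", beds: "4" } # invented sample data

redfin = StatefulScraper.new(session, redfin_page)
lazy_redfin_floors = LazyElement.new(session, :floors)
trulia = StatefulScraper.new(session, trulia_page)

lazy_redfin_floors.text # reads the Trulia page, because the session moved on
redfin.floors           # still the value captured from the Redfin page
```

The cost is the extra structure each scraper has to maintain: one instance variable per element or section that a mapper reads.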

Looking forward for your feedback!

jeffdeville commented 9 years ago

Wow, I have to confess I hadn't foreseen so many issues when doing this with Capybara. Darn global variables...

Well, we seem to have 2 options:

1. What @algodave suggests
2. We could also scrape/map each site in series. So it'd be:

Mapper::Redfin.map(
  SherlockHomes::Scraper::Redfin.find(property_url)
)

and then, instead of step 1 being to scrape everything and step 2 being to map everything, there'd be separate steps for each provider.

{
  search_redfin: { success: search_trulia },
  search_trulia: { success: search_zillow }
}

I think option 2 would mean less code, because you don't have to extract the data into another structure, but I'm OK with either approach.
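
To make the difference between the two orderings concrete, here is a rough sketch with plain Ruby stand-ins. SESSION, scrape_page, map_provider and the page data are all hypothetical and are not the project's actual scrapers, mappers or pipeline; the lambda only imitates a lazy element read against a shared session.

```ruby
# Plays the role of the shared Capybara session (hypothetical stand-in).
SESSION = { page: nil }

# Hypothetical scraper: loads a page into the shared session and returns
# a lazy reader that consults the session only when it is called.
def scrape_page(page)
  SESSION[:page] = page
  ->(key) { SESSION[:page].fetch(key) }
end

# Hypothetical mapper: reads the values it needs through the lazy reader.
def map_provider(provider, reader)
  { provider: provider, floors: reader.call(:floors) }
end

# Invented sample data.
PAGES = { redfin: { floors: "2" }, trulia: { floors: "1" } }

# Scrape everything, then map everything: every reader ends up looking
# at whichever page was loaded last.
readers = PAGES.map { |name, page| [name, scrape_page(page)] }
batched = readers.map { |name, reader| map_provider(name, reader) }

# Scrape and map in series: each provider is mapped while its own page
# is still the one in the session.
in_series = PAGES.map { |name, page| map_provider(name, scrape_page(page)) }
```

In this sketch, batched carries the Trulia value for both providers, while in_series keeps each provider's own value without copying anything into a separate structure.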

algodave commented 9 years ago

@jeffdeville option 2 was pretty straightforward, I just shared it

jeffdeville / sherlock_homes

Issues due to SitePrism's lazy approach #16