nrabinowitz / pjscrape

A web-scraping framework written in Javascript, using PhantomJS and jQuery
http://nrabinowitz.github.io/pjscrape/
MIT License
997 stars 159 forks

preSuite(page) function #4

Open nrabinowitz opened 12 years ago

nrabinowitz commented 12 years ago

Add an option for a preSuite function, run in the PhantomJS environment with page passed in as an argument, to support things like session-based authentication before a scrape.

macedd commented 12 years ago

Any plans for this? I think it's a big gap not to have control over the PhantomJS objects. I can help!

nrabinowitz commented 12 years ago

I agree - I just haven't had much time for this project lately. If you'd like to contribute, please fork and submit a pull request - that would be great!

macedd commented 12 years ago

I took a deep look at the code, and it is not very clear to me how to easily accomplish this "preSuite", which would allow form authentication, for example. If you have a clue, please let me know.

nrabinowitz commented 12 years ago

My thought was that this would be pretty simple - just a hook for an arbitrary function to run before the suites started, say here, passing in SuiteManager.getPage(). This would allow arbitrary pre-suite automation, e.g. logging in, using the WebPage object that will then get used in the scraper suites.

macedd commented 12 years ago

Hi nrabinowitz, thank you for still caring about this great piece of software. I took your point into consideration and studied phantom/pjscrape a bit more, and maybe we aren't on the right track yet with "preSuite".

A login process with phantom is a step-by-step, waitFor-like action (http://groups.google.com/group/phantomjs/browse_thread/thread/db4cfc37caf0213c#). Also, these steps should happen inside the opened page (WebPage.open()), because that is where the cookies and session exist (I cannot confirm this from the documentation, only from examples). So, in the current implementation of pjscrape, where moreUrls and suites each open a page (and lose context), we cannot bind authentication sessions to the scrape in a straightforward manner.

Fortunately, setting up the WebPage objects may be as simple as what the pageSettings pull request implements (it can be improved), but things like step-by-step navigation or a login aren't that simple (IMHO).

So in my view we must refactor the code to allow use cases like the ones we are discussing, and also add these new tools to the library.

nrabinowitz commented 12 years ago

Ok, I see the point that it needs to handle an asynchronous process. But it doesn't make any sense for it to happen within WebPage.open(), because that would restrict it to a single page, and what's needed is a multiple-page process. I haven't tested it, but I'm pretty sure cookies and session are attached to the WebPage object and will survive multiple open calls (otherwise what's the point?).

I'm not in favor of a big refactor at this point, and I don't think I see the requirement. I think what we need is preSuite(page, callback) - this allows you to do whatever async initialization you want, then invoke the callback when you're done. The callback would then kick off the suite runner and run as usual.
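To make the proposal concrete, here is a rough sketch of the preSuite(page, callback) flow. It is not tested against pjscrape or PhantomJS: `runSuites`, `config.preSuite`, `fakePage`, and the example.com URLs are made-up names for illustration, and the stub page calls back immediately, whereas a real WebPage.open() is asynchronous.

```javascript
// Sketch only: runSuites, config.preSuite and fakePage are hypothetical
// names for illustration, not part of the pjscrape API.

// Stand-in for the shared WebPage object. A real PhantomJS page opens URLs
// asynchronously; this stub calls back right away to keep the flow visible.
var fakePage = {
    opened: [],
    open: function (url, callback) {
        this.opened.push(url);
        callback('success');
    }
};

var log = [];

// Run the optional preSuite hook, then start the suites once it calls back.
function runSuites(config, page, suites) {
    function start() {
        suites.forEach(function (suite) {
            log.push('suite:' + suite.url);
        });
    }
    if (typeof config.preSuite === 'function') {
        config.preSuite(page, start);
    } else {
        start();
    }
}

// Example hook: a two-page login that finishes before any suite runs.
var config = {
    preSuite: function (page, callback) {
        page.open('http://example.com/login', function (status) {
            log.push('login:' + status);
            page.open('http://example.com/confirm', function (status) {
                log.push('confirm:' + status);
                callback();
            });
        });
    }
};

runSuites(config, fakePage, [{ url: 'http://example.com/data' }]);
```

Because the hook receives the same page object the suites would use, whatever session state the login leaves on that object would carry over, assuming cookies do survive multiple open calls as guessed above.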

donl commented 11 years ago

I exposed getPage in the pjs namespace and disabled the pjs.init() in pjscrape.

This quick hack allowed me to retrieve the shared WebPage object, do some custom authentication, and start the pjs.init process myself once my custom needs were met. So far, it looks like it'll do the trick for my needs.

I like the idea of something more elegant like the preSuite(page, callback) - though it is beyond me at the moment :)
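Roughly, the hack described above has the following shape. This is only a sketch with stand-ins: the `pjs` object, `sharedPage`, and `authenticate` below are stubs written for illustration (stock pjscrape does not expose getPage), and the cookie value is invented.

```javascript
// Sketch only: this pjs object is a stub standing in for a patched
// pjscrape.js that exposes getPage and no longer calls init on its own.
var events = [];

// Stand-in for the shared WebPage object used by the scraper suites.
var sharedPage = { cookies: [] };

var pjs = {
    // Exposed by the patch so user code can reach the shared page.
    getPage: function () {
        return sharedPage;
    },
    // Normally started automatically; here the user calls it manually.
    init: function () {
        events.push('init with ' + sharedPage.cookies.length + ' cookie(s)');
    }
};

// Hypothetical custom authentication step; a real one would drive a login
// form on the page and wait for it to finish before calling done().
function authenticate(page, done) {
    page.cookies.push({ name: 'session', value: 'abc123' });
    events.push('authenticated');
    done();
}

// Authenticate on the shared page first, then start the scrape.
authenticate(pjs.getPage(), function () {
    pjs.init();
});
```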

devloic commented 10 years ago

@donl do you still have the modified pjscrape.js script available, and could you publish it? thx

nathanielrindlaub commented 7 years ago

Has there been any progress on this? I'm also interested in scraping a page that requires login credentials and would appreciate any advice or guidance!