medialab / artoo

artoo.js - the client-side scraping companion.
http://medialab.github.io/artoo/
MIT License
1.1k stars 93 forks

Add ability to scrape external html #222

Open oveddan opened 9 years ago

oveddan commented 9 years ago

Let's say we have a script that downloads external HTML, and we want to scrape it. It would be great if artoo could be used for this, instead of always scraping the current page.

Yomguithereal commented 9 years ago

Hello @oveddan, actually you can already do that. The artoo.scrape method can be given a jQuery selection rather than a selector string, so you can apply it to other parsed HTML.

There is even a little helper called artoo.helpers.jquerify that will parse many kinds of HTML/XML strings without feeding them to the DOM, which avoids silly side effects such as loading images.

Alternatively you can use $(sel).scrape to do the job where $ points to another document.

Example

var htmlString = '<div>Hey</div>';

// Parsing the html with caution (or through your own way if you prefer)
var $doc = artoo.helpers.jquerify(htmlString);

// Then scrape likewise
var data = artoo.scrape($doc.find('div'), {...});

// or
var data = $doc.find('div').scrape(...);

As a side note, the ajax spider already has some sugar to fetch, parse and scrape the targeted URL's HTML.

Yomguithereal commented 9 years ago

You can even use artoo with node.js and cheerio if needed.

oveddan commented 9 years ago

I can't include jQuery in my project - is there a way to do it without jQuery?

Yomguithereal commented 9 years ago

No, currently jQuery is a dependency of artoo, so if you remove jQuery, it won't work. It might work with jQuery-like subsets such as Zepto. But why can't you include jQuery in your project? Do you have another variable named $?

oveddan commented 9 years ago

Compatibility issues with react-native. I was hoping for a simple lightweight dom parsing library independent of jquery.

Yomguithereal commented 9 years ago

Ok, I understand your use case better now. Do you have a precise library in mind to perform the parsing? My aim, lately, was to rebuild the library on at least commonJS so I could clearly define and require different parts of the library. I guess that what you are interested in is artoo.scrape.

I could fiddle something to see whether it would be easy to hook the scraping utilities on some other parsing library.

oveddan commented 9 years ago

Sadly, jQuery is the best parsing library I know! So I can see why it's a natural fit for artoo. There is the HTML5 DOMParser, but all that does is convert HTML into a DOM document.

oveddan commented 9 years ago

Would artoo.scrape work without jquery?

oveddan commented 9 years ago

It might also be possible to leverage querySelector.

Yomguithereal commented 9 years ago

For the time being, artoo.scrape wouldn't work without jQuery. But I guess it could be possible to fiddle something to leverage querySelector (while of course dropping all of the jQuery sugar in the process).
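To make the querySelector idea concrete, here is a minimal sketch of what a jQuery-free scrape could look like. It is not artoo code: the function name scrapeWith and the model shape ({sel, attr} objects or functions) are assumptions made up for this illustration, and it relies only on querySelectorAll, querySelector, getAttribute and textContent.

```javascript
// Hypothetical sketch, NOT artoo's API: `scrapeWith` and the model shape are
// assumptions made for illustration only.
//
// root:     anything exposing querySelectorAll (a Document, an Element, ...)
// selector: CSS selector matching the items to iterate over
// model:    { key: {sel: 'css', attr: 'name'} | function (el) { ... } }
//           an empty object ({}) means "text of the item itself"
function scrapeWith(root, selector, model) {
  return Array.from(root.querySelectorAll(selector), function (el) {
    var item = {};

    Object.keys(model).forEach(function (key) {
      var spec = model[key];

      // A function receives the raw element and computes the value itself.
      if (typeof spec === 'function') {
        item[key] = spec(el);
        return;
      }

      // Otherwise, optionally narrow down to a sub-element first.
      var target = spec.sel ? el.querySelector(spec.sel) : el;

      if (!target) {
        item[key] = null; // sub-element not found
      } else if (spec.attr) {
        item[key] = target.getAttribute(spec.attr);
      } else {
        item[key] = target.textContent;
      }
    });

    return item;
  });
}

// In a browser, an HTML string can be parsed without touching the live page:
//   var doc = new DOMParser().parseFromString(htmlString, 'text/html');
//   var data = scrapeWith(doc, 'ul > li', {
//     title: {},
//     url: {sel: 'a', attr: 'href'}
//   });
```

This still needs something that produces an object with querySelectorAll in the first place, which is the part DOMParser covers in a browser.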

Other options could be to use Sizzle (jQuery's selector engine) alone, or even to see whether cheerio could work in a browser env (but I assume you are not strictly in a browser env in your use case). Btw, what are the restrictions imposed by react-native concerning the libraries you may use?

Yomguithereal commented 9 years ago

Note that if cheerio can be used with react-native then you can use the node version of artoo in your case.

oveddan commented 9 years ago

The main restriction is actually in Sizzle - it attempts to call document.createElement, but that's not supported in react-native.

Yomguithereal commented 9 years ago

Ok. So I guess you should try cheerio with artoo's node version then. It does not rely on the DOM whatsoever and is actually faster than Sizzle.

Yomguithereal commented 9 years ago

Any update about this @oveddan?

oveddan commented 9 years ago

Thanks for following up @Yomguithereal

I actually got jQuery working within jscore (which is what react-native uses) by building jQuery and modifying one line, https://github.com/jquery/jquery/blob/66e1b6b8d49812239b5712d65922ff94c60f7b02/src/intro.js#L25, removing the global.document ? condition. With jscore there is a global document, but it does not have createElement, which is what Sizzle looks for, so it breaks when that method doesn't exist.

I think artoo would almost get me there - the challenge it seems is that with artoo in node you do:

var $ = cheerio.load(myXMLString);
// Setting artoo's context
artoo.setContext($);

This essentially binds the global context of artoo to one document - when I could be scraping multiple documents concurrently.

It would be better if it worked like:

var $ = cheerio.load(myXMLString);
// Setting artoo's context:
var context = artoo.Context($);
var data = context.scrape('ul > li', params);

Just as with jQuery in node - you do:

var $ = require('jquery')(window);
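
The per-document context asked for above could be sketched as follows. This is a hypothetical illustration, not artoo's API: createContext and its model shape are invented names, and $ is assumed to be jQuery-like (each, find, attr, text), such as what cheerio.load returns.

```javascript
// Hypothetical sketch: `createContext` is NOT part of artoo. It only shows how
// a scraper can close over one `$` instead of relying on a shared global one.
// `$` is assumed to be jQuery-like, e.g. the function returned by cheerio.load.
function createContext($) {
  return {
    // model: { key: {sel: 'css', attr: 'name'} }
    // an empty object ({}) means "text of the matched item itself"
    scrape: function (selector, model) {
      var results = [];

      $(selector).each(function (i, el) {
        var $el = $(el);
        var item = {};

        Object.keys(model).forEach(function (key) {
          var spec = model[key];
          var $target = spec.sel ? $el.find(spec.sel) : $el;

          item[key] = spec.attr ? $target.attr(spec.attr) : $target.text();
        });

        results.push(item);
      });

      return results;
    }
  };
}

// Usage, assuming cheerio is installed and htmlA / htmlB are HTML strings:
//   var cheerio = require('cheerio');
//   var ctxA = createContext(cheerio.load(htmlA));
//   var ctxB = createContext(cheerio.load(htmlB));
//   var model = {title: {}, url: {sel: 'a', attr: 'href'}};
//   var dataA = ctxA.scrape('ul > li', model);
//   var dataB = ctxB.scrape('ul > li', model);
```

Because each context only touches the $ it was created with, two documents can be scraped side by side without a shared setContext call.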

Yomguithereal commented 9 years ago

Hello @oveddan. For the case where you want to scrape from multiple contexts, I usually prefer the following:

var artoo = require('artoo-js'),
    cheerio = require('cheerio');

artoo.bootstrap(cheerio);

var $ = cheerio.load('whatever');

// Then either
var data = $(sel).scrape({...});
// or
var data = artoo.scrape($(sel), {...});