medialab / artoo

artoo.js - the client-side scraping companion.
http://medialab.github.io/artoo/
MIT License

Add an example of use with Node.js #166

Closed eric-brechemier closed 9 years ago

eric-brechemier commented 9 years ago

I am currently teaching myself how to use artoo from the command line; it would be nice to add an example of use in the online documentation.

As a starting point, I will post the steps that I followed in the comments, working on a real-life example. My task is to scrape author profile details from 228 HTML files in the same format, saved locally as chapter*-ca*-profile/*.html in this folder: https://github.com/medea-project/ipcc-fact-checking/tree/master/mitigation2014.org/2014-ipcc-ar5-wg3

I could scrape the information of interest in a browser, but repeating the procedure 228 times would be cumbersome.

I expect to prepare the scraping instructions on one file interactively running artoo in a browser, then apply the same steps running artoo from the command line in Node.js.

eric-brechemier commented 9 years ago

Installation

I could have installed Node.js from the official download. Working on Mac OSX with brew already installed, I chose to install node with brew instead.

brew install node
node -v
npm -v

The name of the npm package for artoo is artoo-js. I chose to install artoo locally, in an ancestor folder of the directory with my data sets:

# go to my workspace folder
cd ../../../../..
# install artoo locally in node_modules folder here
npm install artoo-js
# go back to the folder with data sets
cd medea/data/ipcc-fact-checking/mitigation2014.org/2014-ipcc-ar5-wg3
# check artoo installation
node -e "console.log( require('artoo-js').version )"
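
Installing in an ancestor folder works because Node.js looks for a node_modules folder in the current directory and then in each parent directory in turn. The sketch below is a simplified illustration of that search order, not the actual resolver; the helper name nodeModulesCandidates and the /workspace/medea/data path are made up for the example.

```js
const path = require('path');

// Simplified sketch of the lookup: list every node_modules folder from
// startDir up to the filesystem root, in the order they would be searched.
// (Hypothetical helper; the real resolver has additional rules.)
function nodeModulesCandidates(startDir) {
  const candidates = [];
  let dir = startDir;
  while (true) {
    candidates.push(path.posix.join(dir, 'node_modules'));
    const parent = path.posix.dirname(dir);
    if (parent === dir) break; // reached the root
    dir = parent;
  }
  return candidates;
}

// Illustrative absolute path, not the real workspace layout.
console.log(nodeModulesCandidates('/workspace/medea/data'));
```

For an actual module name, require.resolve.paths('artoo-js') reports the folders Node.js would really search.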
eric-brechemier commented 9 years ago

Loading the HTML

On Node.js, artoo is built on top of cheerio, which is a lightweight alternative to JSDOM, replacing the DOM API with a subset of jQuery.

cheerio expects the HTML to be provided as a string in a call to cheerio.load("...")

Although cheerio is a dependency of artoo-js, it must also be installed explicitly to allow require('cheerio') to find it in node_modules:

# go back to ancestor workspace folder
cd ../../../../..
npm install cheerio
# return to data sets folder
cd medea/data/ipcc-fact-checking/mitigation2014.org/2014-ipcc-ar5-wg3
node -e "console.log( require('cheerio').version )"
eric-brechemier commented 9 years ago

Listing HTML files

The File System API of Node.js is very low level.

To simplify the task, let's install the glob module from npm:

# go back to ancestor workspace folder
cd ../../../../..
npm install glob
ls node_modules
# return to data sets folder
cd medea/data/ipcc-fact-checking/mitigation2014.org/2014-ipcc-ar5-wg3
# print the relative path of all HTML files of interest
node -e "require('glob')('chapter*-ca*-profile/*.html',function(err,matches){ console.log(matches) })"
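
For comparison, here is a rough sketch of the same listing with only the built-in fs and path modules, which shows how much work glob saves. The helper name listProfileHtml is hypothetical, and the regular expression only approximates the pattern chapter*-ca*-profile/*.html.

```js
const fs = require('fs');
const path = require('path');

// Hypothetical helper: approximate 'chapter*-ca*-profile/*.html' with fs only.
function listProfileHtml(rootDir) {
  const folderPattern = /^chapter.*-ca.*-profile$/;
  const matches = [];
  fs.readdirSync(rootDir, { withFileTypes: true })
    .filter(entry => entry.isDirectory() && folderPattern.test(entry.name))
    .forEach(folder => {
      fs.readdirSync(path.join(rootDir, folder.name))
        // like the glob pattern, skip hidden files and keep only *.html
        .filter(name => name.endsWith('.html') && !name.startsWith('.'))
        .forEach(name => matches.push(folder.name + '/' + name));
    });
  return matches.sort();
}
```

Calling listProfileHtml('.') from the data sets folder should then return relative paths similar to those printed by the glob one-liner.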
eric-brechemier commented 9 years ago

Preparing Scraping Instructions

I then opened one of the 228 HTML pages in a browser (Firefox) and loaded artoo into the page using its bookmarklet.

Using the HTML inspector of Firebug, I selected the nodes of interest visually in the page and copied their CSS path from the contextual menu of the node highlighted in the DOM tree. I then shortened each selector by reading it from the end and cutting it at the first id selector encountered:

// before
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content h1
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content div.person_content p span
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content div.person_content p span
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content div.person_content p span
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content div.roles a.link-category

// after
#content h1
#content div.person_content p span
#content div.person_content p span
#content div.person_content p span
#content div.roles a.link-category
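
The shortening step can be sketched as a small function, assuming the steps of the copied path are separated by spaces as in the paths above. The helper name shortenSelector is made up for illustration, and the input below is an abbreviated version of the real path.

```js
// Hypothetical helper: keep a CSS path from its last id selector onward,
// reducing that first step to the bare id (e.g. "div#content" -> "#content").
function shortenSelector(cssPath) {
  const steps = cssPath.trim().split(/\s+/);
  for (let i = steps.length - 1; i >= 0; i--) {
    const id = steps[i].match(/#[\w-]+/);
    if (id) {
      return [id[0]].concat(steps.slice(i + 1)).join(' ');
    }
  }
  return cssPath.trim(); // no id found: leave the path unchanged
}

// Abbreviated input for readability.
console.log(shortenSelector(
  'html.js body div#clouds div#main div div#content div.person_content p span'
));
// prints "#content div.person_content p span"
```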

In the Firebug console, I then prepared the following scraping instructions incrementally:

artoo.scrape("#content",{
  "Name": {sel:"h1"},
  "Organization": {sel:"div.person_content p span:eq(0)"},
  "Affiliation": {sel:"div.person_content p span:eq(1)"},
  "Citizenship": {sel:"div.person_content p span:eq(2)"},
  "Roles": {sel:"div.roles a.link-category",method:function(){
    return this.map(function(a){
      return a.firstChild.nodeValue;
    })
  }}
});

I customized the selectors for Organization, Affiliation and Citizenship using the jQuery :eq() selector to pick the first (offset 0), second (offset 1) and third (offset 2) of the matched nodes, respectively.

The list of Roles is an array; I expect to drop it before exporting to CSV since this information is already available separately. An alternative would have been to convert this list to a string with an additional separator, e.g. "|".
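
A minimal sketch of that alternative, assuming the scraped record holds Roles as an array of strings. The helper name flattenRoles and the sample record are made up for illustration.

```js
// Hypothetical helper: replace the Roles array with a single delimited string.
function flattenRoles(record, separator) {
  const sep = separator === undefined ? '|' : separator;
  return Object.assign({}, record, { Roles: record.Roles.join(sep) });
}

// Made-up record for illustration; not taken from the scraped data.
const sample = { Name: 'Jane Doe', Roles: ['Role A', 'Role B'] };
console.log(flattenRoles(sample).Roles); // prints "Role A|Role B"
```

The separator should be a character that never occurs in a role name, otherwise the list cannot be split back reliably.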

Running these scraping instructions in Node.js using artoo-js and cheerio fails with the error SyntaxError: unmatched pseudo-class :eq. This is not too surprising, since :eq() is documented as "a jQuery extension and not part of the CSS specification", while cheerio only claims to implement a subset of core jQuery and also states:

(...) This selector method is the starting point for traversing and manipulating the document. Like jQuery, it's the primary method for selecting elements in the document, but unlike jQuery it's built on top of the CSSSelect library, which implements most of the Sizzle selectors.

I thus rewrote the CSS selectors to use the standard :nth-of-type() (1-based) selector after the p elements instead of the non-standard jQuery extension :eq() (0-based) selector at the end of the expression. Removing scraping of Roles, the instructions become:

artoo.scrape("#content",{
  "Name": {sel:"h1"},
  "Organization": {sel:"div.person_content p:nth-of-type(1) span"},
  "Affiliation": {sel:"div.person_content p:nth-of-type(2) span"},
  "Citizenship": {sel:"div.person_content p:nth-of-type(3) span"}
});
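
The index shift between the two forms can be sketched as follows. The helper name eqToNthOfType is hypothetical, and the rewrite only selects the same nodes when each p holds a single span, which seems to be the case in these profile pages.

```js
// Hypothetical helper: move a trailing 0-based :eq(n) from the last step to a
// 1-based :nth-of-type(n + 1) on the step before it.
function eqToNthOfType(selector) {
  const match = selector.match(/^(.*\S)\s+(\S+):eq\((\d+)\)$/);
  if (!match) return selector; // nothing to rewrite
  const parent = match[1];
  const last = match[2];
  const position = Number(match[3]) + 1;
  return parent + ':nth-of-type(' + position + ') ' + last;
}

console.log(eqToNthOfType('div.person_content p span:eq(1)'));
// prints "div.person_content p:nth-of-type(2) span"
```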
Yomguithereal commented 9 years ago

Hello @eric-brechemier. Note that artoo's Node.js version is very experimental and only supports the scrape, scrapeOne and scrapeTable methods through a cheerio selection. The documentation has not been written yet but should be available in the near future.

Can you assert that you just need those methods and won't need browser javascript execution to scrape your files?

eric-brechemier commented 9 years ago

> Can you assert that you just need those methods and won't need browser javascript execution to scrape your files?

Yes, just artoo.scrape() and artoo.helpers.toCSVString() basically.

Yomguithereal commented 9 years ago

toCSVString is not currently available through node, but it should be easy to add. If you need it, though, I'll have to modify the lib and won't be able to push to npm any time soon, so you'll have to install the package through git: npm i git+https....

Also, I should probably brief you on how to use the scraping utilities with cheerio.

eric-brechemier commented 9 years ago

@Yomguithereal good to know. So, no toCSVString then; I can do without it.

What should I expect of scrape() running in Node.js?

Yomguithereal commented 9 years ago

Here is what you need to know:

var cheerio = require('cheerio');

var $ = cheerio.load('my-xml-string');

// If you require artoo-js, some methods will be bootstrapped automatically on
// cheerio instances:
var artoo = require('artoo-js');

$('ul > li').scrape(params);

// Else you can do
artoo.scrape($('ul > li'), params);

// Or set current artoo cheerio context
artoo.setContext($);

// And use it like in the browser
artoo.scrape('ul > li', params);

Do you have a place where you will be storing the code you are writing?

eric-brechemier commented 9 years ago

@Yomguithereal Cool, thanks. I'm on the right track then.

eric-brechemier commented 9 years ago

The final script looks like this:

/*
  Scrape CA profiles
  Read: chapter*-ca*-profile/*.html
  Write: chapter*-ca*-profile/data.csv

  Dependencies (to install with npm):
  artoo-js, cheerio, glob

  Run (from the script folder, with Node.js):
  node scrape-ca-profiles.js

  Background Story:
  https://github.com/medialab/artoo/issues/166
*/

var artoo = require('artoo-js');
(function(){
  // export artoo to global context
  this.artoo = artoo;
})();
// load additional methods in artoo.helpers
// including artoo.helpers.toCSVString()
require('artoo-js/src/artoo.helpers.js');

var glob = require('glob');
var fs = require('fs');
var cheerio = require('cheerio');
var path = require('path');

glob("*-ca*-profile/*.html", function(err,matches){
  matches.forEach(function(inputFileName){
    console.log("Read: "+inputFileName);
    var fileText = fs.readFileSync(inputFileName,{encoding:'utf8'});

    console.log("Parse: "+fileText.slice(0,50)+"...");
    var $ = cheerio.load(fileText);
    artoo.setContext($);

    var data = artoo.scrape("#content",{
      "Name": {sel:"h1"},
      "Organization": {sel:"div.person_content p:nth-of-type(1) span"},
      "Affiliation": {sel:"div.person_content p:nth-of-type(2) span"},
      "Citizenship": {sel:"div.person_content p:nth-of-type(3) span"}
    });
    var csv = artoo.helpers.toCSVString(data);
    console.log("Scraped: "+csv);

    var outputFileName = path.dirname(inputFileName)+path.sep+'data.csv';
    console.log("Save: "+outputFileName);
    fs.writeFileSync(outputFileName,csv,{encoding:'utf8'});
  });
  console.log("Complete");
});

It runs from the same folder as the script, at the same level as the data sets folders, with:

node scrape-ca-profiles.js
Yomguithereal commented 9 years ago

So you do need toCSVString in the end, don't you? I should include some of the most useful helpers in the node version then.

You should highlight the code in your comment by wrapping it like this:

```js
// Some javascript code
console.log('hello');
```

It will render as:

// Some javascript code
console.log('hello');

Also, writing $('#content').scrape(params); will save you the artoo.setContext($) line if you find that one awkward.

eric-brechemier commented 9 years ago

@Yomguithereal Mission accomplished! Thanks a lot for your help.

> So you do need toCSVString in the end, don't you? I should include some of the most useful helpers in the node version then.

Sure, I needed some kind of helper to serialize CSV; I just expected to find one on npm... It turned out that the helper was actually included in the code distributed with the artoo-js npm module, although not in a very straightforward way. I hope that you appreciated my expert monkey-patching of the helper back into the library ;)
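
For reference, a dependency-free serializer can be sketched in a few lines. This is a hypothetical stand-in, not artoo's toCSVString, and its output is not guaranteed to match that helper. It assumes an array of flat objects sharing the keys of the first one; the names escapeCsvField and toCsv and the sample row are made up.

```js
// Hypothetical stand-in for a CSV helper: quote a field only when needed.
function escapeCsvField(value) {
  const text = value === null || value === undefined ? '' : String(value);
  // Quote fields containing a comma, quote or line break; double inner quotes.
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Serialize an array of flat objects, using the keys of the first object
// as the header row.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const lines = [headers.map(escapeCsvField).join(',')];
  rows.forEach(row => {
    lines.push(headers.map(key => escapeCsvField(row[key])).join(','));
  });
  return lines.join('\n');
}

// Made-up row for illustration.
console.log(toCsv([{ Name: 'Doe, Jane', Citizenship: 'France' }]));
```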

> You should highlight your code in your comment (...)

Good point. I will update the examples later today.

> Also, writing $('#content').scrape(params); will save you the artoo.setContext($) line if you find that one awkward.

That's actually fine. I like the way it makes it explicit that the context document changes in each iteration.

Yomguithereal commented 9 years ago

I've just pushed a commit that gives access to more helpers such as toCSVString. To install the latest version of artoo for node through git, you can use the following command:

npm i git+https://github.com/medialab/artoo.git

You should consider using a package.json file to indicate the dependencies of your script, by the way.

eric-brechemier commented 9 years ago

I have updated the code samples for syntax highlighting.

Yomguithereal commented 9 years ago

I added the Node.js part of the documentation. Could you please tell me if you find it helpful enough?

eric-brechemier commented 9 years ago

Thanks a lot @Yomguithereal. The documentation is sufficient to be helpful. There is not much missing. Please find my reading notes below:

Scraping with cheerio and artoo.js

The example in Usage should probably read myHTMLString instead of myXMLString, especially since the selectors in the same example match HTML elements. And it would help to include the actual reading of the HTML, either from a URL or a local file, to give a hint for developers coming from the front-end and not used to doing this part.
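
As a hint of what that reading step could look like, here is a minimal sketch using only built-in Node.js modules. The helper names readHtml and fetchHtml are made up, and the URL variant ignores redirects and HTTP status codes. Either way, the result is a plain string that can be passed to cheerio.load().

```js
const fs = require('fs');
const https = require('https');

// Hypothetical helper: read a local HTML file into a string.
function readHtml(filePath) {
  return fs.readFileSync(filePath, { encoding: 'utf8' });
}

// Hypothetical helper: download a page over HTTPS and pass its body to
// callback(error, html). Redirects and status codes are not handled.
function fetchHtml(url, callback) {
  https.get(url, response => {
    response.setEncoding('utf8');
    let body = '';
    response.on('data', chunk => { body += chunk; });
    response.on('end', () => callback(null, body));
  }).on('error', callback);
}
```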

Also, it is not clear that 1), 2) and 3) are alternatives. At first, they look like three consecutive steps. It might be clearer if you separate the example into three, or if you just keep the one form that you want to promote and only mention the other forms in passing in the text below, like you did with scrapeOne and scrapeTable.

> Most of the library's helpers (...)

I would rather have a limited list of supported helpers than a vague statement here. Or the list of helpers which are not supported (because they are not relevant).

> You can access their paths in node likewise if needed:

I don't get this part. And the example doesn't show much: are artoo.paths.chrome and artoo.paths.phantom strings? What is the intended usage?