Closed eric-brechemier closed 9 years ago
Installation
I could have installed Node.js from the official download. Working on Mac OSX with brew
already installed, I chose to install node with brew instead.
brew install node
node -v
npm -v
The name of the npm package for artoo is artoo-js. I chose to install artoo locally, in an ancestor folder of the directory with my data sets:
# go to my workspace folder
cd ../../../../..
# install artoo locally in node_modules folder here
npm install artoo-js
# go back to the folder with data sets
cd medea/data/ipcc-fact-checking/mitigation2014.org/2014-ipcc-ar5-wg3
# check artoo installation
node -e "console.log( require('artoo-js').version )"
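Installing in an ancestor folder works because Node.js looks for a node_modules folder in the current directory, then in each parent directory up to the root. A minimal sketch of that walk, with a hypothetical helper name (nodeModulesLookupPaths) and ignoring the special cases handled by Node.js itself, such as global folders:

```js
const path = require('path');

// Hypothetical helper (not part of Node.js or artoo): list the node_modules
// folders that would be tried when starting from startDir and walking up
// to the filesystem root. Simplified compared to the real resolution rules.
function nodeModulesLookupPaths(startDir) {
  const candidates = [];
  let dir = path.resolve(startDir);
  while (true) {
    candidates.push(path.join(dir, 'node_modules'));
    const parent = path.dirname(dir);
    if (parent === dir) break; // reached the root
    dir = parent;
  }
  return candidates;
}

// Print the candidate folders for the current working directory
console.log(nodeModulesLookupPaths(process.cwd()));
```

Run from the data sets folder, the list would include the node_modules folder of the workspace five levels up, which is where artoo-js was installed.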
Loading the HTML
On Node.js, artoo is built on top of cheerio, which is a lightweight alternative to JSDOM, replacing the DOM API with a subset of jQuery.
cheerio expects the HTML to be provided as a string in a call to cheerio.load("...").
Although cheerio is a dependency of artoo-js, it must also be installed explicitly to allow require('cheerio') to find it in node_modules:
# go back to ancestor workspace folder
cd ../../../../..
npm install cheerio
# return to data sets folder
cd medea/data/ipcc-fact-checking/mitigation2014.org/2014-ipcc-ar5-wg3
node -e "console.log( require('cheerio').version )"
Listing HTML files
The File System API of Node.js is very low level.
To simplify the task, let's install the glob module from npm:
# go back to ancestor workspace folder
cd ../../../../..
npm install glob
ls node_modules
# return to data sets folder
cd medea/data/ipcc-fact-checking/mitigation2014.org/2014-ipcc-ar5-wg3
# print the relative path of all HTML files of interest
node -e "require('glob')('chapter*-ca*-profile/*.html',function(err,matches){ console.log(matches) })"
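To illustrate what the glob module saves, here is a minimal sketch of the same listing written with the built-in fs module only. The helper name listProfilePages is hypothetical, and the regular expressions only approximate the glob pattern chapter*-ca*-profile/*.html:

```js
const fs = require('fs');
const path = require('path');

// Hypothetical helper: list <rootDir>/chapter*-ca*-profile/*.html using only
// built-in modules, returning paths relative to rootDir.
function listProfilePages(rootDir) {
  const matches = [];
  fs.readdirSync(rootDir).sort().forEach(function (folderName) {
    const folderPath = path.join(rootDir, folderName);
    const isProfileFolder = /^chapter.*-ca.*-profile$/.test(folderName);
    if (!isProfileFolder || !fs.statSync(folderPath).isDirectory()) return;
    fs.readdirSync(folderPath).sort().forEach(function (fileName) {
      // like the glob pattern *.html, skip names starting with a dot
      if (/^[^.].*\.html$/.test(fileName)) {
        matches.push(path.join(folderName, fileName));
      }
    });
  });
  return matches;
}

// Print the relative path of all matching HTML files under the current folder
console.log(listProfilePages(process.cwd()));
```

Two nested directory reads and two hand-written patterns are needed for what glob expresses in a single string, which is the motivation for installing it.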
Preparing Scraping Instructions
I then opened one of the 228 HTML pages in a browser (Firefox) and loaded artoo into the page using its bookmarklet.
Using the HTML inspector of Firebug, I selected nodes of interest visually in the page, copied their CSS path in the contextual menu of the node highlighted in the DOM tree, then simplified the selector to make it shorter, starting from the end until the first id selector is met:
// before
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content h1
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content div.person_content p span
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content div.person_content p span
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content div.person_content p span
html.js.flexbox.canvas.canvastext.webgl.no-touch.geolocation.postmessage.no-websqldatabase.indexeddb.hashchange.history.draganddrop.websockets.rgba.hsla.multiplebgs.backgroundsize.borderimage.borderradius.boxshadow.textshadow.opacity.cssanimations.csscolumns.cssgradients.no-cssreflections.csstransforms.csstransforms3d.csstransitions.fontface.generatedcontent.video.audio.localstorage.sessionstorage.webworkers.applicationcache.svg.inlinesvg.smil.svgclippaths body.template-document_view.portaltype-document.site-Plone.section-front-page.icons-on.userrole-manager.userrole-authenticated.userrole-owner div#clouds div#visual-portal-wrapper div#portal-columns.row div#main div#big_content div#portal-column-content.cell.width-3:4.position-1:4 div div#content div.roles a.link-category
// after
#content h1
#content div.person_content p span
#content div.person_content p span
#content div.person_content p span
#content div.roles a.link-category
In the Firebug console, I then prepared the following scraping instructions incrementally:
artoo.scrape("#content",{
  "Name": {sel:"h1"},
  "Organization": {sel:"div.person_content p span:eq(0)"},
  "Affiliation": {sel:"div.person_content p span:eq(1)"},
  "Citizenship": {sel:"div.person_content p span:eq(2)"},
  "Roles": {sel:"div.roles a.link-category",method:function(){
    return this.map(function(a){
      return a.firstChild.nodeValue;
    })
  }}
});
I customized the selectors for Organization, Affiliation and Citizenship using the jQuery :eq() selector to select the first (offset 0), second (offset 1) and third (offset 2) of the nodes matched respectively.
The list of Roles is an array; I expect to drop it before exporting to CSV since this information is already available separately. An alternative would have been to convert this list to a string with an additional separator, e.g. "|".
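A minimal sketch of that alternative, assuming the scraped roles are plain strings that never contain the separator themselves. The helper name rolesToString and the sample role names are hypothetical, made up for illustration:

```js
// Hypothetical helper: flatten the array of roles into a single string
// so that it fits in one CSV cell. The separator defaults to "|".
function rolesToString(roles, separator) {
  return roles.join(separator === undefined ? '|' : separator);
}

// Made-up sample values, not taken from the actual pages
const sampleRoles = ['Role A', 'Role B'];
console.log(rolesToString(sampleRoles));
```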
Running these scraping instructions in Node.js using artoo-js and cheerio fails with the error SyntaxError: unmatched pseudo-class :eq. This is not too surprising since :eq() is documented as "a jQuery extension and not part of the CSS specification", while cheerio claims only to implement a subset of core jQuery, and also states:
(...) This selector method is the starting point for traversing and manipulating the document. Like jQuery, it's the primary method for selecting elements in the document, but unlike jQuery it's built on top of the CSSSelect library, which implements most of the Sizzle selectors.
I thus rewrote the CSS selectors to use the standard :nth-of-type() (1-based) selector after the p elements instead of the non-standard jQuery extension :eq() (0-based) selector at the end of the expression. Removing scraping of Roles, the instructions become:
artoo.scrape("#content",{
  "Name": {sel:"h1"},
  "Organization": {sel:"div.person_content p:nth-of-type(1) span"},
  "Affiliation": {sel:"div.person_content p:nth-of-type(2) span"},
  "Citizenship": {sel:"div.person_content p:nth-of-type(3) span"}
});
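The shift from 0-based offsets to 1-based positions can be sketched as follows. The helper name profileFieldSelector is hypothetical, and the sketch assumes, as in these pages, that each field sits in a span inside its own p element; it only builds the parameters object and does not call artoo:

```js
// Hypothetical helper: build the selector for the span found in the
// paragraph at the given 0-based offset, using the 1-based :nth-of-type().
function profileFieldSelector(offset) {
  return 'div.person_content p:nth-of-type(' + (offset + 1) + ') span';
}

// Rebuild the same scraping parameters as above from a list of field names
const fieldNames = ['Organization', 'Affiliation', 'Citizenship'];
const params = { Name: { sel: 'h1' } };
fieldNames.forEach(function (fieldName, offset) {
  params[fieldName] = { sel: profileFieldSelector(offset) };
});

console.log(params);
```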
Hello @eric-brechemier,
Note that artoo's node.js version is very experimental and will only support the use of the scrape, scrapeOne and scrapeTable methods through a cheerio selection. The documentation has not been written yet but should be available in the near future.
Can you assert that you just need those methods and won't need browser javascript execution to scrape your files?
> Can you assert that you just need those methods and won't need browser javascript execution to scrape your files?

Yes, just artoo.scrape() and artoo.helpers.toCSVString() basically.
toCSVString is not currently available through node, but it should be easy to add. If you need it, though, I'll have to modify the lib and won't be able to push to npm any time soon, so you'll have to install the package through git: npm i git+https...
Also, I should probably brief you on how to use the scraping utilities with cheerio.
@Yomguithereal good to know. So, no toCSVString then; I can do without it.
What should I expect of scrape() running in Node.js?
Here is what you need to know:
var cheerio = require('cheerio');
var $ = cheerio.load('my-xml-string');
// If you require artoo-js, some methods will be bootstrapped automatically on
// cheerio instances:
var artoo = require('artoo-js');
$('ul > li').scrape(params);
// Else you can do
artoo.scrape($('ul > li'), params);
// Or set current artoo cheerio context
artoo.setContext($);
// And use it like in the browser
artoo.scrape('ul > li', params);
Do you have a place where you store the code you are writing?
@Yomguithereal Cool, thanks. I'm on the right track then.
The final script looks like this:
/*
Scrape CA profiles
Read: chapter*-ca*-profile/*.html
Write: chapter*-ca*-profile/data.csv
Dependencies (to install with npm):
artoo-js, cheerio, glob
Run (from the script folder, with Node.js):
node scrape-ca-profiles.js
Background Story:
https://github.com/medialab/artoo/issues/166
*/
var artoo = require('artoo-js');
(function(){
  // export artoo to global context
  this.artoo = artoo;
})();
// load additional methods in artoo.helpers
// including artoo.helpers.toCSVString()
require('artoo-js/src/artoo.helpers.js');
var glob = require('glob');
var fs = require('fs');
var cheerio = require('cheerio');
var path = require('path');
glob("*-ca*-profile/*.html", function(err,matches){
  // stop early if the file listing failed
  if (err) { throw err; }
  matches.forEach(function(inputFileName){
    console.log("Read: "+inputFileName);
    var fileText = fs.readFileSync(inputFileName,{encoding:'utf8'});
    console.log("Parse: "+fileText.slice(0,50)+"...");
    var $ = cheerio.load(fileText);
    artoo.setContext($);
    var data = artoo.scrape("#content",{
      "Name": {sel:"h1"},
      "Organization": {sel:"div.person_content p:nth-of-type(1) span"},
      "Affiliation": {sel:"div.person_content p:nth-of-type(2) span"},
      "Citizenship": {sel:"div.person_content p:nth-of-type(3) span"}
    });
    var csv = artoo.helpers.toCSVString(data);
    console.log("Scraped: "+csv);
    var outputFileName = path.dirname(inputFileName)+path.sep+'data.csv';
    console.log("Save: "+outputFileName);
    fs.writeFileSync(outputFileName,csv,{encoding:'utf8'});
  });
  console.log("Complete");
});
It runs from the same folder as the script, at the same level as the data sets folders, with:
node scrape-ca-profiles.js
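The mapping from each input page to its output file can be sketched in isolation. The helper name outputFileNameFor and the sample folder and file names are hypothetical; path.join is used here in place of the string concatenation with path.sep, which gives the same result for relative paths like the ones returned by glob:

```js
const path = require('path');

// Hypothetical helper mirroring the output path logic of the script:
// data.csv is written next to the HTML file it was scraped from.
function outputFileNameFor(inputFileName) {
  return path.join(path.dirname(inputFileName), 'data.csv');
}

// Made-up sample input path, for illustration only
console.log(outputFileNameFor(path.join('chapter1-ca1-profile', 'index.html')));
```

Since the output name depends only on the folder, two HTML files in the same folder would map to the same data.csv, the second overwriting the first; the script thus seems to assume one profile page per folder.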
So you do need toCSVString in the end, don't you? I should include some of the most useful helpers in the node version then.
You should highlight your code in your comment by wrapping it in a fenced block, like this:
````
```js
// Some javascript code
console.log('hello');
```
````
It will render as:
```js
// Some javascript code
console.log('hello');
```
Also, writing $('#content').scrape(params); will save you the artoo.setContext($) line if you feel this one is awkward.
@Yomguithereal Mission accomplished! Thanks a lot for your help.
> So you do need toCSVString in the end, don't you? I should include some of the most useful helpers in the node version then.

I needed some kind of helper to serialize CSV, sure; I just expected to find what I needed in npm... I found out that the helper was actually included in the code distributed with the artoo-js npm module, although not in a very straightforward way; I hope that you appreciated my expert monkey-patching of the helper back into the library ;)
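For reference, the kind of helper needed here is small. The following is a minimal sketch of a CSV serializer for an array of flat objects, such as the one returned by scrape(); it is a hypothetical stand-in (escapeCsvValue and toCsv are made-up names), not artoo's toCSVString implementation, and it assumes that all rows share the keys of the first row:

```js
// Hypothetical helper: quote a value when it contains a comma, a double
// quote or a line break, doubling any embedded double quotes.
function escapeCsvValue(value) {
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Hypothetical helper: serialize an array of flat objects to CSV text,
// with a header line built from the keys of the first row.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const lines = [headers.map(escapeCsvValue).join(',')];
  rows.forEach(function (row) {
    const cells = headers.map(function (header) {
      return escapeCsvValue(row[header]);
    });
    lines.push(cells.join(','));
  });
  return lines.join('\n');
}

// Made-up sample row, for illustration only
console.log(toCsv([{ Name: 'Doe, Jane', Organization: 'Some "Institute"' }]));
```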
> You should highlight your code in your comment (...)

Good point. I will update the examples later today.

> Also writing $('#content').scrape(params); will save you the artoo.setContext($) line if you feel this one is awkward.

That's actually fine. I like the way it makes it explicit that the context document changes in each iteration.
I've just pushed a commit that gives access to more helpers such as toCSVString.
To install the latest version of artoo for node through git, you can use the following command:
npm i git+https://github.com/medialab/artoo.git
You should consider using a package.json file to indicate the dependencies of your script, by the way.
I have updated the code samples for syntax highlighting.
I added the Node.js part of the documentation. Could you please tell me if you find it helpful enough?
Thanks a lot @Yomguithereal. The documentation is sufficient to be helpful. There is not much missing. Please find my reading notes below:
Scraping with cheerio and artoo.js
The example in Usage should probably read myHTMLString instead of myXMLString, especially since the selectors in the same example match HTML elements. And it would help to include the actual reading of the HTML, either from a URL or a local file, to give a hint for developers coming from the front-end and not used to doing this part.
Also, it is not clear that 1), 2) and 3) are alternatives. At first, they look like three consecutive steps. It might be clearer if you separate the example into three, or if you just keep the one form that you want to promote and only mention the other forms in passing in the text below, like you did with scrapeOne and scrapeTable.
> Most of the library's helpers (...)

I would rather have a limited list of supported helpers than a vague statement here. Or the list of helpers which are not supported (because they are not relevant).

> You can access their paths in node likewise if needed:

I don't get this part. And the example doesn't show much: are artoo.paths.chrome and artoo.paths.phantom strings? What is the intended usage?
I am currently teaching myself how to use artoo from the command line; it would be nice to add an example of use in the online documentation.
As a starting point, I will post the steps that I followed in the comments, working on a real-life example. My task is to scrape author profile details from 228 HTML files in the same format, saved locally as chapter*-ca*-profile/*.html in this folder: https://github.com/medea-project/ipcc-fact-checking/tree/master/mitigation2014.org/2014-ipcc-ar5-wg3
I could scrape the information of interest in a browser, but repeating the procedure 228 times would be cumbersome.
I expect to prepare the scraping instructions on one file interactively running artoo in a browser, then apply the same steps running artoo from the command line in Node.js.