HTML/XML parser and web scraper for Node.js.
Uses native libxml C bindings
Clean promise-like interface
Supports CSS 3.0 and XPath 1.0 selector hybrids
No large dependencies like jQuery, cheerio, or jsdom
Compose deep and complex data structures
HTML parser features
HTML DOM features
HTTP request features
var osmosis = require('osmosis');

osmosis
    .get('www.craigslist.org/about/sites')
    // Each matched link becomes a location, then its href is followed
    .find('h1 + div a')
    .set('location')
    .follow('@href')
    // Same pattern for the category links on each location page
    .find('header + div + div li > a')
    .set('category')
    .follow('@href')
    // Keep requesting the "next" page of results
    .paginate('.totallink + a.button.next:first')
    .find('p > a')
    .follow('@href')
    // Build one object per listing; 'selector@attr' reads an attribute
    .set({
        'title': 'section > h2',
        'description': '#postingbody',
        'subcategory': 'div.breadbox > span[4]',
        'date': 'time@datetime',
        'latitude': '#map@data-latitude',
        'longitude': '#map@data-longitude',
        'images': ['img@src']
    })
    .data(function(listing) {
        // do something with listing data
    })
    .log(console.log)
    .error(console.log)
    .debug(console.log)
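The example above hands each listing to `.data()` one at a time. As a rough sketch of how those results might be gathered behind a single Promise, the snippet below defines two hypothetical helpers, `normalizeListing` and `collectListings`, which are not part of Osmosis. It assumes that values produced by `.set()` arrive as strings, that `.done()` fires once the crawl has finished, and that each command returns a chainable instance; check the documentation before relying on any of these.

```javascript
// Hypothetical helper (not part of Osmosis): convert the string fields from
// the example's .set({...}) call into friendlier types.
function normalizeListing(listing) {
    var copy = Object.assign({}, listing);
    copy.latitude = toNumberOrNull(listing.latitude);
    copy.longitude = toNumberOrNull(listing.longitude);
    // 'images' was requested as an array; fall back to an empty one
    copy.images = Array.isArray(listing.images) ? listing.images : [];
    return copy;
}

// Parse a numeric string, returning null when it is missing or not a number.
function toNumberOrNull(value) {
    var parsed = Number.parseFloat(value);
    return Number.isFinite(parsed) ? parsed : null;
}

// Hypothetical helper (not part of Osmosis): gather everything passed to
// .data() and resolve once .done() reports that the crawl has finished.
// Assumes `scraper` exposes chainable .data(), .error() and .done().
function collectListings(scraper) {
    return new Promise(function (resolve) {
        var listings = [];
        var errors = [];

        scraper
            .data(function (listing) {
                listings.push(normalizeListing(listing));
            })
            .error(function (message) {
                // Record errors instead of rejecting, so one failed page
                // does not discard the listings already collected.
                errors.push(message);
            })
            .done(function () {
                resolve({ listings: listings, errors: errors });
            });
    });
}

// Usage, assuming `scraper` is the chain from the example built up to .set({...}):
// collectListings(scraper).then(function (result) {
//     console.log(result.listings.length, 'listings,', result.errors.length, 'errors');
// });
```

Resolving with both arrays, rather than rejecting on the first error, is a design choice for long crawls where a few failed pages are expected; a stricter caller could reject whenever `errors` is non-empty.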
For documentation and examples, check out https://rchipka.github.io/node-osmosis/global.html
Please consider a donation if you depend on web scraping and Osmosis makes your job a bit easier. Your contribution allows me to spend more time making this the best web scraper for Node.