matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.88k stars 348 forks source link

x-ray

Last version Build Status Coverage Status Dependency status Dev Dependencies Status NPM Status Node version OpenCollective OpenCollective Gitter

var Xray = require('x-ray')
var x = Xray()

x('https://blog.ycombinator.com/', '.post', [
  {
    title: 'h1 a',
    link: '.article-title@href'
  }
])
  .paginate('.nav-previous a@href')
  .limit(3)
  .write('results.json')

Installation

npm install x-ray

Features

Selector API

xray(url, selector)(fn)

Scrape the url for the following selector, returning an object in the callback fn. The selector takes an enhanced jQuery-like string that is also able to select on attributes. The syntax for selecting on attributes is selector@attribute. If you do not supply an attribute, the default is selecting the innerText.

Here are a few examples:

xray('http://google.com', 'title')(function(err, title) {
  console.log(title) // Google
})
xray('http://reddit.com', '.content')(fn)
xray('http://techcrunch.com', 'img.logo@src')(fn)
xray('http://news.ycombinator.com', 'body@html')(fn)

xray(url, scope, selector)

You can also supply a scope to each selector. In jQuery, this would look something like this: $(scope).find(selector).

xray(html, scope, selector)

Instead of a url, you can also supply raw HTML and all the same semantics apply.

var html = '<body><h2>Pear</h2></body>'
x(html, 'body', 'h2')(function(err, header) {
  header // => Pear
})

API

xray.driver(driver)

Specify a driver to make requests through. Available drivers include:

xray.stream()

Returns Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:

var app = require('express')()
var x = require('x-ray')()

app.get('/', function(req, res) {
  var stream = x('http://google.com', 'title').stream()
  stream.pipe(res)
})

xray.write([path])

Stream the results to a path.

If no path is provided, then the behavior is the same as .stream().

xray.then(cb)

Constructs a Promise object and invoke its then function with a callback cb. Be sure to invoke then() at the last step of xray method chaining, since the other methods are not promisified.

x('https://dribbble.com', 'li.group', [
  {
    title: '.dribbble-img strong',
    image: '.dribbble-img [data-src]@data-src'
  }
])
  .paginate('.next_page@href')
  .limit(3)
  .then(function(res) {
    console.log(res[0]) // prints first result
  })
  .catch(function(err) {
    console.log(err) // handle error in promise
  })

xray.paginate(selector)

Select a url from a selector and visit that page.

xray.limit(n)

Limit the amount of pagination to n requests.

xray.abort(validator)

Abort pagination if validator function returns true. The validator function receives two arguments:

xray.delay(from, [to])

Delay the next request between from and to milliseconds. If only from is specified, delay exactly from milliseconds.

var x = Xray().delay('1s', '10s')

xray.concurrency(n)

Set the request concurrency to n. Defaults to Infinity.

var x = Xray().concurrency(2)

xray.throttle(n, ms)

Throttle the requests to n requests per ms milliseconds.

var x = Xray().throttle(2, '1s')

xray.timeout (ms)

Specify a timeout of ms milliseconds for each request.

var x = Xray().timeout(30)

Collections

X-ray also has support for selecting collections of tags. While x('ul', 'li') will only select the first list item in an unordered list, x('ul', ['li']) will select all of them.

Additionally, X-ray supports "collections of collections" allowing you to smartly select all list items in all lists with a command like this: x(['ul'], ['li']).

Composition

X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:

Crawling to another site

var Xray = require('x-ray')
var x = Xray()

x('http://google.com', {
  main: 'title',
  image: x('#gbar a@href', 'title') // follow link to google images
})(function(err, obj) {
  /*
  {
    main: 'Google',
    image: 'Google Images'
  }
*/
})

Scoping a selection

var Xray = require('x-ray')
var x = Xray()

x('http://mat.io', {
  title: 'title',
  items: x('.item', [
    {
      title: '.item-content h2',
      description: '.item-content section'
    }
  ])
})(function(err, obj) {
  /*
  {
    title: 'mat.io',
    items: [
      {
        title: 'The 100 Best Children\'s Books of All Time',
        description: 'Relive your childhood with TIME\'s list...'
      }
    ]
  }
*/
})

Filters

Filters can specified when creating a new Xray instance. To apply filters to a value, append them to the selector using |.

var Xray = require('x-ray')
var x = Xray({
  filters: {
    trim: function(value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function(value) {
      return typeof value === 'string'
        ? value
            .split('')
            .reverse()
            .join('')
        : value
    },
    slice: function(value, start, end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
})

x('http://mat.io', {
  title: 'title | trim | reverse | slice:2,3'
})(function(err, obj) {
  /*
  {
    title: 'oi'
  }
*/
})

Examples

In the Wild

Resources

Backers

Support us with a monthly donation and help us continue our activities. [Become a backer]

Sponsors

Become a sponsor and get your logo on our website and on our README on Github with a link to your site. [Become a sponsor]

License

MIT