matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Passing cookies or request headers #106

Closed by patrickarlt 8 years ago

patrickarlt commented 9 years ago

This might be a duplicate of https://github.com/lapwinglabs/x-ray/issues/91.

I'm using x-ray to build a link checker for a large production site. It is working great, but I can't use it to test our development site because we keep it behind a password-protected splash screen.

If I could set a cookie when using x-ray, I could make this work. Digging around the code a little, I see you're setting headers at https://github.com/lapwinglabs/x-ray-crawler/blob/03b89901e9857925d80a0e5b80fdbe297510789b/lib/http-driver.js#L26 but I can't figure out where those headers are coming from.

ghost commented 8 years ago

You probably have already solved this, but here's my solution for anybody who might stumble here.

Switch the driver to something you have control over. Here's a quick and dirty example; I hope it's clear.

var R = require('ramda'),
    Promise = require('bluebird'),
    xray = require('x-ray'),
    _request = require('request');

var request = _request.defaults({
    jar: true, 
    headers: {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
});

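// Promisified request: resolves with the response body, rejects on a request error.
var prequest = (opts) => new Promise((resolve, reject) =>
    request(opts, (err, res, body) => R.isNil(err) ? resolve(body) : reject(err)));

// x-ray request "driver" wow so pompous much pretentiousness
// Takes default request options and returns a function that fetches ctx.url with them.
var request_driver = (config) => {
    var options = config || {};
    return (ctx) => prequest(R.merge(options, {uri: ctx.url}));
};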

var pray = (url, selector, def) => {
    return new Promise((resolve, reject) => {
        var x = xray().driver(request_driver());
        x(url, selector, def)((err, obj) => R.isNil(err) ? resolve(obj) : reject(err));
    });
};

pray('http://imgur.com/search?q=doge', 'div.cards', {image: ['img@src']}).then(console.log);

You can use request.defaults or pass the options to request_driver() inside the .driver() call, like this:

var x = xray().driver(request_driver({
    jar: true, 
    headers: {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
}));
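
For the cookie case from the original question, the same options object can probably carry a plain Cookie header. Here is a minimal sketch of building one. withCookies is a hypothetical helper I made up for illustration (it is not part of x-ray or request), and the cookie name and value are placeholders; I'm assuming the values are already safe to put in a header as-is.

```javascript
// Hypothetical helper, not part of x-ray or request: returns a copy of
// `options` with a Cookie header built from a plain object of name/value pairs.
function withCookies(options, cookies) {
    var opts = options || {};
    // Serialize {a: '1', b: '2'} as "a=1; b=2", the format of a Cookie header.
    var cookieHeader = Object.keys(cookies)
        .map(function (name) { return name + '=' + cookies[name]; })
        .join('; ');
    // Copy the existing headers so the caller's object is not mutated.
    var headers = Object.assign({}, opts.headers, {Cookie: cookieHeader});
    return Object.assign({}, opts, {headers: headers});
}

// Placeholder cookie name and value; use whatever your splash screen sets.
var options = withCookies(
    {headers: {'User-Agent': 'link-checker'}},
    {splash_auth: 'secret-token'}
);

// Intended use with the driver defined earlier in this comment:
// var x = xray().driver(request_driver(options));
```

The helper only builds the options, so it can be checked without touching the network. Whether the dev site accepts the cookie depends on how the splash screen validates it.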

Of course the phantom driver by the author might work. I haven't tried it because it depends on stuff that I don't understand :/

Kikobeats commented 8 years ago

Working on that under #51