matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Support Proxy #1

Closed AlbanMinassian closed 5 years ago

AlbanMinassian commented 9 years ago

How can I scrape behind a proxy (with a login and password)? Thanks, AMi44

matthewmueller commented 9 years ago

Hmm... I'm not exactly sure what the proxy setup is, but you'll probably need the http://github.com/lapwinglabs/x-ray-phantom driver. From there you can click around and enter passwords. You can find more docs here: http://github.com/segmentio/nightmare.

If you give me more info, I could try and help further.

dzcpy commented 9 years ago

Thanks for the reply. Can we use superagent-proxy?

dzcpy commented 9 years ago

I've never dived into the code too much, but I'm wondering if something could be done like this:

var request = require('superagent');
var xray = require('x-ray');
require('superagent-proxy')(request);

xray
  .use(request)
  .proxy('...')
  .select([...])

AlbanMinassian commented 9 years ago

    .proxy('http://login:password@web.proxy:8080')
     ^
TypeError: Object #<Xray> has no method 'proxy'
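
For reference, the proxy URL in the snippet above carries the login and password in its userinfo part. A minimal sketch of building such a URL follows. The helper name `proxyUrlWithAuth` is hypothetical and not part of x-ray or superagent-proxy; it only shows that credentials should be percent-encoded so that characters like `@` or `:` do not break the URL.

```javascript
// Hypothetical helper: build an HTTP proxy URL with credentials embedded.
function proxyUrlWithAuth(host, port, login, password) {
  // Percent-encode so characters such as '@' or ':' cannot break the URL.
  var auth = encodeURIComponent(login) + ':' + encodeURIComponent(password);
  return 'http://' + auth + '@' + host + ':' + port;
}

console.log(proxyUrlWithAuth('web.proxy', 8080, 'login', 'p@ss:word'));
// prints "http://login:p%40ss%3Aword@web.proxy:8080"
```

Whether a given proxy client accepts credentials in this form is an assumption to check against that client's documentation.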

matthewmueller commented 9 years ago

Yah, there is no proxy support atm. I haven't had much luck with superagent-proxy yet, though I'm probably doing something wrong.

Running over http works:

var request = require('superagent');
require('superagent-proxy')(request);

request
  .get('http://google.com/')
  .proxy('socks://localhost:9050')
  .end(function(err, res) {
    if (err) throw err;
    console.log(res.text);
  })

but running over https doesn't seem to work:

var request = require('superagent');
require('superagent-proxy')(request);

request
  .get('https://google.com/')
  .proxy('socks://localhost:9050')
  .end(function(err, res) {
    if (err) throw err;
    console.log(res.text);
  })

Error:

Error: write EPROTO 140735185163008:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:../deps/openssl/openssl/ssl/s23_clnt.c:787:

/cc @tootallnate

dzcpy commented 9 years ago

@MatthewMueller superagent-proxy doesn't support SOCKS5 proxies; it can only use SOCKS4a, which is a pretty limited protocol. I'm working on a PR to solve this problem (it's nearly done, and the code you provided above works perfectly; I just need to test a bit more).

dzcpy commented 9 years ago

Just for your info, superagent-proxy supports SOCKS5 proxies now. Do you have any plans to support proxies in the superagent driver?

matthewmueller commented 9 years ago

Thanks for pushing forward on this!

Still a bit more to do:

I updated everything manually to test things out and unfortunately I still can't get the original example working.

var request = require('superagent');
require('superagent-proxy')(request);

request
  .get('https://google.com/')
  .proxy('socks://localhost:9050')
  .end(function(err, res) {
    if (err) throw err;
    console.log(res.text);
  })

Still yielding:

Error: write EPROTO 140735130649344:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:../deps/openssl/openssl/ssl/s23_clnt.c:787:

Now I definitely may have messed up the updating, so it's worth trying yourself. Once I can get this example working, I'll create a repository called x-ray-proxy which will add a method for xray.proxy(...).

/cc @TooTallNate

rotatingJazz commented 9 years ago

@MatthewMueller

I just created a new project, ran npm install superagent and npm install superagent-proxy, launched a local SOCKS5 proxy via OpenSSH, and copy-pasted the code. It worked as expected (I got the Google page HTML output).

So :+1: on that branch! :beer:

matthewmueller commented 9 years ago

@rotatingJazz ahh interesting, I probably did something wrong. Did you try it with Tor?

rotatingJazz commented 9 years ago

@MatthewMueller I don't use Tor and don't want to install it on my machine, so I can't test that, sorry :disappointed:

cyclops24 commented 9 years ago

I did this with the request module and the Freegate software, running under Wine on my Linux machine. See this: https://github.com/request/request#controlling-proxy-behaviour-using-environment-variables And this is my code for setting the proxy:

process.env.HTTP_PROXY = "http://127.0.0.1:8580";

And one question: for now, what is the best solution for using a proxy with x-ray?
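
The environment-variable approach mentioned above can be sketched roughly as follows. This is a simplified illustration, not the request module's actual lookup code: the helper name `proxyFromEnv` is hypothetical, and the fallback from the HTTPS variable to the HTTP one is an assumption made for the sketch.

```javascript
// Hypothetical helper: choose a proxy URL for a target from an env-like object.
function proxyFromEnv(targetUrl, env) {
  var isHttps = /^https:/i.test(targetUrl);
  var httpProxy = env.HTTP_PROXY || env.http_proxy || null;
  if (!isHttps) return httpProxy;
  // Assumption: fall back to the HTTP proxy when no HTTPS-specific one is set.
  return env.HTTPS_PROXY || env.https_proxy || httpProxy;
}

var sampleEnv = { HTTP_PROXY: 'http://127.0.0.1:8580' };
console.log(proxyFromEnv('http://google.com/', sampleEnv));
console.log(proxyFromEnv('https://google.com/', sampleEnv));
console.log(proxyFromEnv('http://google.com/', {}));
```

In real use, `process.env` would be passed as the second argument, and the variable would need to be set before the first request is made.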

matthewmueller commented 9 years ago

@cyclops24 best solution would be to create a driver using superagent-proxy.

Here's an example driver: https://github.com/lapwinglabs/x-ray/blob/master/lib/request.js

Should be really simple. The API may change in the future, but I can help update the proxy driver after the initial push.
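
A minimal sketch of what such a driver might look like. It assumes the driver contract is a function that receives a context with a `url` and reports back through a Node-style callback; that contract and the names `makeProxyDriver` and `fakeRequest` are assumptions for illustration, not confirmed x-ray API. The superagent instance is passed in as a parameter so the sketch runs without installed packages; in real use it would be the `request` object patched by superagent-proxy, as in the earlier snippets.

```javascript
// Hypothetical factory: wraps a superagent-like client in a driver function.
// `request` must already be patched by superagent-proxy so that .proxy() exists.
function makeProxyDriver(request, proxyUrl) {
  return function proxyDriver(ctx, callback) {
    request
      .get(ctx.url)
      .proxy(proxyUrl)
      .end(function (err, res) {
        if (err) return callback(err);
        // Assumed contract: the page HTML goes on ctx.body.
        ctx.body = res.text;
        callback(null, ctx);
      });
  };
}

// Tiny stand-in for superagent so the sketch can be run offline.
function fakeRequest(html, log) {
  return {
    get: function (url) { log.push('GET ' + url); return this; },
    proxy: function (p) { log.push('via ' + p); return this; },
    end: function (cb) { cb(null, { text: html }); }
  };
}

var log = [];
var driver = makeProxyDriver(fakeRequest('<title>Hi</title>', log), 'socks://localhost:9050');
driver({ url: 'http://example.com/' }, function (err, ctx) {
  if (err) throw err;
  console.log(log.join(', '), '->', ctx.body);
});
```

With the real libraries, the result would presumably be handed to x-ray through the same `.driver(...)` call used elsewhere in this thread, for example `Xray().driver(makeProxyDriver(request, 'socks://localhost:9050'))`.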

ralyodio commented 9 years ago

I got a 404 on this link @matthewmueller https://github.com/lapwinglabs/x-ray/blob/master/lib/request.js

kelvinnn commented 7 years ago

@matthewmueller Did you manage to get proxy working via superagent-proxy for the latest build? Could you kindly share some sample code? :)

kelvinu commented 7 years ago

Any update on proxies for X-ray? What works for you guys?

IAmStoxe commented 7 years ago

If utilizing x-ray-nightmare is an option it is not difficult to implement. First grab the driver from here:

Then instantiate the driver like so:

var NightmareElectron = require('x-ray-nightmare');
var Xray = require('x-ray');

var nightmareOptions = {
  switches: {
    'proxy-server': '1.2.3.4:5678', //Proxy here
    'ignore-certificate-errors': true
  }
};

// instantiate driver for later shutdown
var nightmareDriver = NightmareElectron(nightmareOptions);

var x = Xray()
  .driver(nightmareDriver);

x('http://google.com', 'title')(function(err, str) {
  // gracefully shut down the driver, even if the scrape failed
  nightmareDriver();

  if (err) throw err;
  console.log(str); // the page title
})

Read more about Nightmare and the switches available here: https://github.com/segmentio/nightmare/blob/a5e658bf04815bb2c3340fd05d34e2d158f6c7e6/Readme.md#switches

vinaybedre commented 7 years ago

Hi,

The above usage ends with the following error. Any clue?

events.js:141
      throw er; // Unhandled 'error' event
      ^

Error
    at EventEmitter.<anonymous> (/Applications/MAMP/htdocs/nodejs-be/node_modules/x-ray-nightmare/index.js:46:33)
    at emitMany (events.js:113:20)
    at EventEmitter.emit (events.js:182:7)
    at ChildProcess.<anonymous> (/Applications/MAMP/htdocs/nodejs-be/node_modules/x-ray-nightmare/node_modules/nightmare/lib/ipc.js:49:10)
    at emitTwo (events.js:87:13)
    at ChildProcess.emit (events.js:172:7)
    at internal/child_process.js:696:12
    at nextTickCallbackWith0Args (node.js:420:9)
    at process._tickCallback (node.js:349:13)

beautyfree commented 7 years ago

var driver = require('request-x-ray')({
  proxy: '1.2.3.4:5678', // Proxy here
  timeout: 300*1000,
  ca: 'path' // ca if needed
});

var Xray = require('x-ray');

var x = Xray().driver(driver);

x('http://google.com', 'title')(function(err, str) {
  if (err) throw err;
  console.log(str); // the page title
})

lathropd commented 5 years ago

Closing