matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.87k stars, 349 forks

User Agent String Support #168

Open arturnt opened 8 years ago

arturnt commented 8 years ago

Subject of the issue

Not a bug, but an ask. It would be great if we could specify the user agent string or perhaps other headers more easily.

mahlu commented 8 years ago

+1

ayy0 commented 8 years ago

+1

I changed mine by editing \node_modules\superagent\lib\node\index.js.

On line 108 (or search for 'var ua'), change

var ua = 'node-superagent/' + pkg.version;

to

var ua = 'any user agent here';

wildeyes commented 8 years ago

I could help with this feature, but I need a code example of how one might use it.

for example x(...).useragent(UA_STRING)?

0xgeert commented 8 years ago

You could always add a custom driver to do this.
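
A minimal sketch of what such a custom driver might look like, using only Node's built-in http and https modules. It assumes that x-ray calls a driver as driver(ctx, callback), with the target address in ctx.url, and that the callback accepts (error, body); that shape is an assumption based on the http-driver snippet quoted later in this thread, not something confirmed here. The name makeHeaderDriver is hypothetical, and the sketch does not follow redirects or check status codes.

```javascript
const http = require('http')
const https = require('https')

// Hypothetical helper: returns a driver function that sends fixed extra headers.
// Assumption: x-ray invokes drivers as driver(ctx, callback) and reads the page
// from ctx.url; the callback is assumed to take (error, body).
function makeHeaderDriver (headers) {
  return function driver (ctx, callback) {
    // Pick the client module that matches the URL scheme.
    const client = ctx.url.startsWith('https:') ? https : http
    // Headers passed to makeHeaderDriver override any already on the context.
    const merged = { ...ctx.headers, ...headers }

    const req = client.get(ctx.url, { headers: merged }, function (res) {
      const chunks = []
      res.on('data', function (chunk) { chunks.push(chunk) })
      res.on('end', function () {
        callback(null, Buffer.concat(chunks).toString('utf8'))
      })
    })
    // Network-level failures (DNS, refused connection) are reported via the callback.
    req.on('error', callback)
  }
}
```

Under those assumptions it would be registered on an x-ray instance with x.driver(makeHeaderDriver({ 'User-Agent': 'Foo/1.0' })).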


kdekooter commented 7 years ago

How would one add a custom driver?

kelvinu commented 7 years ago

+1. Has anyone got a solution to this? This would be a useful feature to implement. An example of how to swap in a driver that supports a custom user agent would also help.

kdekooter commented 7 years ago

I ended up using request to set the User-Agent header and then parsing the response with cheerio (sorry x-ray team, I had to move on).
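
A rough sketch of that approach, for illustration only. To stay dependency-free it swaps in Node's built-in http and https modules in place of request, and a small regular expression in place of cheerio; both substitutions are stand-ins, and the helper names fetchHtml and extractTitle are hypothetical. A regex is not a real HTML parser, so anything beyond a simple title lookup would still call for cheerio.

```javascript
const http = require('http')
const https = require('https')

// Hypothetical helper: GET a page with a custom User-Agent header and
// resolve with the response body as a string. Redirects are not followed.
function fetchHtml (url, userAgent) {
  const client = url.startsWith('https:') ? https : http
  return new Promise(function (resolve, reject) {
    client.get(url, { headers: { 'User-Agent': userAgent } }, function (res) {
      res.setEncoding('utf8')
      let body = ''
      res.on('data', function (chunk) { body += chunk })
      res.on('end', function () { resolve(body) })
    }).on('error', reject)
  })
}

// Crude stand-in for a cheerio lookup of the <title> element.
// Returns the trimmed title text, or null when no <title> is found.
function extractTitle (html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html)
  return match ? match[1].trim() : null
}
```

Usage would be along the lines of fetchHtml(url, 'Foo/1.0').then(extractTitle).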

kelvinu commented 7 years ago

Makes sense. I might do the same. I appreciate the response. Do you have any sample code that I can leverage?

thanks

kelvinu commented 7 years ago

What did you use to implement pagination? I will have a play; I'm just looking for some direction. Thanks.

kdekooter commented 7 years ago

With my solution you miss out on a lot of x-ray goodies like pagination...

kdekooter commented 7 years ago

As per the README, it should be doable by using request-x-ray (https://github.com/Crazometer/request-x-ray) as the driver.

kelvinu commented 7 years ago

That's great to know, thanks. I will try both ways. Cheerio seems more stable, and it is probably faster to implement pagination myself.

kdekooter commented 7 years ago

This is how it should work:

const Xray = require('x-ray')
const x = Xray()
const requestXray = require('request-x-ray')

const options = {
  method: 'GET',
  headers: {
    'User-Agent': 'Foo/1.0'
  }
}

const driver = requestXray(options)
x.driver(driver)

module.exports = {

  getTitle: function (url) {
    return new Promise(function (resolve, reject) {
      x(url, 'title')(function (error, response) {
        if (error) return reject(error)
        resolve(response)
      })
    })
  }
}

The webserver's access log now shows: xx.xx.xx.xx - - [07/Nov/2016:11:08:33 +0100] "GET / HTTP/1.1" 302 515 "-" "Foo/1.0"

kelvinu commented 7 years ago

I think it's possible to use request. Pass the user agent as a param and then pass the response object to x-ray to parse. However I couldn't seem to get pagination to work after that...

I have been experimenting with cheerio, but the selector syntax is not working. At least, I'm struggling with the syntax for selecting UL LI DIV H5 A href.

On Mon, 7 Nov 2016 at 6:03 PM, Kees de Kooter wrote:

I have tried to make things work with request-x-ray but I run into TypeError: x.driver is not a function

const x = require('x-ray')
const requestXray = require('request-x-ray')

const options = {
  method: 'GET',
  headers: {
    'User-Agent': 'Foo/1.0'
  }
}

const driver = requestXray(options)
x.driver(driver)

module.exports = {

  getTitle: function (url) {
    return new Promise(function (resolve, reject) {
      x(url, 'title')(function (error, response) {
        if (error) reject(error)
        resolve(response)
      })
    })
  }
}


kdekooter commented 7 years ago

@kelvinu please take a look at my latest (edited) post.

kelvinu commented 7 years ago

Let me take a look thx


jacek213 commented 7 years ago

I need to set the user agent while using the phantomjs driver. Is there any way to achieve that?

ken0x0a commented 6 years ago

  1. Copy the driver function from x-ray-crawler/lib/http-driver in node_modules.
  2. Change the request as follows, at line 24:
    agent
      .get(ctx.url)
      .set({
        ...ctx.headers,
        ...{ 'User-Agent': "anything" },
      })

  3. Set the modified function as the driver.

It works :)
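
The header merge in step 2 relies on plain JavaScript object-spread semantics: later entries win. The short sketch below illustrates that behaviour on its own, independent of x-ray or superagent; withUserAgent is a hypothetical helper name. One caveat it shows is that object keys are case-sensitive, so an existing lower-case 'user-agent' key would be kept alongside the new one rather than replaced.

```javascript
// Hypothetical helper: copy the headers and set User-Agent last so it wins.
function withUserAgent (headers, userAgent) {
  return { ...headers, 'User-Agent': userAgent }
}

// An existing 'User-Agent' entry is overridden; other headers are kept.
const merged = withUserAgent({ Accept: 'text/html', 'User-Agent': 'old-agent' }, 'Foo/1.0')
console.log(merged['User-Agent']) // 'Foo/1.0'
console.log(merged.Accept)        // 'text/html'

// Keys differing only in case are distinct, so both survive the merge.
const mixed = withUserAgent({ 'user-agent': 'old-agent' }, 'Foo/1.0')
console.log(Object.keys(mixed))   // [ 'user-agent', 'User-Agent' ]
```

If the incoming headers might use a different capitalisation, normalising the key names before merging avoids sending two competing values.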

jdalrymple commented 4 years ago

      .set({
        ...ctx.headers,
        ...{ 'User-Agent': "anything" },
      })

Does this still work?