matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

.delay / .timeout is undefined? #120

Closed: vodp closed this issue 5 years ago

vodp commented 8 years ago

I can't set delay() or timeout() with x-ray. Here is my code:

var xray = require('x-ray');
var x = xray();

x('http://www.robotshop.com/en/robots-for-the-house.html', '.wrap-thumbnailCatTop', [{
  image: 'a img@src',
  product: '.product-name a@title',
  details: x('.product-name a@href', {
    description: '.aboutContent.clearfix .std',
    reviews: ['h3.review-title'],
    users: ['div.aboutHeaders > h3 + div'],
    comments: ['#product-reviews-list > dd > table + p'],
    date: ['#product-reviews-list > dd > p + p']
  }),
  code: '.product-code',
  ratings: '.ratings .amount a',
  price: '.price-box .regular-price .price'
}])
  .paginate('.pages > ol > li.current + li > a@href')
  .delay(5000)
  .limit(100)
  .write('results_robot4house.json')

Executing node testxray.js gives me

/mambo/scraping/testxray.js:19
  .delay(5000)
   ^
TypeError: undefined is not a function
    at Object.<anonymous> (/mambo/scraping/testxray.js:19:4)
    at Module._compile (module.js:460:26)
    at Object.Module._extensions..js (module.js:478:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:501:10)
    at startup (node.js:129:16)
    at node.js:814:3

anothergituser commented 8 years ago

Yeah, same here. It's not defined yet

andrewtennison commented 8 years ago

+1

Kikobeats commented 8 years ago

These methods are inherited from the crawler. Check these lines of code:

https://github.com/lapwinglabs/x-ray/blob/master/index.js#L39
https://github.com/lapwinglabs/x-ray/blob/master/index.js#L255

If that isn't working, something else is going on. We can verify it quickly with a unit test.

jasonk commented 8 years ago

From reading the code it looks to me like the documentation is at best unclear.

Take a look at this sample code from the documentation:

var Xray = require('x-ray');
var x = Xray();

x('https://dribbble.com', 'li.group', [{
  title: '.dribbble-img strong',
  image: '.dribbble-img [data-src]@data-src',
}])
  .paginate('.next_page@href')
  .limit(3)
  .write('results.json')

And then the API section of the docs refers to these methods:

xray.driver
xray.stream
xray.write
xray.paginate
xray.limit
xray.delay
xray.concurrency
xray.throttle
xray.timeout

This leads you to believe that you can do this:

x('https://dribbble.com', 'li.group', [{
  title: '.dribbble-img strong',
  image: '.dribbble-img [data-src]@data-src',
}])
  .paginate('.next_page@href')
  .limit(3)
  .timeout(30)
  .driver('phantomjs')
  .delay(100)
  .write('results.json')

However, you can't actually do that! In this example code there are actually two different objects in use. One is the object returned by calling Xray() and the other is the object returned by calling x() (which the code refers to as 'node').

You would actually have to write it like this:

x.timeout(30).driver('phantomjs').delay(100)('https://dribbble.com', 'li.group', [{
  title: '.dribbble-img strong',
  image: '.dribbble-img [data-src]@data-src',
}])
  .paginate('.next_page@href')
  .limit(3)
  .write('results.json')

That makes it look a bit confusing, but it makes more sense if you separate out the two objects like this:

var Xray = require('x-ray');
var x = Xray().timeout(30).driver('phantomjs').delay(100);

x('https://dribbble.com', 'li.group', [{
  title: '.dribbble-img strong',
  image: '.dribbble-img [data-src]@data-src',
}])
  .paginate('.next_page@href')
  .limit(3)
  .write('results.json')
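
To make the two-object distinction concrete, here is a minimal sketch that does not use x-ray at all. `makeScraper` and everything inside it are hypothetical stand-ins, assumed only for illustration and not taken from the library's source. The instance carries the crawler-style settings, and calling it returns a separate node that has its own, smaller set of methods.

```javascript
// Hypothetical stand-in for the pattern described above; not x-ray's real code.
function makeScraper() {
  const settings = { delay: 0, timeout: 0 };

  // Calling the instance returns a separate "node" object.
  function instance(url, selector) {
    const node = {
      // Snapshot the instance settings at the time the node is created.
      request: { url, selector, settings: { ...settings }, limit: Infinity },
      // The node only knows about per-request options such as limit().
      limit(n) {
        node.request.limit = n;
        return node;
      },
    };
    return node;
  }

  // Crawler-style settings live on the instance and return it for chaining.
  instance.delay = (ms) => { settings.delay = ms; return instance; };
  instance.timeout = (ms) => { settings.timeout = ms; return instance; };

  return instance;
}

const x = makeScraper().timeout(30).delay(100);
const node = x('https://dribbble.com', 'li.group').limit(3);

console.log(typeof x.delay);    // "function"
console.log(typeof node.delay); // "undefined"
```

In this sketch, calling `node.delay(100)` would throw a TypeError of the same kind as the one reported at the top of this issue, because `delay` exists only on the instance.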

tzehsiang commented 8 years ago

I just ran into the same issue. Thanks for the explanation. It would be good if the documentation could be fixed soon.

kdoggthebus commented 7 years ago

+1 to @tzehsiang

I'm loving this library, but the documentation could use more detail and examples. I see there's been an open pull request for a while now, though. It would be nice to see that merged!

Thanks for the clarification @jasonk