matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Empty URL in a composed instance on a collection kills the rest of the capture #59

Open. Opened by volumetric 9 years ago

volumetric commented 9 years ago

I am using composition of x-ray instances to follow links to other pages and pick up data from them for each item in a collection, something like this:

var Xray = require('x-ray');
var x = Xray();

x('http://example.com', '.abc', [{
  main: 'title',
  // follow the link in each item and scrape the linked page
  image: x('.pqr a@href', {
    key1: '.val1',
    key2: '.val2',
    key3: '.val3'
  })
}])(function(err, obj) {
  console.log(err, obj)
})

I am picking the URLs from elements matching the selector .pqr a@href, but some of those URL values are empty, and when x() is called with an empty URL it fails with the error: [Error: is not a URL]

Because of this, I cannot get the captured values for the remaining items, where .pqr a@href is a valid, non-empty URL. I have not found a way to avoid the x() instance calls on empty URLs.

One possible solution: a call to x() with an empty URL could be quietly skipped instead of throwing an error that aborts the rest of the instance calls.
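As a rough illustration of that "skip instead of throw" idea, here is a minimal sketch in plain Node.js. It does not use or modify x-ray itself: isFollowableUrl and skipInvalidUrls are hypothetical helper names, and follow stands in for any node-style function that takes a URL and a callback. The sketch also assumes that only absolute http/https URLs should be followed, so relative links are skipped as well.

```javascript
// Hypothetical helper (not part of x-ray): returns true only for
// non-empty strings that parse as absolute http(s) URLs.
function isFollowableUrl(value) {
  if (typeof value !== 'string' || value.trim() === '') return false;
  try {
    var parsed = new URL(value);
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch (e) {
    // new URL() throws on strings it cannot parse, e.g. relative paths.
    return false;
  }
}

// Hypothetical wrapper: takes a node-style follow(url, callback) function
// and returns a version that yields null for empty or invalid URLs
// instead of passing them on (and failing).
function skipInvalidUrls(follow) {
  return function (url, callback) {
    if (!isFollowableUrl(url)) {
      return callback(null, null); // no error, no data: the item is skipped
    }
    follow(url, callback);
  };
}

// Usage with a stand-in follow function (assumption: a real one would
// fetch and scrape the page).
var safeFollow = skipInvalidUrls(function (url, callback) {
  callback(null, { source: url });
});

safeFollow('', function (err, result) {
  console.log(err, result); // prints: null null
});
```

With a guard like this, items whose link is empty would come back with a null value for that field while the other items are still captured. Whether null, an empty object, or a missing key is the right placeholder is a design choice left open here.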

I would highly appreciate it if someone could help me with this. Thanks.

matthewmueller commented 9 years ago

Ahh shoot, yah. We should probably not throw and skip in that case. I'd really appreciate a PR for this fix if you have a moment.

I'm currently thinking of ways to make the API a little more robust to handle cases like this, and intermediate parsing. Right now it's sort of all or nothing.

volumetric commented 9 years ago

Hi @matthewmueller, can you tell me where I should be looking? Thanks.

matthewmueller commented 9 years ago

@volumetric here's where the error is:

https://github.com/lapwinglabs/x-ray/blob/master/index.js#L95