matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Unexpected elements break crawl #126

Closed by deathg0d 8 years ago

deathg0d commented 8 years ago

The page I want to crawl looks something like this:

...
<div class="content">
    <div class="item">
        <a href="/url1"></a>
    </div>
    <div class="item">
        <a href="/url2"></a>
    </div>
    <div class="item">
        <a href="/url3"></a>
    </div>
</div>
...

So I restrict x-ray to this snippet and make it follow the URLs:

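var Xray = require('x-ray');
var x = Xray();

// Scope to each item, then follow its link and read div.article on the linked page.
x('/someurl', 'div.content div.item', [{
    description: x('a@href', 'div.article')
}])(function(err, obj) {
    // working fine
});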

But now let's say the snippet is not what we expected, and there is a weird item div in the middle:

...
<div class="content">
    <div class="item">
        <a href="/url1"></a>
    </div>
    <div class="item">
        <script>//this weird guy here</script>
    </div>
    <div class="item">
        <a href="/url3"></a>
    </div>
</div>
...

Instead of ignoring the weird guy and crawling through the other items, the whole process stops with an [Error: undefined is not a URL] error.
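To make the behavior I expected concrete, here is a rough sketch in plain JavaScript. It does not use x-ray at all. followableUrls and the example.com base are made up for illustration, and I'm assuming the middle item yields undefined for a@href because it has no link. The idea is just that items without a usable href get skipped, so the remaining ones can still be followed:

```javascript
// Hypothetical helper, NOT part of x-ray: keep only the hrefs that can be
// resolved against the page URL, and skip items that had no link.
function followableUrls(hrefs, base) {
  const urls = [];
  for (const href of hrefs) {
    // An item without <a href> yields undefined (assumption): skip it.
    if (typeof href !== 'string' || href.trim() === '') continue;
    try {
      // Resolve relative links such as "/url1" against the page URL.
      urls.push(new URL(href, base).href);
    } catch (e) {
      // Malformed href: skip it instead of aborting the whole crawl.
    }
  }
  return urls;
}

// What the three items in the second snippet would give for a@href
// (assumed), with a placeholder base URL.
const hrefs = ['/url1', undefined, '/url3'];
const urls = followableUrls(hrefs, 'http://example.com/someurl');
console.log(urls);
// → [ 'http://example.com/url1', 'http://example.com/url3' ]
```

With a check like this, the weird item costs one missing result and nothing else.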

0xgeert commented 8 years ago

See PR in #112

matthewmueller commented 8 years ago

closed via #112