matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Crawling SyntaxError: Unmatched selector: @href #257

Closed: Janghou closed this issue 5 years ago

Janghou commented 7 years ago

Subject of the issue

Scraping crawled links throws an error instead of returning a collection.

Your environment

Steps to reproduce

var Xray = require('x-ray');
var x = Xray();
var html = '<div> <a href="http://www.google.com">google</a>  <a href="http://www.bing.com"> bing </a> </div>';
// Intent: for each link in the HTML, follow its href and collect the links found on that page.
x(html, 'a', [{
  engine: x('a'),
  links: x('a@href', [{ href: 'a@href', text: 'a' }])
}]).write('result.json');

Expected behaviour / result

 [ { engine: 'google', links:
     [ { href: 'http://www.google.com/imghp?tab=wi', text: 'Images' },
       ...
     ] },
   { engine: 'bing', links:
     [ { href: 'javascript:void(0)' },
       ...
     ] } ]

Actual behaviour

SyntaxError: Unmatched selector: @href

Without the brackets (array/collection), it does return the first link, in the same shape as above:

     links: x('a@href', {href:'a@href',text:'a'})

Also this works:

   x(html,'a', [{
    engine: x('a'),  
    href: x('a@href', ['a@href']),
    text: x('a@href', ['a']),
   }])

This gives two arrays (href and text) as the result, so you would expect that wrapping the original selector in brackets returns a collection of links.
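As a stopgap, the two arrays from the working variant could be paired up by hand. This is only a sketch: `zipLinks` is a hypothetical helper, not part of x-ray, the sample data is made up, and it assumes the `href` and `text` arrays line up index by index, which may not hold if some anchors have no href.

```javascript
// Hypothetical helper (not part of x-ray): pair two parallel arrays
// into a collection of { href, text } objects.
function zipLinks(hrefs, texts) {
  // Stop at the shorter array so no entry is left half-filled.
  var length = Math.min(hrefs.length, texts.length);
  var links = [];
  for (var i = 0; i < length; i++) {
    links.push({ href: hrefs[i], text: texts[i] });
  }
  return links;
}

// Made-up sample standing in for one item returned by the working variant.
var item = {
  engine: 'google',
  href: ['http://example.com/a', 'http://example.com/b'],
  text: ['A', 'B']
};

var links = zipLinks(item.href, item.text);
console.log(links);
```

This does not fix the scoping problem itself; it only reshapes the output that already works.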

AFAICS, it sets the wrong scope on line 218 in index.js. It probably can't parse the sub-scope at the moment, but it would be nice if it could.

Any ideas?

FezVrasta commented 6 years ago

Did you have any luck with this?

kfkhalili commented 6 years ago

Having this same problem now.

dustinschaerer commented 6 years ago

I'm also having the same problem.

lathropd commented 5 years ago

Wouldn’t using just ‘a’ as a selector work? a@href isn’t a valid Cheerio selector.

ghost commented 5 years ago

I recently stumbled upon the same problem. Any potential fixes?

lathropd commented 5 years ago

See above: 'a@href' isn't a selector in that use case. Just use 'a'.
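To illustrate the point about 'a@href' not being a plain CSS selector: the part before the `@` is the CSS selector and the part after it is the attribute to read, so the string has to be split before anything is handed to a CSS engine such as Cheerio. The sketch below shows that split with a hypothetical `splitSelector` helper. It is not x-ray's actual parser, and it is deliberately naive (it would mishandle a CSS attribute selector whose value contains `@`).

```javascript
// Hypothetical helper, not x-ray's real parser: split "selector@attribute"
// into its CSS selector part and its attribute part.
function splitSelector(str) {
  var at = str.lastIndexOf('@');
  if (at === -1) {
    // No "@": the whole string is a CSS selector, nothing to read as attribute.
    return { selector: str, attribute: null };
  }
  return {
    // An empty selector part (as in "@href") means "the current element".
    selector: str.slice(0, at) || null,
    attribute: str.slice(at + 1)
  };
}

console.log(splitSelector('a@href'));
console.log(splitSelector('@href'));
console.log(splitSelector('a'));
```

Only the `selector` part is something a CSS engine can match. If the unsplit string ends up being used as a scope, an error like the one reported is the likely outcome.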