matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.87k stars, 349 forks

Multiple Scoping with one selector #108

Closed alejodiazg closed 5 years ago

alejodiazg commented 8 years ago

I was following this example (scoping selectors):

var Xray = require('x-ray');
var x = Xray();

x('http://mat.io', {
  title: 'title',
  items: x('.item', [{
    title: '.item-content h2',
    description: '.item-content section'
  }])
})(function(err, obj) {
  /*
    {
      title: 'mat.io',
      items: [
        {
          title: 'The 100 Best Children\'s Books of All Time',
          description: 'Relive your childhood with TIME\'s list...'
        }
      ]
    }
  */
})

I thought it meant: inside mat.io, find every .item and add an object with the defined structure to the items array. But it seems to return only the first .item. How do I do something similar for every .item on the page I'm scraping?

I'm using this selector (it's complicated because the page is really messy): "table[width=748]:last-of-type tr:not(:first-of-type)". There are multiple tr elements after the first one, but I only get one back (the second, because my selector skips the first).

My code:

var util = require('util');
var Xray = require('x-ray');
var xray = Xray();

xray("http://www.pearsondental.com/catalog/product.asp?catid=4598&subcatid=12853&majcatid=602&dpt=0&pre_cat_id=&mart=&cat_link=",
    'body',
    [{
        manufacter: 'b font[color=990000]',
        products: xray('table[width=748]:last-of-type tr:not(:first-of-type)', [{
            name: 'td:first-of-type font b',
            price: 'td:nth-of-type(3) font strong',
            mfg_part: 'td.link2:first-of-type font:last-of-type'
        }])
    }])(function(err, obj) {
        // Report a failed scrape before trying to print the result.
        if (err || obj == null) {
            console.log(err);
            return;
        }
        // depth: null prints the nested products array in full.
        console.log(util.inspect(obj, {showHidden: false, depth: null}));
    })

The console.log output:

[ { manufacter: '(Coltene)',
    products:
     [ { name: 'Perm Reline Introductory Pkg. Regular Pink',
         price: '$62.50',
         mfg_part: 'Mfg. Part #: 00335' } ] } ]

gillescastel commented 8 years ago

:+1: Either the items example is misleading, or this is a bug.

nullzion commented 8 years ago

Same problem here: an array of objects returns a single item even though multiple items are available.

damonmcminn commented 8 years ago

I've run into the same issue. If this is a bug, I'm happy to fix it.

Kikobeats commented 8 years ago

@damonmcminn the point is, using a DOM selector, like jQuery, what is the expected output?

alejodiazg commented 8 years ago

@Kikobeats I believe the expected result is an array of all the elements that match the query selector; that's what the example tells us to expect.

Unfortunately I had to stop using x-ray for the scraping I was doing, but I'll be glad to give more feedback, and maybe share the whole code (I think I still have it) if it's needed.

damonmcminn commented 8 years ago

After investigating further, I believe I misunderstood the docs; for me at least, it's user error. The code below works as expected:

var Xray = require('x-ray');
var x = new Xray();

var hn = 'https://news.ycombinator.com';

var a = x(hn, x('.itemlist .athing', [{link: '.title span a@href'}]));
var b = x(hn, '.itemlist .athing', [{link: '.title span a@href'}]);
var multiple = x(hn, { a, b });

multiple((err, data) => console.log(err || data));

Output:

{ a: 
   [ { link: 'https://news.ycombinator.com/from?site=cloud.google.com' },
     { link: 'https://news.ycombinator.com/from?site=samaltman.com' },
     { link: 'https://news.ycombinator.com/from?site=windows.com' } ... ],
  b: 
   [ { link: 'https://news.ycombinator.com/from?site=cloud.google.com' },
     { link: 'https://news.ycombinator.com/from?site=samaltman.com' },
     { link: 'https://news.ycombinator.com/from?site=windows.com' } ... ] }
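
For anyone comparing the two behaviours discussed in this thread, here is a rough toy model of "scope to every match, then build one object per match", which is the result shape both a and b produce above. It is only a sketch under loud assumptions: the page is a hand-written tree of plain objects rather than real HTML, the example.com URLs are placeholders, and findAll, first, and collect are hypothetical helpers written for illustration. None of this is x-ray's API or its actual implementation.

```javascript
// Toy model only: nodes are plain objects, not DOM elements, and the
// helpers below are hypothetical illustrations, not x-ray functions.
const page = {
  cls: 'itemlist',
  children: [
    { cls: 'athing', children: [{ cls: 'link', text: 'https://example.com/a' }] },
    { cls: 'athing', children: [{ cls: 'link', text: 'https://example.com/b' }] },
    { cls: 'athing', children: [{ cls: 'link', text: 'https://example.com/c' }] }
  ]
};

// Return every descendant of `node` whose class equals `cls`, in document order.
function findAll(node, cls) {
  const found = [];
  for (const child of node.children || []) {
    if (child.cls === cls) found.push(child);
    found.push(...findAll(child, cls));
  }
  return found;
}

// Text of the first descendant matching `cls`, or undefined if nothing matches.
function first(node, cls) {
  const match = findAll(node, cls)[0];
  return match ? match.text : undefined;
}

// Scope to every node matching `scopeCls`, then build one object per match,
// resolving each schema field relative to that scoped node.
function collect(root, scopeCls, schema) {
  return findAll(root, scopeCls).map((scoped) => {
    const out = {};
    for (const [key, cls] of Object.entries(schema)) {
      out[key] = first(scoped, cls);
    }
    return out;
  });
}

// One object per scoped match, like the arrays in the output above.
const links = collect(page, 'athing', { link: 'link' });

// A single unscoped lookup stops at the first match.
const onlyFirst = first(page, 'link');

console.log(links);
console.log(onlyFirst);
```

The contrast between links and onlyFirst mirrors the confusion in the original question: a lookup that resolves to the first match yields one value, while scoping over every match yields one object per element.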