matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

crawl to nested url feature is not functioning #121

Closed madibalive closed 8 years ago

madibalive commented 8 years ago

The crawl-to-another-page feature is not working: it does not return the image part.

var Xray = require('x-ray');
var x = Xray();

x('http://google.com', {
   main: 'title',
   image: x('#gbar a@href', 'title'), // follow link to google images
})(function(err, obj) {
/*
   {
     main: 'Google',
     image: 'Google Images'
    }
*/
})
0xgeert commented 8 years ago

Untested, but I think you need to specify that you want multiple results, so:

        var Xray = require('x-ray');
        var x = Xray();

        x('http://google.com', {
           main: 'title',
           image: x('#gbar a@href', ['title']), // follow link to google images
        })(function(err, obj) {
        /*
           {
             main: 'Google',
             image: ['Google Images']
            }
        */
        })
madibalive commented 8 years ago

It is still not returning. I am using this logic, which definitely returns multiple items:

var html = 'http://www.myjoyonline.com/news.php';

x(html, 'ul.opinion-listings li', [{
    title: '.head .title a',
    desc: '.info',
    date: '.head',
    img: '.image-inner a img@src',
    url: '.head .title a@href',
    fullStory: x('.head .title a@href', ['div.side1 div.storypane p']),
}])(function(err, obj) {
    console.log(obj);
});

It still doesn't return.

madibalive commented 8 years ago

I tried using a nested callback, but it returns only null values.

var Xray = require('..');
var x = Xray();

var html = 'http://www.myjoyonline.com/news.php';

function getter(url ,callback) {
    setTimeout(function() {
        x(url, ['div.side1 div.storypane p'])(function(err, obj) {
            callback(obj);
        });
    }, 500);
}

x(html, 'ul.opinion-listings li', [{
    title: '.head .title a',
    desc: '.info',
    date: '.head',
    img: '.image-inner a img@src',
    url: '.head .title a@href'
        // fullStory: x('.head .title a@href', ['div.side1 div.storypane p']),
}])(function(err, obj) {
    for (var i = 0; i < obj.length; i++) {
        console.log(obj[i]);
        getter(obj[i].url,function(results) {
            var fullStory= results;
            obj[i].moreDetails = fullStory;
            console.log(obj[i]);
        });
    }

});
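
A side note on the loop above, separate from x-ray itself: because `i` is declared with `var`, every callback shares the same `i`. By the time the delayed callbacks fire, the loop has already finished, so `obj[i]` no longer refers to the item that started the request. Below is a minimal sketch of one way around this. The helper `attachDetails` and the stand-in `fakeGetter` are hypothetical names used only for illustration; neither is part of x-ray, and `fakeGetter` replaces the real request so the sketch runs offline.

```javascript
// Hypothetical helper (not part of x-ray): call an async getter once per
// item and attach its result to that same item.
function attachDetails(items, getter, done) {
    var pending = items.length;
    if (pending === 0) return done(items);
    // forEach gives every callback its own `item`, unlike a shared `var i`.
    items.forEach(function(item) {
        getter(item.url, function(results) {
            item.moreDetails = results;
            // Report back only after the last getter has answered.
            if (--pending === 0) done(items);
        });
    });
}

// Stand-in for the x-ray request (an assumption, so the sketch runs offline).
function fakeGetter(url, callback) {
    setTimeout(function() {
        callback(['story for ' + url]);
    }, 10);
}

attachDetails([{ url: 'a' }, { url: 'b' }], fakeGetter, function(items) {
    console.log(items);
});
```

In the original loop, `getter` from the comment above could be passed in place of `fakeGetter`; the counter is there so that the final callback sees every item filled in.
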
heady commented 8 years ago

@madibalive

In your example, the 'img' selector is broken, and this prevents the second x-ray call from running. Perhaps the anchor needs a direct child selector; as written, it isn't specific enough to match.
Fix:

var html = 'http://www.myjoyonline.com/news.php';
x(html, 'ul.opinion-listings li', [{
    title: '.head .title a',
    desc: '.info',
    date: '.head',
    img: 'div.image-inner > a img@src',
    url: '.head .title a@href',
    fullStory: x('.head .title a@href', ['div.side1 div.storypane p'])
}])
(function(err, obj) {
    console.log(obj);
});

This outputs an array of objects like:

{ title: 'Former GMA president has died',
    desc: 'Immediate Past President of the Ghana Medical Association, Dr. Adom Winful has passed away Thursday afternoon at the Tetteh Quarshie....',
    date: '\n31st December, 2015\nFormer GMA president has died\n',
    img: 'http://photos.myjoyonline.com/photos/news/201110/737707716_922165.jpg',
    url: 'http://www.myjoyonline.com/news/2015/December-31st/former-gma-president-has-died.php',
    fullStory:
     [ 'Immediate Past President of the Ghana Medical Association, Dr. Adom Winful has passed away Thursday afternoon at the Tetteh Quarshie hospital after a short illness.',
       'Greater Accra President of the GMA Dr. York confirmed the tragic news to Joy FM.',
       'Dr. Emmanuel Winful was president of the GMA from 2007 to 2011. During his second term, the GMA locked horns with government over the implementation of the Single Spine Salary Structure in 2011.',
       'Dr. Winful led a crippling strike by doctors in the public service insisting that;',
       '“The single spine salary seems to have brought every body up whilst the doctor remains static relative to each other. There have been serious distortions in the relativities of remunerations and that is our bone of contention".',
       'More soon...'
     ]
 }
madibalive commented 8 years ago

Thanks for the reply. I went with another library. Can I ask: x-ray is called as (x-ray fetch)(anonymous function here).
Is the underlying design using a self-executing function? Thanks again.

heady commented 8 years ago

@madibalive The creator of x-ray also made cheerio, which x-ray relies on for selectors.
I'm not in a position to address the underlying design, but it's self-contained on git.
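
On the call shape asked about above: `x(url, selector)(callback)`, as used in the examples in this thread, looks like a function that returns another function, which is then called with the callback. That is different from a self-executing function. Here is a minimal sketch of the pattern. `makeScraper` is a hypothetical stand-in and is not x-ray's actual implementation.

```javascript
// Hypothetical stand-in for the call shape x(source, selector)(callback).
// This is NOT x-ray's code; it only illustrates the pattern.
function makeScraper(source, selector) {
    // The first call only records its arguments and returns a function.
    return function run(callback) {
        // The second call does the "work" and reports Node-style (err, result).
        if (typeof source !== 'string') {
            return callback(new Error('source must be a string'));
        }
        callback(null, { source: source, selector: selector });
    };
}

// Two steps: nothing runs until the returned function is called.
var run = makeScraper('http://example.com', 'title');
run(function(err, obj) {
    console.log(err, obj);
});

// The same thing in one expression, matching the examples in this thread.
makeScraper('http://example.com', 'title')(function(err, obj) {
    console.log(err, obj);
});
```
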

madibalive commented 8 years ago

Hi, I closed it without testing it. It still doesn't return fullStory for me, as it does for you. How do I debug it, since no error is returned that I could try to fix? I cloned a fresh copy, but I get the error "can't find module x-ray-crawler" when I run node test.js (no files modified, just running the test.js provided in the clone).
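
One way to start debugging a silent failure like this is to check the `err` argument, which the examples in this thread never inspect, and then list which requested fields came back empty. This is only a sketch: `findMissingFields` and `inspect` are hypothetical helper names, not x-ray functions, and the sample data stands in for a real scrape result.

```javascript
// Hypothetical helper: report which requested fields came back empty.
function findMissingFields(items, fields) {
    var report = [];
    items.forEach(function(item, index) {
        var missing = fields.filter(function(field) {
            var value = item[field];
            return value === undefined || value === null || value === '' ||
                (Array.isArray(value) && value.length === 0);
        });
        if (missing.length > 0) {
            report.push({ index: index, missing: missing });
        }
    });
    return report;
}

// Callback in the same (err, obj) shape the examples pass to x-ray.
function inspect(err, obj) {
    if (err) {
        console.error('scrape failed:', err);
        return;
    }
    console.log(findMissingFields(obj || [], ['title', 'img', 'url', 'fullStory']));
}

// Sample data standing in for a scrape result.
inspect(null, [
    { title: 'A', img: 'a.jpg', url: 'http://example.com/a', fullStory: ['p1'] },
    { title: 'B', url: 'http://example.com/b', fullStory: [] }
]);
```

Passing `inspect` as the final callback in place of `function(err, obj) { console.log(obj); }` would show whether the problem is an error or an empty field such as `img` or `fullStory`.
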

heady commented 8 years ago

There is no x-ray-crawler, just x-ray. Full example:

var Xray = require('x-ray');
var x = Xray();

var html = 'http://www.myjoyonline.com/news.php';

x(html, 'ul.opinion-listings li', [{
    title: '.head .title a',
    desc: '.info',
    date: '.head',
    img: 'div.image-inner > a img@src',
    url: '.head .title a@href',
    fullStory: x('.head .title a@href', ['div.side1 div.storypane p'])
}])
(function(err, obj) {
    console.log(obj);
});
madibalive commented 8 years ago

:+1: I was using the wrong folder; it works perfectly now. I went with osmosis this time, but I will use this for my next project. Thanks :100:

umpirsky commented 8 years ago

@heady Your example from https://github.com/lapwinglabs/x-ray/issues/121#issuecomment-168470415 returns no fullStory for me with x-ray 2.0.3.

What version do you use?

alfonsodg commented 8 years ago

Hi, I am trying the example from heady and get nothing: the full story is not scraped. Apparently it is a version issue; I am using 2.0.3.