matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.88k stars 350 forks source link

// Nested crawling broken on 'master'. #181

Closed Muscot closed 8 years ago

Muscot commented 8 years ago

Description

I think I fixed this issue, I also added a follow.js example to test. Nested crawling broken on 'master'. When to merge 'bugfix/nested-crawling' #111

      // Needed to exit this without calling next, the problem was that it returned to 
      // the "finished" callback before it had retrieved all pending request. it should
      // wait for "return next(null, compact(out))"

x('http://www.imdb.com/', { title: ['title'], links: x('.rhs-body .rhs-row', [{ text: 'a', href: 'a@href', next_page: x('a@href', { title: 'title', heading: 'h1' }) }]) })(function(err, obj) { console.log(obj); });

Checklist

coveralls commented 8 years ago

Coverage Status

Coverage decreased (-0.3%) to 95.846% when pulling f1cb2e31125409d46750eb49919871401803fd4c on Muscot:master into 8eb8f11067542bd9673894560a32f270c7183f61 on lapwinglabs:master.

mcornella commented 8 years ago

Adding the return statement works for me. For some reason the coverage is decreased on this PR but all tests pass correctly.

Also, perhaps we could run a static server for tests to test this exact thing as well as these other tests.

GeoffreyEmery commented 8 years ago

can we get this checked in?

tribet84 commented 8 years ago

I tried your solution and still doesn't work for this example here: It works if you comment out the uprice. If you try to crawl then you get a nice "undefined is not a URL"

    x('https://www.simply.es/compra-online/especiales-2.html', {
      title: ['title'],
      items: x('.listing ul', [{
        main: '.descripcionProducto',
        link: '.descripcionProducto@href',
        uprice: x('.descripcionProducto a@href', '.precioKilo')
      }])
    })(function (err, obj) {
      if (err) { console.log(err); }
      console.log(obj);
    });

UPDATE: if you remove the square brackets also works.

Muscot commented 8 years ago

Hi,

On the "link" it seems like descripcionProducto is a anchor and on the uprice you search for a anchor in ".descripcionProducto"?

Shouldn't it be something like this? couldn't access the www.simply.es url.

x('https://www.simply.es/compra-online/especiales-2.html', { title: ['title'], items: x('.listing ul', [{ main: '.descripcionProducto', link: '.descripcionProducto@href', uprice: x('.descripcionProducto@href', '.precioKilo') }]) })(function (err, obj) { if (err) { console.log(err); } console.log(obj); });

tribet84 commented 8 years ago

Hi @Muscot , Excellent that works perfect! you made my day

Cheers, r

Muscot commented 8 years ago

I'm glad I could help! Cheers!

tribet84 commented 8 years ago

so you think guys we can have this merged and published?

GeoffreyEmery commented 8 years ago

just used this on another project and it is still broken.. Do we just need to up this test coverage in order for this to get checked in?

bisubus commented 8 years ago

@matthewmueller Friendly ping. Would you consider merging this PR? It appears that the package is terribly broken for advanced usage since 2.0.3, see #189. I guess that fellow devs have no choice but to stick to forks or roll back to 2.0.2.

darcyclarke commented 8 years ago

Just going to also throw in a friendly reminder here. This is the only real game changer for me in terms of functionality compared to rolling my own implementation with request + cheerio. Having the ability to follow nested links is key for simple pagination/scraping implementations.

Can we push this through?

matthewmueller commented 8 years ago

sorry for the delay, thanks for your help @Muscot !

matthewmueller commented 8 years ago

fixed in 2.3.1

if anyone wants to help maintain this library, so we can push these fixes through faster, let me know :-)