matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.88k stars, 348 forks

Support pagination based on function #54

Open davis opened 9 years ago

davis commented 9 years ago

what if there is no link (therefore no css selector) to the next page, but i know how to get to it?

e.g. ?page=1, ?page=2

could i use something like .paginate(function(num) { return '?page=' + num; })
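Until something like that exists in `.paginate()`, the same effect can be had outside it. A minimal sketch, assuming the page URLs can be built from a number: `pageUrls`, `scrapeAll`, and the `scrapePage` argument are hypothetical names, not part of x-ray, and `https://example.com/list` is a placeholder. `scrapePage` stands in for any function that takes a URL and returns a promise of an array of results.

```javascript
// Hypothetical helper (not part of x-ray): build one URL per page number.
// `toQuery` is the kind of function proposed above, e.g. num => '?page=' + num.
function pageUrls(base, toQuery, first, last) {
  const urls = [];
  for (let num = first; num <= last; num++) {
    urls.push(base + toQuery(num));
  }
  return urls;
}

// Hypothetical helper: scrape the URLs one after another and merge the results.
// `scrapePage(url)` is assumed to return a promise of an array of items.
async function scrapeAll(urls, scrapePage) {
  const all = [];
  for (const url of urls) {
    all.push(...(await scrapePage(url)));
  }
  return all;
}

// Example with a placeholder site.
const exampleUrls = pageUrls('https://example.com/list', (num) => '?page=' + num, 1, 3);
console.log(exampleUrls);
```

With x-ray, `scrapePage` could wrap the callback form, for example `(url) => new Promise((resolve, reject) => x(url, '.item', ['a@href'])((err, res) => (err ? reject(err) : resolve(res))))`; the selectors there are placeholders.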

matthewmueller commented 9 years ago

Yep that's a good idea, I'll probably get around to it when I have some time, but definitely accepting PRs in the meantime

anasqadrei commented 9 years ago

+1

monolithed commented 9 years ago

+1

monolithed commented 9 years ago

For example, I've the following case

<div class="foo" data-next="/page=2"> </div>

And I could not parse the next page :(

.paginate('.foo@data-next')

Does not work for me
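One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that `/page=2` is a relative path while the paginate step needs an absolute URL. A minimal sketch of a filter that resolves the attribute value against the site's base URL with the standard `URL` class; `makeAbsolute` and `https://example.com/list` are hypothetical.

```javascript
// Hypothetical filter factory: returns a filter that turns a relative path
// into an absolute URL, or false when the value cannot be resolved.
function makeAbsolute(baseUrl) {
  return function (value) {
    if (typeof value !== 'string' || value === '') return false;
    try {
      // WHATWG URL resolves `value` relative to `baseUrl`.
      return new URL(value, baseUrl).href;
    } catch (err) {
      return false;
    }
  };
}

const absolute = makeAbsolute('https://example.com/list');
console.log(absolute('/page=2')); // https://example.com/page=2
```

Registered as `XRay({ filters: { absolute } })`, it could be applied as `.paginate('.foo@data-next | absolute')`, in the same way as the filter-based workaround later in this thread. Whether that fixes this particular case is untested.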

ikeorizu commented 7 years ago

Hi guys, to follow up. This is what I have:

<a id="ctl00_CP_DataPagerDisplay1_next" class="margin-left-one" href="javascript:__doPostBack('ctl00$CP$DataPagerDisplay1$next','')">Next <i class="icon-bby-triangle-right"></i> </a>

How do I paginate this? Thanks

Aathi commented 7 years ago

Could anyone help me with this type of pagination?

[screenshot: Beer, Beers, Wines & Spirits listing on Bestway Wholesale]

I have tried .paginate('p.listingsnav @href'), but it doesn't work. Any suggestions? The base URL is https://www.bestwaywholesale.co.uk/index.php?s=0&sort=PO&cats_l2=81-791&np=4

dschreij commented 6 years ago

Does anyone know the status of this one? I'm bumping into the same problem at the moment. Maybe the paginate function could have a signature similar to the abort() function, so that it accepts a callback that is passed the scrape results of the current page. People could then work out the selector (or URL) of the next page in the callback and return it (instead of true/false, as abort() does). I can give this a go if this approach seems good.
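The callback idea can be sketched outside x-ray as a plain loop. This is not an existing x-ray API: `crawl`, `scrapePage`, and `nextUrl` are hypothetical names. `scrapePage(url)` is assumed to return a promise of that page's results, and `nextUrl(results, url)` returns the next URL, or a falsy value to stop.

```javascript
// Hypothetical crawl loop: scrape a page, ask the callback for the next URL,
// and repeat until the callback returns a falsy value or `limit` pages are done.
async function crawl(startUrl, scrapePage, nextUrl, limit) {
  const pages = [];
  let url = startUrl;
  while (url && pages.length < limit) {
    const results = await scrapePage(url);
    pages.push(results);
    // The callback sees the current page's results and decides where to go next.
    url = nextUrl(results, url);
  }
  return pages;
}
```

The `limit` argument plays the same role as `.limit()` in x-ray: it caps the number of pages even if the callback keeps returning URLs.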

jpzbkk commented 1 year ago

I know this is late, but I figured it out, just use a filter! I was banging my head around this, and I saw another package, x-ray-scraper which also didn't work. The filter worked perfectly. Enjoy!


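var XRay = require('x-ray');

// Assumption: pageLessUrl and url were not defined in the original snippet.
// Here both are the listing URL without its page number.
var pageLessUrl = 'https://testUrl.com/?page=';
var url = pageLessUrl;

var filters = {
  trim: function (value) {
    return typeof value === 'string' ? value.trim() : value;
  },
  paginateFx: function (value) {
    // The "next" link looks like https://testUrl.com/?no-cache=Covpq01DU3r5jMRVgOSa&page=2
    // Extract the page number and append it to pageLessUrl
    return typeof value === 'string' ? pageLessUrl + value.split('&page=')[1] : false;
  },
};

const x = XRay({ filters }).concurrency(10); // 10 concurrent requests

x(url + 1, '.product', [
  {
    description: '.product-name',
    link: '.product-name a@href',
    img: '.img-responsive @src',
  },
])
  .paginate('.pagination .next a@href | paginateFx')
  .limit(5)(function (err, results) {
    // Callback added so the snippet actually runs the scrape
    if (err) throw err;
    console.log(results);
  });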
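The `split('&page=')` step above assumes `page` is the last query parameter. A variant sketch that reads the parameter with the standard `URL` class instead; `makePaginateFilter` is a hypothetical name, and `https://testUrl.com/?page=` is an assumed value for `pageLessUrl`.

```javascript
// Hypothetical filter factory: builds a paginate filter that extracts the
// `page` query parameter from the "next" link, wherever it sits in the query.
function makePaginateFilter(pageLessUrl) {
  return function (value) {
    if (typeof value !== 'string') return false;
    let page;
    try {
      page = new URL(value, pageLessUrl).searchParams.get('page');
    } catch (err) {
      return false;
    }
    // No page parameter means there is no next page to follow.
    return page ? pageLessUrl + page : false;
  };
}

const paginateFx = makePaginateFilter('https://testUrl.com/?page=');
console.log(paginateFx('https://testUrl.com/?no-cache=Covpq01DU3r5jMRVgOSa&page=2&sort=asc'));
// https://testUrl.com/?page=2
```

Returning `false` when the parameter is missing keeps the filter from producing a URL that ends in `undefined`.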