matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.87k stars, 349 forks

Pagination is broken #305

Closed norbert-gaulia closed 5 years ago

norbert-gaulia commented 6 years ago

When pagination is enabled, it keeps looping between the first and second pagination elements. From page 1 it goes to page 2, then on page 2 it sees the pagination link back to page 1.
It then goes to page 1, loops again to page 2, and so on. It should keep an array of visited pagination URLs and not request them again. I'm trying to write a fix, but I can't figure out what is where. Can someone help?

norbert-gaulia commented 6 years ago

I guess a pull request won't be accepted, so here is a quick solution. If you need it, replace var url = resolve($, false, paginate, filters) with:

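    // Assumes `paginated` is an array of already-visited URLs declared outside this callback.
    var resolvedUrls = resolve($, false, [paginate], filters);
    var nextUrl = '';
    resolvedUrls.forEach(function (resolvedUrl) {
        // Check whether this URL was already requested.
        var found = false;
        paginated.forEach(function (paginatedUrl) {
            if (resolvedUrl === paginatedUrl) {
                found = true;
            }
        });
        if (!found) {
            nextUrl = resolvedUrl;
        }
    });
    // Remember the chosen URL so it is not requested again.
    paginated.push(nextUrl);
    var url = nextUrl;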
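
The same visited-URL idea can be sketched on its own, outside x-ray's internals. This is only an illustration under assumptions: `pickNextUrl` is a hypothetical helper, not part of x-ray, and the URLs are made up. Unlike the patch above, it uses a `Set` for the lookup, takes the first unvisited candidate instead of the last, and returns `null` when every candidate has already been visited, so the caller can stop paginating.

```javascript
// Hypothetical helper (not part of x-ray): returns the first candidate URL
// that has not been visited yet and records it as visited.
function pickNextUrl(candidates, visited) {
  for (const candidate of candidates) {
    if (!visited.has(candidate)) {
      visited.add(candidate);
      return candidate;
    }
  }
  // Every candidate was already visited: signal that pagination should stop.
  return null;
}

// Usage with made-up URLs: page 1 is already visited, and each page
// links to both page 1 and page 2.
const visited = new Set(['https://example.com/page/1']);
const links = ['https://example.com/page/1', 'https://example.com/page/2'];

const first = pickNextUrl(links, visited);  // picks page 2 and marks it visited
const second = pickNextUrl(links, visited); // nothing new is left
console.log(first, second);
```

Returning `null` rather than an empty string makes the "nothing left to visit" case explicit, which is one way to break the page 1 / page 2 loop described in this issue.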

SergeyBoychuk commented 6 years ago

You need to find a page where the pagination links don't change position. What I had to do was go to a page like 5-10 and then scrape from there, moving forward. Target the correct :nth-child and you should be fine :)