propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License

Pagination always double-downloads first page #37

Closed by jaypinho 10 years ago

jaypinho commented 10 years ago

Hi there,

First off: this is a very cool tool. Thanks so much for putting this together.

I'm a bit of a coding/scraping n00b, so forgive me if I'm missing something obvious here. But I've now tested multiple times using pagination for the index, and I believe there's a minor bug.

I'm using both "pagination_start_index" and "pagination_max_pages". (As a side note, the latter doesn't actually designate how MANY pages to paginate, just the highest page index it will go to, so it may be better to call it "pagination_end_index" or something similar.)

No matter what I choose, the paginator will eventually download the first page twice. So if I set pagination_start_index to 15 and pagination_max_pages to 18, it will download 15, 16, 17, 18, and then 15 again.
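To make the expected behavior concrete, here is a minimal sketch in plain Ruby. It does not call Upton at all: `pages_to_fetch` is a hypothetical helper, and it assumes (based only on what I've observed) that pagination_max_pages acts as an inclusive upper page index.

```ruby
# Hypothetical helper, not part of Upton: lists the page indices a paginator
# would request if the upper bound is treated as an inclusive page index.
def pages_to_fetch(start_index, max_page)
  return [] if max_page < start_index
  (start_index..max_page).to_a
end

expected = pages_to_fetch(15, 18)
observed = [15, 16, 17, 18, 15] # the sequence reported in this issue

puts "expected: #{expected.inspect}"
puts "observed: #{observed.inspect}"
puts "extra requests: #{(observed.size - expected.size)}"
```

With the settings above, I would expect four requests, one per page, rather than five.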

Thank you!

jeremybmerrill commented 10 years ago

Hi Jay,

That definitely sounds plausible.

Can you send me the code snippet so I can verify? (If it's sensitive and it'd make you more comfortable sharing, feel free to change the URL or send it to me privately at this username at gmail dot com -- I promise not to post it publicly, etc.)

And how do you know that the first page gets downloaded twice? Are its results just duplicated in the output? E.g. [["page", "15's", "output"], ["page", "16's", "output"], ["page", "17's", "output"], ["page", "18's", "output"], ["page", "15's", "output"]]. Or are you checking the logs on the server or watching your network traffic or something?
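If it's the output you're looking at, something like the following rough sketch would surface repeated entries. It's plain Ruby and doesn't depend on Upton; `duplicated_entries` is a made-up helper name, and the `results` array just mirrors the example shape above.

```ruby
# Hypothetical helper, not part of Upton: returns each entry that appears
# more than once in the scraped results.
def duplicated_entries(results)
  results.group_by { |entry| entry }
         .select { |_entry, occurrences| occurrences.size > 1 }
         .keys
end

results = [
  ["page", "15's", "output"],
  ["page", "16's", "output"],
  ["page", "17's", "output"],
  ["page", "18's", "output"],
  ["page", "15's", "output"]
]

puts duplicated_entries(results).inspect
```

Note that this only shows the output is duplicated. Two different pages could in principle return identical content, so server logs or network traffic would be the stronger evidence of a second download.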

jaypinho commented 10 years ago

Thank you! I've just sent you an email.

jeremybmerrill commented 10 years ago

Fixed in https://github.com/propublica/upton/commit/2d8288b1253c607837d03c51dcfd53c6f08d2820; thanks for the report!