speed up time to scrape?

derekeder commented 10 years ago

I noticed that the scrapelib requests_per_minute is set to 60. I ran the scraper and it eventually finished, but only after many hours. Looking at output.csv, which has 52,740 rows, I'd guess it took at least:

At 60 requests per minute (one page per second): 52,740 pages / 60 seconds / 60 minutes = 14.65 hours
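As a rough sketch, the same estimate can be computed for other rate limits. This assumes one request per page and that the rate limit is the only bottleneck; `estimated_hours` is a hypothetical helper name, not part of scrapelib.

```python
def estimated_hours(pages: int, requests_per_minute: int) -> float:
    """Lower-bound scrape time in hours, assuming one request per page
    and that the rate limit is the only bottleneck."""
    minutes = pages / requests_per_minute
    return minutes / 60


# Compare the current setting (60) with the faster rates discussed in this thread.
for rate in (60, 120, 180):
    print(rate, "requests/min:", round(estimated_hours(52740, rate), 2), "hours")
```

Real runs will take longer than this, since it ignores response time, retries, and parsing.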

A few ideas to speed up the scraping:

- Increase the requests_per_minute to 120 or 180. This will speed it up 2x or 3x respectively, if the scraped site can handle it.
- Enable scrapelib caching. This will only help when running it multiple times (which it sounds like you are doing), but caching the pages you've already seen means you don't have to make another request for them. Take a look at how @evz set caching up for IL political candidates.
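To illustrate the caching idea, here is a minimal standard-library sketch. It is not scrapelib's own cache, just the same principle: a page that is already on disk is never requested again. `CachedFetcher`, `fake_fetch`, and the example URL are all hypothetical names for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class CachedFetcher:
    """Wrap a fetch function so each URL is requested at most once."""

    def __init__(self, fetch: Callable[[str], str], cache_dir: Path) -> None:
        self.fetch = fetch
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.network_calls = 0  # counts real fetches, i.e. cache misses

    def _path_for(self, url: str) -> Path:
        # Hash the URL so it is safe to use as a file name.
        return self.cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url: str) -> str:
        path = self._path_for(url)
        if path.exists():
            # Cache hit: read from disk, no request made.
            return path.read_text(encoding="utf-8")
        self.network_calls += 1
        body = self.fetch(url)
        path.write_text(body, encoding="utf-8")
        return body


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request (hypothetical, for illustration only).
    return f"<html>page for {url}</html>"


fetcher = CachedFetcher(fake_fetch, Path(tempfile.mkdtemp()))
first = fetcher.get("http://example.com/hearing?id=1")
second = fetcher.get("http://example.com/hearing?id=1")
print("network calls:", fetcher.network_calls)
```

The second `get` is served from disk, so only one fetch is counted. Note that a cache like this only suits pages that do not change after they are first seen.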

nikzei commented 10 years ago

@rcackerman what do you think? Thanks Derek!

rcackerman commented 10 years ago

That sounds good. I'll try it this week and see if the calendar site can handle it.

Thanks!

rcackerman commented 10 years ago

I've upped the requests_per_minute to 180. It hasn't stalled out yet, so - fingers crossed - we're set there.

I don't think we need to cache, since we're only running this once a month. Is that off base, @derekeder?

derekeder commented 10 years ago

@rackerman - it depends. How many records are you scraping that you've already seen from previous months?

rcackerman commented 10 years ago

We are indeed all set. I believe it finished in a few hours.

I'm still a bit confused about how we should do this going forward. @nikzei - is the following correct?

- At the beginning of the month, there are parole hearings scheduled up to 6 months out. Scheduled meetings show up with a * in the interview date column and the interview decision.
- At the end of the month, those columns get filled in.
- Information from previous months is never changed after the fact; a parolee might go before the board again, but that is a new interview, not a change to the original entry.

So it might make sense to just scrape the last month of the hearings that have occurred and the last month of the scheduled hearings. Or do those scheduled hearings change? In that case we should re-scrape the 6 months every time.
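A rough sketch of picking which months to scrape on each run. It covers the safer option (last month plus everything scheduled up to 6 months out); `month_window` and its parameters are hypothetical names, and how the calendar site actually takes a month is not shown here.

```python
from datetime import date
from typing import List, Tuple


def month_window(today: date, months_back: int, months_ahead: int) -> List[Tuple[int, int]]:
    """Return (year, month) pairs from months_back before today's month
    through months_ahead after it, inclusive."""
    # Count months from year 0 so that year rollovers need no special casing.
    current = today.year * 12 + (today.month - 1)
    start = current - months_back
    end = current + months_ahead
    return [(index // 12, index % 12 + 1) for index in range(start, end + 1)]


# Last month's completed hearings plus hearings scheduled up to 6 months out.
window = month_window(date(2014, 9, 20), months_back=1, months_ahead=6)
print(window)
```

If scheduled hearings turn out never to change, `months_ahead` could be narrowed so that only the newly added month is scraped.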

nikzei commented 10 years ago

@rcackerman the first and last bullets are right. Columns get filled as the information comes in - they don't wait and fill the columns in at the end of the month; they post results within days of the decision. But for scraping purposes, it's the same, no?

I think that sounds like a good idea. My inclination though is to do full monthly scrapes for a few months in addition just to have something to test against. I'm happy to run these as well. What do you and @derekeder think - reasonable or unnecessary?
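One possible way to test a partial scrape against a full one is to compare the rows of the two outputs. This is only a sketch: the column names and sample rows below are made up for illustration, and real runs would read the two CSV files rather than inline strings.

```python
import csv
import io
from typing import Set, Tuple


def load_rows(csv_text: str) -> Set[Tuple[str, ...]]:
    """Read CSV text into a set of row tuples, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip header
    return {tuple(row) for row in reader}


# Hypothetical sample data; the real output.csv columns may differ.
full = "id,interview date,decision\n1,2014-08-05,GRANTED\n2,2014-09-10,*\n"
partial = "id,interview date,decision\n1,2014-08-05,GRANTED\n"

missing = load_rows(full) - load_rows(partial)  # rows the partial scrape lacks
extra = load_rows(partial) - load_rows(full)    # rows only in the partial scrape
print("missing:", missing)
print("extra:", extra)
```

If both sets stay empty for a few months, that would suggest the partial scrape is not losing anything.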

rackerman commented 10 years ago

Don't know who any of you are. Third email I have received in this string. Please remove me from your conversation.

rcackerman commented 10 years ago

Sorry, @rackerman - you were added to this conversation by accident. You can unsubscribe from the conversation on GitHub, but unfortunately I do not believe we can take you off the thread ourselves, because of how GitHub works. The unsubscribe button is in the top right of the screen. It looks like this:

[Screenshot of the unsubscribe button, 2014-09-22]

derekeder commented 10 years ago

Ah sorry - my fault! I meant to say @rcackerman earlier in the thread. Sorry @rackerman!

rcackerman commented 10 years ago

Ok, down to less than a half hour. Closing the issue!

derekeder commented 10 years ago

nice!

nikzei commented 10 years ago

Absolutely :)

rcackerman / parole-hearing-data

speed up time to scrape? #17