@rcackerman what do you think? Thanks Derek!
That sounds good. I'll try it this week and see if the calendar site can handle it.
Thanks!
I've upped the requests_per_minute to 180. It hasn't stalled out yet, so - fingers crossed - we're set there.
I don't think we need to cache, since we're only running this once a month. Is that off base, @derekeder?
@rackerman - it depends. How many records are you scraping that you've already seen from previous months?
We are indeed all set. I believe it finished in a few hours.
I'm still a bit confused about how we should do this going forward. @nikzei - is the following correct?
* ... in the interview date column and the interview decision. So it might make sense to just scrape the last month of hearings that have occurred and the last month of scheduled hearings. Or do those scheduled hearings change? In that case we should re-scrape the 6 months every time.
@rcackerman first and last bullet are right. Columns get filled as the information comes in - they don't wait and fill the columns in at the end of the month; they post results within days of the decision. But for scraping purposes, it's the same, no?
I think that sounds like a good idea. My inclination, though, is to do full monthly scrapes for a few months in addition, just to have something to test against. I'm happy to run these as well. What do you and @derekeder think - reasonable or unnecessary?
Don't know who any of you are. Third email I have received in this string. Please remove me from your conversation.
Sent from my iPhone
Sorry, @rackerman - you were added to this conversation by accident. You can unsubscribe from the conversation on GitHub, but unfortunately I don't believe we can take you off the thread ourselves, because of how GitHub works. The unsubscribe button is in the top right of the screen.
Ah sorry - my fault! Meant to say @rcackerman earlier in the thread. Sorry @rackerman!
Ok, down to less than a half hour. Closing the issue!
nice!
Absolutely :)
I noticed that the scrapelib requests_per_minute is set to 60. I ran the scraper and it eventually finished, but only after many hours. Looking at output.csv, which has 52,740 rows, I'd guess it took at least 52,740 requests / 60 per minute ≈ 879 minutes - roughly 14.5 hours.

A few ideas to speed up the scraping:

* Up requests_per_minute to 120 or 180. This will speed it up 2x or 3x respectively, if the scraped site can handle it.
* Cache responses, so records we've already seen don't get re-fetched on subsequent runs.