propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License
1.62k stars 113 forks source link

Recursive function causing a stack overflow #23

Closed esagara closed 10 years ago

esagara commented 10 years ago

https://github.com/propublica/upton/blob/master/lib/upton.rb#L314-L326

Will cause a stack overflow with large paginations >2300 or so. Possible solution:

def get_instance(url, pagination_index=0, options={})
  resp = self.get_page(url, @debug, options)
  i = pagination_index.to_i
  while !resp.empty?
    next_url = self.next_instance_page_url(url, i += 1)
    next_resp = self.get_page(next_url, @debug, options)
    break if next_url == url
    resp += next_resp
  end
  resp
end
jeremybmerrill commented 10 years ago

Makes total sense and at first glance I think your solution will work. Can you send me a pull request?

On Tue, Nov 5, 2013 at 6:34 PM, Eric Sagara notifications@github.comwrote:

https://github.com/propublica/upton/blob/master/lib/upton.rb#L314-L326

Will cause a stack overflow with large paginations >2300 or so. Possible solution:

def get_instance(url, pagination_index=0, options={}) resp = self.get_page(url, @debug, options) i = pagination_index.to_i while !resp.empty? next_url = self.next_instance_page_url(url, i += 1) next_resp = self.get_page(next_url, @debug, options) break if next_url == url resp += next_resp end resp end

— Reply to this email directly or view it on GitHubhttps://github.com/propublica/upton/issues/23 .

jeremybmerrill commented 10 years ago

Yo Eric, did your PR close this issue?

esagara commented 10 years ago

I think so, there was an issue somewhere in that where it would get caught in a loop. I am having another problem though. The sleep_time_between_requests does not seem to be working. Have you played around with it at all? Perhaps I am missing something in the syntax.

Eric

On Wed, Dec 18, 2013 at 3:54 PM, Jeremy B. Merrill <notifications@github.com

wrote:

Yo Eric, did your PR https://github.com/propublica/upton/pull/24 close this issue?

— Reply to this email directly or view it on GitHubhttps://github.com/propublica/upton/issues/23#issuecomment-30879699 .

esagara commented 10 years ago

Is it possible that the below line is not evaluating to true? I can see that both @verbose and @sleep_time_between_requests are being passed to the scraper, but the sleep time is not being implemented from what I can tell.

https://github.com/propublica/upton/blob/master/lib/upton.rb#L223

Eric

On Wed, Dec 18, 2013 at 9:49 PM, Eric Sagara esagara@gmail.com wrote:

I think so, there was an issue somewhere in that where it would get caught in a loop. I am having another problem though. The sleep_time_between_requests does not seem to be working. Have you played around with it at all? Perhaps I am missing something in the syntax.

Eric

On Wed, Dec 18, 2013 at 3:54 PM, Jeremy B. Merrill < notifications@github.com> wrote:

Yo Eric, did your PR https://github.com/propublica/upton/pull/24 close this issue?

— Reply to this email directly or view it on GitHubhttps://github.com/propublica/upton/issues/23#issuecomment-30879699 .

jeremybmerrill commented 10 years ago

I'm not sure exactly what's happening, but I noted it in #28.

Will look into it greater depth shortly. I'll write a test too :)