propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License
1.62k stars 113 forks source link

Downloading and Caching part #10

Closed kgrz closed 11 years ago

kgrz commented 11 years ago

I was trying to separate the downloading and caching code from the main upton.rb file. The separation would lead to easier testing and further expansion. Can we have that code extracted as a separate gem ( I was surprised to see that a gem with this functionality existed out there. Or may be it did and I didn't know). That would mean that the Upton code would depend on this external gem.

Would love to hear views for and against this.

jeremybmerrill commented 11 years ago

So are you thinking a gem that sits between the client and the web, caching all requests transparently?

kgrz commented 11 years ago

Yes.

Kashyap KMBC

jeremybmerrill commented 11 years ago

Seems like a good idea.

The only reason I can think of NOT to do that is that there may need to be a tight coupling of some higher-level Upton stuff (e.g. varying how a page is fetched based on whether it's an instance or an index, or by the index of the instance page being fetched). This can probably be accomodated with just an options hash passed to the fetch method of the new gem, so it's just something to keep in mind.

I'm not sure this would be my first priority, but I think it's a good idea. If you want to do it and submit a pull request, I'm happy to help, otherwise I'll get to it eventually.

I think separating out the get_page and fetch_page methods is all that would need to be done, along with including some of the instance-wide variables like @debug and @verbose.

kgrz commented 11 years ago

The only reason I can think of NOT to do that is that there may need to be a tight coupling of some higher-level Upton stuff (e.g. varying how a page is fetched based on whether it's an instance or an index, or by the index of the instance page being fetched). This can probably be accommodated with just an options hash passed to the fetch method of the new gem, so it's just something to keep in mind.

Yeah, I had that in mind. But I assure you this is different. Take a look at this for an idea: https://github.com/kgrz/cachify

The library does only two things:

  1. Makes the HTTP call to the website we specify
  2. Caches it for future usage.

Actually, the only reason I have to extract this functionality as a gem was just that there were no existing libraries for a fairly regularly sought-after functionality. :) Initially, I had all that logic inside Upton itself.

I think separating out the get_page and fetch_page methods is all that would need to be done, along with including some of the instance-wide variables like @debug and @verbose

Yes, that is what exactly I did. The @debug option (which is basically an alias for caching), is included in the library. I will add the verbose and logging parts to it next. The get_page is equivalent to Cachify::Download.(..opts..).get and it internally manages how and where the data is fetched.

And yes, I will work on this and won't waste your time :D

jeremybmerrill commented 11 years ago

Awesome, awesome! Super cool. You're totally right that it makes sense to pull this stuff out of Upton. Looking forward to seeing what comes of this.

Some of the settings in get_page around accept headers and encodings might come in handy to solve various issues...

kgrz commented 11 years ago

Okay, I will keep an eye on those :)

jeremybmerrill commented 11 years ago

Closing this issue, @kgrz, since this was solved, as I understand it, in 2764452156343610897398e24f676587587c17d4