propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License

Refactor API #5

Open adelevie opened 11 years ago

adelevie commented 11 years ago

Code:

require "bundler/setup"
require "upton"

url = "http://www.fcc.gov/encyclopedia/fcc-and-courts"
xpath = "//*[@id='node-18004']/div[2]/ul/li[1]/a"
Upton::Scraper.new(url)
  .scrape_to_csv("output.csv", &Upton::Utils.list(xpath, :xpath))

Error:

$ rescue scrape.rb 

Frame number: 4/11
Frame type: top

From: /Users/adelevie/programming/fcc-ogc-scraper/scrape.rb @ line 8 :

    3: require "upton"
    4: 
    5: url = "http://www.fcc.gov/encyclopedia/fcc-and-courts"
    6: xpath = "//*[@id='node-18004']/div[2]/ul/li[1]/a"
    7: 
 => 8: Upton::Scraper.new(url)
    9:   .scrape_to_csv("output.csv", &Upton::Utils.list(xpath, :xpath))

NoMethodError: undefined method `each_with_index' for "http://www.fcc.gov/encyclopedia/fcc-and-courts":String
from /Users/adelevie/.rvm/gems/ruby-1.9.3-p194/gems/upton-0.2.6/lib/upton.rb:256:in `scrape_from_list'

The README example uses a String as the argument to Upton::Scraper.new, but evidently something that responds to #each_with_index is expected here. It could be that the XPath is bad, but if so, the error shouldn't point at the String passed on line 8.

jeremybmerrill commented 11 years ago

@adelevie, the immediate fix is to wrap url in a single-element array:

Upton::Scraper.new([url])

In Scraper.new(), Upton expects either a list of page URLs to apply the XPath to, or a single index-page URL (plus a selector of some sort) that it can scrape to produce that list of URLs. I need to find a more intuitive way to explain and handle this, I know.
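
For reference, here's the original script with that one-line fix applied (same URL and XPath as above; only the array wrapping is new):

require "bundler/setup"
require "upton"

url = "http://www.fcc.gov/encyclopedia/fcc-and-courts"
xpath = "//*[@id='node-18004']/div[2]/ul/li[1]/a"

# Wrapping the URL in an array tells Scraper.new to treat it as the list
# of pages to scrape, rather than as an index page to be expanded.
Upton::Scraper.new([url])
  .scrape_to_csv("output.csv", &Upton::Utils.list(xpath, :xpath))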

kgrz commented 11 years ago

@jeremybmerrill I did some refactoring after facing the same issue. https://github.com/kgrz/upton/commit/4db45075bcff0083a64c96bf6525d0c9256d4723

Two changes:

  1. The main Scraper#new call would accept either a single URL or multiple URLs.
  2. If the user wants to use an index page, they would need to wrap it in Utils::Index.new("<some index url>", "<selector>", "xpath/css") and then pass that into the Scraper#new method.

u = Upton::Scraper.new("www.google.com", "http://www.fcc.gov/encyclopedia/fcc-and-courts")
u.scrape_to_csv(..)

i = Upton::Utils::Index.new("http://website.com/list_of_stories.html", "a#article-link", :css)
u = Upton::Scraper.new("www.google.com", i, "http://www.fcc.gov/encyclopedia/fcc-and-courts")

(Some things might break since I haven't written any tests; this is just to give an idea. :) )

jeremybmerrill commented 11 years ago

This is a smart idea, @kgrz. Lemme think for a bit on the issue, but I like your idea of an Index class.

kgrz commented 11 years ago

No hurry :)

Two more reasons to consider:

  1. It would create an easy channel for expanding the Index class's functionality if such a need arises in the future.
  2. It makes it easier to test the library and/or the Index interface separately.

jeremybmerrill commented 11 years ago

@kgrz,

Oh, I'm already convinced! Great idea, seriously.

How does this sound as a game plan? Would love your feedback and, if you're interested, your contributions.

I created the moreclasses branch for this stuff. I'd love a pull request from your branch to get started.

kgrz commented 11 years ago

Yep, that looks good. How about a little renaming, though?

Example:


Upton::Uri.new("<url>") do |i|
  i.login "username", "password"
end

# But what if the login fields look more like this?
# <input type="text" name="uid"></input>
# <input type="password" name="pass"></input>
# 

Upton::Uri.new("<uri>") do |i|
  i.login :uid => "username", :pass => "password"
end

# Index page

Upton::Uri.new("<uri>") do |i|
  i.selector "//table/tr"
  i.selector_type :xpath
end

This way, if the index page needs to be fetched after logging in, it can be done. For a deeper sense of how this would work, the Faraday gem has a similar block-based design.
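
For comparison, Faraday's configuration block looks roughly like this (adapted from its README; the middleware shown is just the standard example):

require "faraday"

conn = Faraday.new(:url => "http://example.com") do |faraday|
  faraday.request  :url_encoded             # form-encode POST params
  faraday.response :logger                  # log requests to STDOUT
  faraday.adapter  Faraday.default_adapter  # make requests with Net::HTTP
end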

kgrz commented 11 years ago

(That said, I agree that this is too different from where the library is right now :| )

kgrz commented 11 years ago

@jeremybmerrill Here is an example of what I was talking about: https://gist.github.com/kgrz/6095520.

In the call Upton::Scraper.new(<uri1>, <uri2>..), all the URIs would be wrapped by this class, so we'd have one thin wrapper for all the functionality we might need.

kgrz commented 11 years ago

Or, on second thought, something like this wouldn't change the API in a big way:

Instead of Index and Instance, we provide a single Uri class, and the user can override its get_index and/or get_instance methods:


a = Upton::Uri.new(<uri>, :xpath => "//table/tr")

a = Upton::Uri.new(<uri>) do
  def get_index
    # Jus' overriddin'
  end
end

a = Upton::Uri.new(<uri>, :xpath => "//table/tr") do
  def get_instance
    # Jus' overriddin'
  end
end

jeremybmerrill commented 11 years ago

I will take a look at this tomorrow and respond! Was out of the country and without internet access.

kgrz commented 11 years ago

I'd suggest skipping everything I said above for now. The plan you've listed is a really good starting point given the current stage and shape of the API; most of the suggestions I made are tangential to the problem at hand.

These would be the best two things to do for now:

  1. Get good test coverage. There's still a discussion going on in the other thread about this; once you finalize it, I'll add some tests. (I've added some using RSpec already to make the refactoring easier for me, but that can be changed if need be. RSpec and MiniTest::Spec have some things in common.)
  2. Refactor as per your suggestions.

dannguyen commented 11 years ago

I think the better solution is to be agnostic about whether a page is an "index" page or a data page, and to let the user define, via blocks, how the scraper should act on a given type of page. Right now it seems like it'd be awkward to scrape a nested index. Or what if an index page contains some data (other than links to end-point pages) that the user wants to include in a CSV file (for example, the :category parameter of an index of eBay searches)?

The Spidey gem is an example implementation of this: the user defines scraper behavior for each anticipated page type. For 95% of use cases that's just an index page plus each individual listed page, so it's the same amount of work for that kind of user either way, but it lets the Upton framework stay much simpler internally while allowing flexibility for power-user cases. I'm not sure how the implementation would differ much from what Spidey does, but the API would definitely have to be reworked quite a bit.
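
(Roughly, Spidey's per-page-type pattern looks like the sketch below; I'm writing this from memory, so treat the selectors and method names as illustrative rather than exact.)

require "spidey"

class EbaySpider < Spidey::AbstractSpider
  handle "http://www.ebay.com", :process_home

  # Index-like page: queue up category pages, carrying :category along.
  def process_home(page, default_data = {})
    page.search("a.category").each do |link|
      handle resolve_url(link.attr("href"), page),
             :process_category, default_data.merge(category: link.text)
    end
  end

  # Data page: record one row, including the carried :category value.
  def process_category(page, default_data = {})
    record default_data.merge(title: page.search("h1").text)
  end
end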

jeremybmerrill commented 11 years ago

I haven't had a chance to check out Spidey in-depth yet, but it's on my todo list.

I'll try to come up with a way to allow greater flexibility for the cases you mention. I want to strike a balance between enabling power-user cases and keeping super-simple scraping easy in the common cases. @kgrz's idea of breaking stashing out into another gem (on which Upton'll depend) will help power users use the useful parts of Upton without being constrained by the choices made to serve the simple cases.

jeremybmerrill commented 11 years ago

I'm thinking, too, about changing the API to be organized like ActiveRecord, where methods can be chained onto each other. I'm imagining something like (in principle):

Upton.new("http://website.com").index(".link-to-page-with-interesting-stuff").instance(&Utils.table("table#what-I-want-to-scrape")).to_csv("output.csv")

The initialize method would just create a list of URLs to scrape in the (returned) Upton object.

The index method would (however it's implemented) scrape each page in its object's URL list and replace that list with the URLs matched by the supplied XPath/CSS selector.

The instance method would return an Upton object containing the data scraped from the current list of URLs. Alternatively, it could just "fill in" the data portion of a mapping from URLs to scraped data.

Each Upton object would have access to methods like to_csv, to_tsv and whatever else.
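
A very rough sketch of how such an object might hang together (entirely hypothetical, with relative-URL resolution and error handling omitted; just to make the shape concrete):

require "nokogiri"
require "open-uri"
require "csv"

# Hypothetical sketch only -- not the current Upton API.
class ChainableScraper
  def initialize(*urls)
    @urls = urls   # the current list of URLs to scrape
    @rows = []     # data accumulated by #instance
  end

  # Replace the URL list with the hrefs matched by `selector` on each page.
  def index(selector)
    @urls = @urls.flat_map do |url|
      Nokogiri::HTML(URI.open(url)).css(selector).map { |a| a["href"] }
    end
    self
  end

  # Run the supplied block over each page's HTML; the block is assumed
  # to return an array of rows (arrays), as Utils-style procs would.
  def instance(&blk)
    @rows = @urls.flat_map { |url| blk.call(URI.open(url).read, url) }
    self
  end

  def to_csv(filename)
    CSV.open(filename, "w") { |csv| @rows.each { |row| csv << row } }
    self
  end
end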

That could be used for simple cases of scraping a single page:

Upton.new("http://simplewebsite.com").instance(&Utils.table("table#what-I-want-to-scrape")).to_csv("output.csv")

Or for nested index cases, too:

Upton.new("http://ebay.com/search").index("#category-label").index("#auction").instance("h1").to_csv("output.csv")

I'd love to hear y'all's thoughts. I'm (perhaps obviously) totally just fleshing this out right now.

kgrz commented 11 years ago

@dannguyen The one reason I would separate the Index and Instance parts is that it makes the repetitive cases a lot easier while still maintaining a small API. Maybe we could add the ability to customize how a page is parsed, similar to what Spidey does, to the Instance class itself. Then we'd have two built-in ways and one DIY way to define a scraper-URL instance, and the Scraper class would accept all three cases and simply run the #scrape method on the passed-in list.

@jeremybmerrill That looks neat. But aren't the #index and #instance methods redundant in this case? :) We could just have one method, maybe #instance or #scrape, which could take a Nokogiri selector or a Utils proc and then return the output.

dannguyen commented 11 years ago

@kgrz I think it's possible to keep the concept of #index- and #instance-type pages in the public-facing API, as conveniences. But under the hood, there should be no distinction: an HTML page is an HTML page, and the parsing is all the same, whether it's to find links or to extract text values from other kinds of nodes. The problem is that, as it is, the initialization of an Upton instance is unnecessarily confusing and results in a convoluted tangle of flows between index and instance parsing.
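
To illustrate, both could delegate to a single parse step under the hood, with #index and #instance as thin conveniences on top (a sketch of the idea, not Upton's actual internals):

require "nokogiri"

class PageParser
  # One parse path for every page, index and instance alike.
  def parse(html, selector, type = :css)
    doc = Nokogiri::HTML(html)
    type == :xpath ? doc.xpath(selector) : doc.css(selector)
  end

  # An "index" page is just a page whose matched nodes are links.
  def index(html, selector, type = :css)
    parse(html, selector, type).map { |node| node["href"] }
  end

  # An "instance" page is just a page whose matched nodes hold data.
  def instance(html, selector, type = :css)
    parse(html, selector, type).map(&:text)
  end
end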

dannguyen commented 11 years ago

In any case, I think rewriting and reorganizing the tests, as per #6, should be the proper course of action before radically refactoring the API. If the index/instance separation works fine for current users, there's no rush to change it just yet, but I think leaving it as-is will hinder Upton's maintainability and extensibility in the medium to long term... though honestly, it's about as evolved as it needs to be for what it does.

jeremybmerrill commented 11 years ago

@kgrz, I agree with @dannguyen that keeping index and instance methods is important, even if they (eventually) do mostly the same thing. A main goal of the Upton project is to make scraping in most cases easy. Given the choice between domain coverage and ease-of-use, ease-of-use ought to be the priority. Obviously, we should aim to maximize domain coverage given the ease-of-use constraint.

Therefore, I am totally open to ways of creating custom scraping techniques beyond instance and index, but I want to keep those as built-in, since they cover much of the domain. Creating some sort of superclass from which those two can inherit probably makes sense.

My idea with ActiveRecord-style querying would be to have a (yes, artificial) distinction between "things to scrape" (i.e. links) and "things having been scraped" (i.e. data), with an obvious mapping relationship between the two. The main difference between the index and instance methods here is that the index method expects a link selector and sets its return value to a list of links, whereas the instance method doesn't expect anything and sets its return value to mere data.

But yes, I agree that better test coverage is more important than radical refactoring (perhaps I'll start a branch, if I get inspired), though figuring out a roadmap is neat. That said, I've never written RSpec before, so I appreciate y'all's pull requests and critiques. I may have time over the weekend to work on this.

dannguyen commented 11 years ago

@jeremybmerrill I'd be careful with creating a chaining concept without getting a better idea of how people approach scraping beyond the simple scenario of indexpage -> data page.

In the simple scenario, method chaining wouldn't make things much simpler, because it's basically Upton.new.index.instance.to_csv. In more complex scenarios, chaining may not be the clearest way to describe the order of operations, so it's hard to see what the advantage would be. For example, if the user wanted to collect one CSV from a table on an intermediary page, as well as a separate CSV from tables on the farthest leaf pages, what would that look like with chaining?

Upton.new('index.html').instance('index-table-data.csv').to_csv('intermediary.csv').index('h2 a').instance('h5').to_csv('final-table-data.csv')

I admit that's an uncommon case, but presumably the point of reworking the API would be to allow for more complicated cases... and I think reworking it toward a chainability model may not make things significantly easier for the end user (and it would certainly complicate your life as the developer :) )

jeremybmerrill commented 11 years ago

@dannguyen, I've actually stumbled into one of those more complicated cases, so I plan to use it to think through how to be more flexible about those sorts of patterns.