adelevie opened this issue 11 years ago
@adelevie, the immediate fix is to put the url in a single-element array:

```ruby
Upton::Scraper.new([url])
```
In `Scraper.new()`, Upton expects either a list of pages to apply the XPath to, OR a single page URL (with a selector of some sort) that can be scraped to find that list of URLs. I need to find a more intuitive way to explain and cope with this, I know.
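To illustrate the two call forms (a sketch based on this thread; the example URLs and selector are made up):

```ruby
# Form 1: a list of instance pages, scraped directly.
Upton::Scraper.new(["http://example.com/article1.html", "http://example.com/article2.html"])

# Form 2: a single index URL plus a selector that locates the instance links.
Upton::Scraper.new("http://example.com/articles.html", "a.article-link", :css)
```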
@jeremybmerrill I did some refactoring after facing the same issue. https://github.com/kgrz/upton/commit/4db45075bcff0083a64c96bf6525d0c9256d4723
Two changes:

1. The `Scraper#new` call would accept either a single URL or multiple URLs.
2. `Utils::Index.new("<some index url>", "<selector>", "xpath/css")` would build an index object, which can then be passed into the `Scraper#new` method.

```ruby
u = Upton::Scraper.new("www.google.com", "http://www.fcc.gov/encyclopedia/fcc-and-courts")
u.scrape_to_csv(..)

i = Upton::Utils::Index.new("http://website.com/list_of_stories.html", "a#article-link", :css)
u = Upton::Scraper.new("www.google.com", i, "http://www.fcc.gov/encyclopedia/fcc-and-courts")
```

(Some things might break as I haven't written any tests; this is just to give an idea :) )
This is a smart idea, @kgrz. Lemme think for a bit on the issue, but I like your idea of an `Index` class.
No hurry :)
Two more reasons to consider:

1. It would create an easy channel for expanding the `Index` class's functionality if such a need arises in the future.
2. It makes it easier to test the library and/or the `Index` interface separately.
@kgrz,
Oh, I'm already convinced! Great idea, seriously.
How does this sound as a game plan? Would love your feedback and, if you're interested, your contributions.
- An `Index` class
- An `Instance` class (I described these basically as classes in README.md; why I didn't implement them that way at first eludes me.)
- A `Page` class that `Index` and `Instance` both inherit from.

I created the `moreclasses` branch for this stuff. I'd love a pull request from your branch to get started.
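A rough sketch of that hierarchy (my reading of the plan above; the constructor signature is an assumption, not actual code from the `moreclasses` branch):

```ruby
module Upton
  # Base class holding whatever Index and Instance share
  # (hypothetical signature; details would live on the branch).
  class Page
    def initialize(url, selector = nil, selector_method = :xpath)
      @url, @selector, @selector_method = url, selector, selector_method
    end
  end

  # A page whose selector yields links to the pages we actually want.
  class Index < Page; end

  # A page whose selector yields the data itself.
  class Instance < Page; end
end
```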
Yep, that looks good. How about a little renaming, though? Instead of `Instance` (which, I presume, is the class that would give the user options to log in) and `Index`, shall we have an `Upton::Uri` that would do both? Example:

```ruby
Upton::Uri.new("<url>") do |i|
  i.login "username", "password"
end
```
```ruby
# But what if the login fields look like this instead?
# <input type="text" name="uid">
# <input type="password" name="pass">
Upton::Uri.new("<uri>") do |i|
  i.login :uid => "username", :pass => "password"
end
```
```ruby
# Index page
Upton::Uri.new("<uri>") do |i|
  i.selector "//table/tr"
  i.selector_type :xpath
end
```
This way, if the index page needs to be fetched after logging in, it can be done. For a deeper idea of how this would work, the Faraday gem has a similar design.

(That said, I agree that this is too different from where the library is right now :| )
@jeremybmerrill Here is an example of what I was talking about: https://gist.github.com/kgrz/6095520.
In the call `Upton::Scraper.new(<uri1>, <uri2>, ..)`, all the URIs would be wrapped by this class, so we'd have one thin wrapper for all the functionality we might need.
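A hypothetical sketch of that wrapping (not actual gem code; the `Uri` constructor is an assumption):

```ruby
module Upton
  class Scraper
    def initialize(*uris)
      # Wrap every bare string in a Uri so the rest of the scraper
      # only ever deals with one kind of object.
      @uris = uris.map { |u| u.is_a?(Uri) ? u : Uri.new(u) }
    end
  end
end
```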
Or, on second thought, something like this wouldn't change the API in a big way: instead of `Index` and `Instance`, we provide `Uri`, and the user can override the `get_index` and/or `get_instance` methods on it.
```ruby
a = Upton::Uri.new(<uri>, {:xpath => "//table/tr"})

a = Upton::Uri.new(<uri>) do
  def get_index
    # Jus' overriddin'
  end
end

a = Upton::Uri.new(<uri>, {:xpath => "//table/tr"}) do
  def get_instance
    # Jus' overriddin'
  end
end
```
I will take a look at this tomorrow and respond! Was out of the country and without internet access.
I'd suggest skipping all of what I said above altogether for now. The plan you've listed is a really good starting point with respect to the current stage and look of the API. Most of the suggestions I made are tangential to the problem at hand.

These would be the best two things to do for now:
I think the better solution is to be agnostic about whether the page is an "index" or a data page, and let the user define via blocks how the scraper should act on a given type of page. Right now, it seems like it'd be awkward to scrape a nested index. Or what if an index page contains some data (other than links to endpoint pages) that the user wants to include in a CSV file (for example, the :category parameter of an index of Ebay searches)?

The Spidey gem is an example implementation of this: the user has to define scraper behavior for each anticipated page type. For 95% of the use cases, that's just an index and each individual listed page, so either way it's the same amount of work for that kind of user, but it allows the Upton framework to be a lot simpler internally while still allowing flexibility for power use cases. I'm not sure how the implementation would be much different from what Spidey does, but the API would definitely have to be reworked quite a bit.
I haven't had a chance to check out Spidey in-depth yet, but it's on my todo list.
I'll try to come up with a way to allow greater flexibility for the examples you mention. I want to strike a balance between enabling power-user cases and enabling super-simple scraping in common cases. @kgrz's idea of breaking out stashing into another gem (on which Upton'll depend) will help towards enabling power users to use the useful parts of Upton without being constrained by the choices made to enable simple cases.
I'm thinking, too, about changing the API to be organized like ActiveRecord, where methods can be chained onto each other. I'm imagining something like (in principle):
Upton.new("http://website.com").index(".link-to-page-with-interesting-stuff").instance(&Utils.table("table#what-I-want-to-scrape")).to_csv("output.csv")
- The `initialize` method would just create a list of URLs to scrape in the (returned) Upton object.
- The `index` method would (however it's implemented) scrape its object's list of links and replace it with the list of scraped URLs specified by the supplied XPath/CSS.
- The `instance` method would return an Upton object with the data scraped from the current list of URLs. Alternatively, it could just "fill in" the data portion of a mapping of URLs to the scraped data.
- Each Upton object would have access to methods like `to_csv`, `to_tsv`, and whatever else.
That could be used for simple cases of scraping a single page:

```ruby
Upton.new("http://simplewebsite.com").instance(&Utils.table("table#what-I-want-to-scrape")).to_csv("output.csv")
```

Or for nested index cases, too:

```ruby
Upton.new("http://ebay.com/search").index("#category-label").index("#auction").instance("h1").to_csv("output.csv")
```
I'd love to hear y'all's thoughts. I'm (perhaps obviously) totally just fleshing this out right now.
@dannguyen The one reason why I would separate the `Index` and `Instance` parts is so that the repetitive cases become a lot easier while still maintaining a small API. Maybe we could add the ability to customize the way a page is parsed, similar to what Spidey does, to the `Instance` class itself. Then we would have two built-in ways and one DIY way to define a scraper-URL instance, and the `Scraper` class would accept all three cases and simply run the `#scrape` method on the passed-in list.
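In other words, something like this (a hypothetical sketch; `MyCustomPage` and the duck-typed `#scrape` contract are made up to illustrate the DIY case):

```ruby
module Upton
  class Scraper
    def initialize(*pages)
      @pages = pages
    end

    # Built-in Index/Instance objects and DIY page objects are all
    # treated alike: each just needs to respond to #scrape.
    def scrape
      @pages.map(&:scrape)
    end
  end
end

# The DIY case: any object with a #scrape method works.
class MyCustomPage
  def initialize(url)
    @url = url
  end

  def scrape
    "custom handling of #{@url}"
  end
end

Upton::Scraper.new(MyCustomPage.new("http://example.com/weird")).scrape
# => ["custom handling of http://example.com/weird"]
```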
@jeremybmerrill That looks neat. But aren't the `#index` and `#instance` methods redundant in this case? :) We could just have one method, maybe `#instance` or `#scrape`, which could take a Nokogiri selector or a `Utils` proc and then return the output.
@kgrz I think it's possible to keep the concept of `#index` and `#instance` type pages in the public-facing API, as conveniences. But under the hood, there should be no distinction: an HTML page is an HTML page, and the parsing is all the same, whether it's to find links or to extract text values from other kinds of nodes. The problem is that as it stands, the initialization of an Upton instance is unnecessarily confusing, and results in a convoluted set of flows between index and instance parsing.
In any case, I think rewriting and reorganizing the tests, as per #6, should be the proper course of action before radically refactoring the API. If the `index` and `instance` separation works fine for current users, there's no rush to change it just yet, but I think not changing it hinders making Upton easy to maintain and extend in the medium to long term... though honestly, it's about as evolved as it needs to be for what it does.
@kgrz, I agree with @dannguyen that keeping index and instance methods is important, even if they (eventually) do mostly the same thing. A main goal of the Upton project is to make scraping in most cases easy. Given the choice between domain coverage and ease-of-use, ease-of-use ought to be the priority. Obviously, we should aim to maximize domain coverage given the ease-of-use constraint.
Therefore, I am totally open to ways of creating custom scraping techniques beyond `instance` and `index`, but I want to keep those as built-ins, since they cover much of the domain. Creating some sort of superclass from which those two can inherit probably makes sense.
My idea on ActiveRecord-style querying would be to have a (yes, artificial) distinction between "things to scrape" (i.e. links) and "things having been scraped" (i.e. data). There would obviously be a mapping relationship between the two. The main difference between the `index` and `instance` methods here is that the `index` method expects to have a link specified and sets its return value to be a list of links, whereas the `instance` method doesn't expect anything and sets its return value to be mere data.
But yes, I agree that better test coverage is more important than radical refactoring (perhaps I'll start a branch, if I get inspired), though figuring out a roadmap is neat. That said, I've never written RSpec before, so I appreciate y'all's pull requests and critiques. I may have time over the weekend to work on this.
@jeremybmerrill I'd be careful with creating a chaining concept without getting a better idea of how people approach scraping beyond the simple scenario of index page -> data page.
In the simple scenario, method chaining wouldn't make things much simpler, because it's basically `Upton.new.index.instance.to_csv`. In more complex scenarios, the chaining may not be the clearest way to describe the order of operations, so it's hard to think of what the advantage would be. For example, if there were a situation in which the user wanted to collect a CSV off of a table from an intermediary page, as well as a separate CSV from a table at the farthest leaf pages, what would that look like with chaining?

```ruby
Upton.new('index.html').instance('index-table-data.csv').to_csv('intermediary.csv').index('h2 a').instance('h5').to_csv('final-table-data.csv')
```
I admit that's an uncommon case but presumably, the point of reworking the API would be to allow for more complicated cases...but I think reworking it toward a chainability model may not make things significantly easier for the end user (and certainly would complicate your life as the developer :) )
@dannguyen , I've actually stumbled into one of those more complicated cases, so I plan on using that to do some thinking about how to be more flexible to account for those sorts of patterns.
Code:

Error:

The README example uses a `String` as the argument for `Upton::Scraper::new`, but for some reason something that responds to `#each_with_index` is expected here. It could be that the `xpath` isn't good, but if so, the error shouldn't be pointing to the `String` provided on line 8.
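For context, a reconstruction of the kind of call that triggers this (the original code and error blocks didn't survive; this is an assumption based on the discussion above, and the URL and selector are made up):

```ruby
require 'upton'

# Following the README: a String URL plus an XPath selector.
u = Upton::Scraper.new("http://www.example.com/articles.html", "//a", :xpath)
u.scrape do |article_html|
  puts article_html
end
# Fails with something along the lines of:
#   NoMethodError: undefined method `each_with_index' for String
# because Upton iterates over what it assumes is a collection of URLs.
# The workaround from the top of this thread: wrap the URL in an array,
#   Upton::Scraper.new(["http://www.example.com/articles.html"], "//a", :xpath)
```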