propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License
1.62k stars 113 forks source link

More test coverage, more idiomatic tests #6

Open brianflanagan opened 11 years ago

brianflanagan commented 11 years ago

This maybe isn't an issue, but seems like it should be avoidable. I imagine this should all be fairly straightforward to test without starting and stopping thin.

Would you be up for an rspec'd pull request?

jeremybmerrill commented 11 years ago

Hi Brian,

Yeah, the tests do take a long time. I'm not a testing expert, and I was mostly interested in getting tests written, rather than making them perfect. As such, I'm not married to test-unit; if you wanna submit a rspec pull request, I'd totally be open to it. Thanks for contributing!

brianflanagan commented 11 years ago

Awesome. Yeah - definitely no criticism intended. Tests > no tests.

jeremybmerrill commented 11 years ago

None taken. :) Looking forward to your PR.

jeremybmerrill commented 11 years ago

My colleague asks: Can we do minitest instead of rspec?

brianflanagan commented 11 years ago

Yeah, absolutely. However, I've never used minitest, so I may not be up for taking the lead on it. Will investigate after work today or tomorrow.

On Tue, Jul 23, 2013 at 3:13 PM, Jeremy B. Merrill <notifications@github.com

wrote:

My colleague asks: Can we do minitest instead of rspec?

— Reply to this email directly or view it on GitHubhttps://github.com/propublica/upton/issues/6#issuecomment-21416666 .

jeremybmerrill commented 11 years ago

Okay. Rumor has it that they're similar, but I don't really know from experience. How would you, under RSpec, avoid starting the server so often?

Perhaps I ought to only test the fetching-related methods with the server on, and the rest should just load the page from disk. (Because then I'm not actually testing stuff that needs teh server.)

I'd love to hear your thoughts...

jeremybmerrill commented 11 years ago

@brianflanagan, I converted the tests to RSpec in bcaa85703d; they now run way faster (31sec on my machine). If you have any other ideas on how to improve the tests, please don't hesitate to let me know (or submit a PR). :)

brianflanagan commented 11 years ago

Fantastic - I'll take a look. Apologies for flaking on this; still intend to submit a pull request or two.

jeremybmerrill commented 11 years ago

Awesome, thanks! If you have any RSpec tips wrt to what I wrote, I'm happy to hear them. I'm not a testing expert...

dannguyen commented 11 years ago

It's not the test framework that is slow, it's the use of a webserver. At some point, the dependence on Thin should be removed and replaced with Fakeweb to simulate HTTP responses: https://github.com/chrisk/fakeweb

The HTML parsed should also be vastly simplified...but totally understandable that it's more comfortable with working with HTML you've looked at many times before.

brianflanagan commented 11 years ago

+1

There's no need to test thin in the tests for upton. The server responses can probably be stubbed.

kgrz commented 11 years ago

I have some webmock/rspec stuff written for the downloading and caching part. I suppose after that, the server tests can be removed.

jeremybmerrill commented 11 years ago

Fakeweb looks awesome. Makes sense to replace my silly Thin stuff with it.

As far as simplifying the HTML goes, do y'all suggest bare HTML pages with nothing but the tested element, or just removing the Wikipedia/ProPublica headers, etc. on the test cases and leaving the scraped content mostly the same?

dannguyen commented 11 years ago

Yeah...including the full pages makes things slower on two fronts...one, with just the opening and parsing of the pages, and two, for other contributors to read. For example, some of the headlines scraped on the sample pages are repeated (in sidebars and widgets)...if you write the tests with a single purpose, such as: Does Upton use the given selector to find the expected hrefs, then it shouldn't matter what else is on the page.

A possible strategy at this time is to move the current full-page tests into their own directory...as you write actual unit tests, those full page tests should still pass (until they don't, whenever you've decided to change the API)

jeremybmerrill commented 11 years ago

You all will be pleased to learn that I got rid of that Thin server and used webmocks. :)

Removing the extraneous page structure is still tk.