tamouse / scrapers

Web site scrapers using Mechanize and other goodies.
http://github.com/tamouse/scrapers
MIT License

Reorganize and refactor #4

Open tamouse opened 9 years ago

tamouse commented 9 years ago

Several times while working in this repo I've thought "wouldn't it be neat if ..." and then been a little stymied, because acting on the idea would have meant restructuring things. I've also learned quite a bit since I started this collection, and many parts could use a face lift.

Structure Revision

The directory structure is haphazard, and the way command line bits are implemented is inconsistent.

Concept

#!/usr/bin/env ruby
require 'scrapers/rubytapas'
Scrapers::RubyTapas::CLI.new(ARGV).start

Example:

lib/
  scrapers/
    rubytapas/
      scraper.rb
      cli.rb
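
As a rough illustration of what `lib/scrapers/rubytapas/cli.rb` could look like behind the `CLI.new(ARGV).start` call above, here is a minimal sketch. It uses `OptionParser` from the standard library only to stay dependency-free; the later comment in this thread mentions Thor, where the usual entry point is the class method `CLI.start(ARGV)` instead. The `--destination` option and the `out:` parameter are assumptions made for the example, not something the repo is known to have.

```ruby
require 'optparse'

module Scrapers
  module RubyTapas
    # Hypothetical CLI wrapper; the option names are illustrative only.
    class CLI
      attr_reader :options

      # `out` is injectable so the class can be exercised without real stdout.
      def initialize(argv, out: $stdout)
        @argv = argv.dup
        @out = out
        @options = { destination: '.' }
      end

      # Parse the arguments; a real version would hand off to the scraper here.
      def start
        parser.parse!(@argv)
        @out.puts "Would download to #{options[:destination]}"
        options
      end

      private

      def parser
        OptionParser.new do |opts|
          opts.banner = 'Usage: rubytapas [options]'
          opts.on('-d', '--destination DIR', 'Download directory') do |dir|
            options[:destination] = dir
          end
        end
      end
    end
  end
end
```

Keeping the executable down to the three lines shown in the concept means every scraper's bin script looks the same, and all argument handling lives in a class that can be loaded and tested on its own.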

Things that are not specific to any one scraper will live directly in lib/ for easy namespacing. For example, the .netrc reader:

lib/
  netrc_reader.rb

require 'netrc'
class NetrcReader
  # ...
end
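
The body of `NetrcReader` is elided above, so the following is only a guess at one possible shape, not the author's implementation. It assumes the reader is built for a single host and returns a small credentials struct. The `netrc:` keyword and the `Credentials` struct are hypothetical. The default path relies on the netrc gem's `Netrc.read` and its `[host]` lookup, and loads the gem lazily so that tests can pass in a plain hash instead.

```ruby
# Hypothetical shape for lib/netrc_reader.rb.
class NetrcReader
  Credentials = Struct.new(:login, :password)

  # `netrc` is any object that answers `[host]` with a login/password pair
  # (or nil). When omitted, the netrc gem's reader for ~/.netrc is used.
  def initialize(host, netrc: nil)
    @host = host
    @netrc = netrc
  end

  def credentials
    login, password = source[@host]
    raise ArgumentError, "no .netrc entry for #{@host}" if login.nil?

    Credentials.new(login, password)
  end

  private

  # Load the gem only when no source was injected.
  def source
    @netrc ||= begin
      require 'netrc'
      Netrc.read
    end
  end
end
```

Something like `NetrcReader.new('example.com').credentials` would then read the real file, while a test can pass `netrc: { 'example.com' => ['user', 'secret'] }` and never touch the file system.
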
tamouse commented 9 years ago

So, after refactoring rubytapas, my tests have blown up with mocks and stubs. Clearly, I'm going about this the wrong way.

This time, I'm going to design things a bit differently. For the rubytapas downloader, I have the following needs:

What I need is a service gateway object that encapsulates the interactions with dpdcart. This should be built entirely separately from the rest of the script. Currently that functionality is split between a couple of classes, and there's a bit of repetition there.
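
A minimal sketch of such a gateway, under stated assumptions: the `http` collaborator and its `get(url, login, password)` interface are invented for illustration, and `FEED_URL` is a placeholder rather than the real dpdcart endpoint. The point is only that every network interaction goes through one object.

```ruby
module Scrapers
  module RubyTapas
    # Hypothetical gateway: the only class that knows how to talk to dpdcart.
    class Dpdcart
      # Placeholder URL, not the real endpoint.
      FEED_URL = 'https://example.invalid/feed'

      # `http` is any object responding to `get(url, login, password)` and
      # returning the response body as a String (an assumed interface).
      def initialize(login, password, http:)
        @login = login
        @password = password
        @http = http
      end

      # Fetch the raw RSS feed.
      def feed
        @http.get(FEED_URL, @login, @password)
      end

      # Fetch one attachment by URL.
      def download(url)
        @http.get(url, @login, @password)
      end
    end
  end
end
```

With this split, the rest of the script can be tested against a tiny fake gateway instead of a pile of mocks and stubs around the HTTP layer, which may help with the test problem described above.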

The scraping part is actually pretty simple, as it only cares about the attachments listed in each RSS feed item.
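
To make that concrete, here is a small sketch that assumes each feed item's description is an HTML fragment in which attachments appear as ordinary `<a href="...">name</a>` links. That layout is an assumption, as are the `Attachment` struct and the `Scraper` constructor. The regular expression is a deliberate simplification to keep the example free of dependencies; a real HTML parser would be more robust.

```ruby
module Scrapers
  module RubyTapas
    # Hypothetical value object for one downloadable file.
    Attachment = Struct.new(:name, :url)

    # Hypothetical scraper: pulls download links out of one item's HTML.
    class Scraper
      # Simplified pattern: captures the href and the link text of each anchor.
      LINK = %r{<a\s[^>]*href="([^"]+)"[^>]*>([^<]+)</a>}i

      def initialize(description_html)
        @html = description_html
      end

      # Returns one Attachment per link found, in document order.
      def attachments
        @html.scan(LINK).map { |url, name| Attachment.new(name.strip, url) }
      end
    end
  end
end
```

Because the scraper only takes a string, it needs no network access and no knowledge of dpdcart at all.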

So, new structure:

lib/
  scrapers/
    rubytapas/
      cli.rb -- class implementing the Thor script bits
      dpdcart.rb -- class implementing the dpdcart gateway object
      scraper.rb -- class implementing the feed scraper
      episode.rb -- class implementing an episode structure
      file_list.rb -- class implementing the download file list
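
The two data-holding files in this layout might look something like the sketch below. Everything here is a guess for illustration: the `Episode` fields, the `slug` helper, and the idea that `FileList` tracks which files still need downloading are assumptions, since the thread does not spell them out.

```ruby
module Scrapers
  module RubyTapas
    # Hypothetical episode record; the fields are guesses at what a feed item carries.
    Episode = Struct.new(:number, :title, :file_names, keyword_init: true) do
      # Directory-friendly name built from the number and title.
      def slug
        cleaned = title.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-|-\z/, '')
        format('%03d-%s', number, cleaned)
      end
    end

    # Hypothetical list of files still to download for one episode.
    class FileList
      include Enumerable

      # `already_downloaded` holds file names to skip.
      def initialize(episode, already_downloaded: [])
        @pending = episode.file_names - already_downloaded
      end

      def each(&block)
        @pending.each(&block)
      end
    end
  end
end
```

Plain structs like these carry no behavior that needs stubbing, so the CLI, gateway, and scraper can pass them around freely in tests.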