NewsScraper

Simple ETL news scraper in Ruby

A collection of extractors, transformers and loaders for a variety of news feeds and outlets.

Installation

Add this line to your application's Gemfile:

gem 'news_scraper'

And then execute:

$ bundle

Or install it yourself as:

$ gem install news_scraper
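
Once installed, require the gem in your code (the require path follows the gem name):

require 'news_scraper'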

Usage

Scraping

NewsScraper::Scraper#scrape will return an array of the transformed data for all Google News RSS articles for the given query.

Optionally, you can pass in a block and it will yield the transformed data on a per-article basis.

It takes one parameter, query:.

Array notation

article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape # [ { author: ... }, { author: ... } ... ]

Note: the array notation may raise NewsScraper::Transformers::ScrapePatternNotDefined (the domain is not in the configuration) or NewsScraper::ResponseError (a non-200 response was returned). For this reason, the block notation is recommended, as it lets you handle these cases on a per-article basis.
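
If you do stick with the array notation, these errors can be rescued around the call; a sketch (the error attributes mirror those used in the block notation example below):

begin
  article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape
rescue NewsScraper::Transformers::ScrapePatternNotDefined => e
  puts "#{e.root_domain} was not trained"
rescue NewsScraper::ResponseError => e
  puts "#{e.url} returned an error: #{e.error_code}-#{e.message}"
end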

Block notation

NewsScraper::Scraper.new(query: 'Shopify').scrape do |a|
  case a.class.to_s
  when "NewsScraper::Transformers::ScrapePatternNotDefined"
    puts "#{a.root_domain} was not trained"
  when "NewsScraper::ResponseError"
    puts "#{a.url} returned an error: #{a.error_code}-#{a.message}"
  else
    # { author: ... }
  end
end

How the Scraper extracts and parses the information is determined by scrape patterns (see Scrape Patterns).

Transformed Data

Calling NewsScraper::Scraper#scrape with either the array or block notation will yield transformed_data hashes. article_scrape_patterns.yml defines the data types that will be scraped for.

In addition, the url and root_domain (hostname) of the article are included in the hash.

Example

{
  author: 'Linus Torvald',
  body: 'The Linux kernel developed by Linus Torvald has become the backbone of most electronic devices we use to-date. It powers mobile phones, laptops, embedded devices, and even rockets...',
  description: 'The Linux kernel is one of the most important contributions to the world of technology.',
  keywords: 'linux,kernel,linus,torvald',
  section: 'technology',
  datetime: '1991-10-05T12:00:00+00:00',
  title: 'Linus Linux',
  url: 'https://linusworld.com/the-linux-kernel',
  root_domain: 'linusworld.com'
}

Scrape Patterns

Scrape patterns are XPath or CSS selectors used by Nokogiri to extract the relevant HTML elements.

Extracting each :data_type (see Example under Transformed Data) requires a scrape pattern. A few :presets are specified in article_scrape_patterns.yml.

Since each news site (identified by its :root_domain) uses different markup, scrape patterns are defined on a per-:root_domain basis.

Specifying scrape patterns for new, undefined :root_domains is called training (see Training).
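
To check which :root_domains are already trained, you can inspect the loaded scrape patterns; a quick sketch (the 'domains' key mirrors the domains section of article_scrape_patterns.yml):

NewsScraper.configuration.scrape_patterns['domains'].keys # => array of trained root domains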

Customizing Scrape Patterns

NewsScraper.configuration is the entry point for scrape patterns. By default, it loads the contents of article_scrape_patterns.yml, but you can override this with fetch_method, which accepts a proc.

For example, to override the domains section:

@default_configuration = NewsScraper.configuration.scrape_patterns.dup
NewsScraper.configure do |config|
  config.fetch_method = proc do
    @default_configuration['domains'] = { ... }
    @default_configuration
  end
end

Using this method you can override any part of the configuration individually, or replace it entirely; it is fully customizable.

This is useful for separate apps that track domain training themselves. If the configuration is not set up correctly, a newly trained domain will not be present in it and a NewsScraper::Transformers::ScrapePatternNotDefined error will be raised.
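
For example, an app that persists its own trained domains could merge them into the default configuration via fetch_method. A minimal sketch, where load_trained_domains is a hypothetical helper standing in for however your app stores its training data:

default_patterns = NewsScraper.configuration.scrape_patterns.dup
NewsScraper.configure do |config|
  config.fetch_method = proc do
    # load_trained_domains is a hypothetical app-level helper returning a hash
    # shaped like the domains: section of article_scrape_patterns.yml
    default_patterns.merge('domains' => default_patterns['domains'].merge(load_trained_domains))
  end
end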

It would be appreciated if any domains you train outside of this gem eventually make their way back as a pull request to article_scrape_patterns.yml.

Training

For each :root_domain, it is necessary to specify a scrape pattern for each of the :data_types. A rake task provides a CLI for appending new :root_domains using the :preset scrape patterns.

Simply run

bundle exec rake scraper:train QUERY=<query>

where the CLI will step through the articles relevant to <query> and their :root_domains.
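
For example, using the query from the earlier examples:

bundle exec rake scraper:train QUERY=Shopify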

Of course, the rake task simply creates an entry for the domain (via domain_entries), so as long as your application provides the same functionality, training can be overridden in your app. Just provide a domain entry like so:

domains:
  root_domain.com:
    author:
      method: method
      pattern: pattern
    body:
      method: method
      pattern: pattern
    description:
      method: method
      pattern: pattern
    keywords:
      method: method
      pattern: pattern
    section:
      method: method
      pattern: pattern
    datetime:
      method: method
      pattern: pattern
    title:
      method: method
      pattern: pattern

The preset options from article_scrape_patterns.yml can be obtained with this snippet:

include NewsScraper::ExtractorsHelpers
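# url is the URL of an article you want to generate scrape pattern options for
# (http_request is presumably provided by the ExtractorsHelpers include above)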

transformed_data = NewsScraper::Transformers::TrainerArticle.new(
  url: url,
  payload: http_request(url).body
).transform

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/news-scraper/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.