thbar / kiba

Data processing & ETL framework for Ruby
https://www.kiba-etl.org

Support symbol as class for source/transform/destination #80

Closed xiaohui-zhangxh closed 5 years ago

xiaohui-zhangxh commented 5 years ago

Hi,

Thanks for Kiba, which makes data processing much easier for us to manage. When writing Kiba jobs, we noticed that we often reference the same classes for source/transform/destination, and the class names are long to write and hard to remember. So we tried adding this feature to make life better:

# config/initializers/kiba.rb
Kiba.register_sources :http => Kiba::Tanmer::Sources::HttpClient
Kiba.register_transforms :parse_doc => Kiba::Tanmer::Transforms::ParseDoc,
                         :xml_select => Kiba::Tanmer::Transforms::XMLSelector,
                         :link => Kiba::Tanmer::Transforms::LinkToHash
Kiba.register_destinations :model => Kiba::Tanmer::Destinations::ModelStore

# app/etls/my_job.etl
Kiba.parse do
  source :http, 'https://google.com'
  transform :parse_doc
  transform :xml_select, selector: 'a'
  transform :link
  destination :model, Link, key: :href
end

See changes: https://github.com/tanmer/kiba/commit/325b23ed0b6e6dcd04ad4064ea3258166ba35989

Is it possible to merge this feature into Kiba?

thbar commented 5 years ago

Hi @xiaohui-zhangxh!

Thanks for using Kiba!

For the time being, I will not merge this into Kiba itself, because I want to keep Kiba's core very lean for as long as possible, to make sure it's easy to maintain for the coming years (that's a high priority for me).

That said, do not despair :smile: as there is a way to achieve what you are doing, without modifying Kiba itself & without big modifications.

I did a bit of work and came up with this, which respects block transforms if you also want them:

module DSLExtensions
  module Registry
    def self.setup(context)
      context.instance_eval do
        # NOTE: this could also likely be done with some form of included/extended hook
        # and method aliasing, in a way or another.
        @source_before_registry = method(:source)
        @transform_before_registry = method(:transform)
        @destination_before_registry = method(:destination)
        extend DSLExtensions::Registry
      end
    end

    def register_sources(hash)
      @sources_mapping = hash
    end

    def register_transforms(hash)
      @transforms_mapping = hash
    end

    def register_destinations(hash)
      @destinations_mapping = hash
    end

    def source(key, *args)
      klass = @sources_mapping.fetch(key, key)
      @source_before_registry.call(klass, *args)
    end

    def transform(*args, &block)
      if block
        @transform_before_registry.call(&block)
      else
        # Split off the key; keep every remaining argument for the component.
        key, *remaining_args = args
        klass = @transforms_mapping.fetch(key, key)
        @transform_before_registry.call(klass, *remaining_args)
      end
    end

    def destination(key, *args)
      klass = @destinations_mapping.fetch(key, key)
      @destination_before_registry.call(klass, *args)
    end
  end
end
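
The NOTE inside `setup` mentions that a hook-based approach might also work. Here is a minimal sketch of one such variant, assuming the DSL methods are ordinary instance methods of the context's class: a module added with `extend` sits in front of the class in method lookup, so `super` reaches the original method without saving it first. `FakeContext`, `SuperRegistry` and `Upcase` are hypothetical stand-ins, not part of Kiba.

```ruby
# Hypothetical stand-in for the object that evaluates the job DSL.
class FakeContext
  attr_reader :calls

  def initialize
    @calls = []
  end

  def transform(klass = nil, *args, &block)
    @calls << [klass || block, args]
  end
end

# Hypothetical variant of the registry that relies on `super`.
module SuperRegistry
  def register_transforms(hash)
    @transforms_mapping = hash
  end

  def transform(key = nil, *args, &block)
    # Block transforms carry no key, so pass them through untouched.
    return super(&block) if block

    # Unknown keys (such as a class) fall back to themselves.
    super((@transforms_mapping || {}).fetch(key, key), *args)
  end
end

class Upcase; end # placeholder component

context = FakeContext.new
context.extend(SuperRegistry)
context.register_transforms(upcase: Upcase)
context.transform(:upcase, 1, 2)
context.transform(String)
context.transform { |row| row }
```

Whether this is preferable to capturing the methods up front is a matter of taste; the captured-method version makes the dependency on the original DSL methods more explicit.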

As an example, let's imagine we have the following Kiba components:

class MultiplyTransform
  def initialize(factor)
    @factor = factor
  end

  def process(row)
    row * @factor
  end
end

class ShowTransform
  def process(row)
    puts row.inspect
    row
  end
end

class ArrayDestination
  def initialize(array)
    @storage = array
  end

  def write(row)
    @storage << row
  end
end

You can then use that inside your job definition (which shows a mix of symbol declarations, but also class & block):

require 'kiba'
require 'kiba-common/sources/enumerable'

result = []
Kiba.run(Kiba.parse do
  DSLExtensions::Registry.setup(self)

  register_sources(
    enumerable: Kiba::Common::Sources::Enumerable
  )

  register_transforms(
    multiply: MultiplyTransform,
    show: ShowTransform
  )

  register_destinations(
    array: ArrayDestination
  )

  source :enumerable, (1..5)
  transform :multiply, 10
  transform MultiplyTransform, 10
  transform { |r| r * 10 }
  transform :show

  destination :array, result
end)

Here the mappings are defined inside each job for extra flexibility, but you could also have the setup call use a general mapping, defined once in a module somewhere, e.g.:

module Tanmer
  module Mappings
    TRANSFORM_REGISTRY = { ... }.freeze
    SOURCE_REGISTRY = ...
    DESTINATION_REGISTRY = ...
  end
end
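
As a rough sketch of how such shared, frozen mappings might behave: a job can layer its own entries on top with `merge`, which returns a new hash and leaves the shared constant untouched. All component classes below (`CsvSource`, `TrimTransform`, `StdoutDestination`, `UpcaseTransform`) are hypothetical placeholders, and the registry contents are made up for illustration.

```ruby
# Hypothetical placeholder components.
class CsvSource; end
class TrimTransform; end
class UpcaseTransform; end
class StdoutDestination; end

module Tanmer
  module Mappings
    # Frozen so that no job can mutate the shared registries by accident.
    SOURCE_REGISTRY = { csv: CsvSource }.freeze
    TRANSFORM_REGISTRY = { trim: TrimTransform }.freeze
    DESTINATION_REGISTRY = { stdout: StdoutDestination }.freeze
  end
end

# Per-job additions go into a new hash; the shared one stays as it was.
job_transforms = Tanmer::Mappings::TRANSFORM_REGISTRY.merge(upcase: UpcaseTransform)

# Lookup falls back to the key itself, so plain classes still work.
resolved = [:trim, :upcase, Integer].map { |key| job_transforms.fetch(key, key) }
```

Inside a job, the constants could then be passed to `register_sources`, `register_transforms` and `register_destinations` right after `DSLExtensions::Registry.setup(self)`.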

Also be aware that another way to have global shared setup is to do something like:

module Tanmer
  module BaseKibaJob
    # Defined on the module itself so it can be called as Tanmer::BaseKibaJob.setup
    def self.setup(config, &declaration)
      Kiba.parse do
        # setup registry here
        ...
        # then let the caller issue more declarations
        instance_eval(&declaration)
      end
    end
  end
end

To be used with:

job = Tanmer::BaseKibaJob.setup(config) do
  transform :multiply # this will work here
end
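
The mechanism behind this is `instance_eval(&declaration)`: the caller's block is evaluated with the DSL context as `self`, so whatever was set up earlier in that context is visible to it. Below is a minimal sketch of just that mechanism, using a hypothetical `MiniContext` and `BaseJob` instead of Kiba's real parsing context.

```ruby
# Hypothetical stand-in for the DSL context; it only records declarations.
class MiniContext
  attr_reader :declared

  def initialize
    @declared = []
    @registry = {}
  end

  def register_transforms(hash)
    @registry = hash
  end

  def transform(key, *args)
    @declared << [@registry.fetch(key, key), args]
  end
end

class Multiply; end # placeholder component

module BaseJob
  # Shared setup first, then the caller's own declarations.
  def self.setup(&declaration)
    context = MiniContext.new
    context.register_transforms(multiply: Multiply)
    # Run the caller's block with `context` as self.
    context.instance_eval(&declaration)
    context
  end
end

job = BaseJob.setup do
  transform :multiply, 10 # resolved through the shared registry
end
```

One consequence of `instance_eval` is that `self` inside the block is the context, so instance variables and methods of the calling object are not reachable there without an explicit receiver; local variables still are.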

Last word: I'd still recommend implementing some specs around the registry behaviour, so that you are warned if major changes occur in Kiba itself.
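
A possible starting point for such specs, as a sketch only: it uses Minitest and a hypothetical `RecordingContext` in place of Kiba's DSL context, and it tests a trimmed copy of the registry (transforms only) so that the snippet is self-contained. `Multiply` is a placeholder class.

```ruby
require "minitest/autorun"

# Trimmed copy of the registry from this thread (transforms only, using
# `key, *rest = args`); a real spec would require the project's own file.
module DSLExtensions
  module Registry
    def self.setup(context)
      context.instance_eval do
        @transform_before_registry = method(:transform)
        extend DSLExtensions::Registry
      end
    end

    def register_transforms(hash)
      @transforms_mapping = hash
    end

    def transform(*args, &block)
      return @transform_before_registry.call(&block) if block

      key, *rest = args
      @transform_before_registry.call(@transforms_mapping.fetch(key, key), *rest)
    end
  end
end

# Hypothetical stand-in for the DSL context: it records what it receives.
class RecordingContext
  attr_reader :transforms

  def initialize
    @transforms = []
  end

  def transform(klass = nil, *args, &block)
    @transforms << [klass || block, args]
  end
end

class Multiply; end # placeholder component

class RegistrySpec < Minitest::Test
  def setup
    @context = RecordingContext.new
    DSLExtensions::Registry.setup(@context)
    @context.register_transforms(multiply: Multiply)
  end

  def test_symbol_resolves_to_registered_class
    @context.transform(:multiply, 10, 20)
    assert_equal [[Multiply, [10, 20]]], @context.transforms
  end

  def test_class_is_passed_through_unchanged
    @context.transform(String)
    assert_equal [[String, []]], @context.transforms
  end

  def test_block_transform_is_kept
    @context.transform { |row| row }
    assert_kind_of Proc, @context.transforms.first.first
  end
end
```

A stand-in context only checks the registry's own logic. To be warned about changes in Kiba itself, the same kind of assertions would need to run a small job through `Kiba.parse` and `Kiba.run` and check the rows that reach the destination.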

@xiaohui-zhangxh hope this will work nicely for you! Let me know how it goes. I will keep this issue open until I get your feedback.

xiaohui-zhangxh commented 5 years ago

Wow, thanks for writing such detailed code for me. I'm a Vue fan, and when I saw the coding style below, I realized that Kiba is planning to be a big framework that lets programmers share their sources/transforms/destinations, much like Vue supports extensions. I will change my code to follow this guide.

module DSLExtensions
  module Registry
    def self.setup(context)
    end
  end
end

thbar commented 5 years ago

Thanks! I will close the issue then!

Indeed the focus is on components & DSL extensions re-use - I will provide more guides in the future around that area.

Thanks for your feedback as well, appreciated!