molybdenum-99 / tlaw

The Last API Wrapper: Pragmatic API wrapper framework
MIT License

Extract Response Processing #7

Open joelvh opened 6 years ago

joelvh commented 6 years ago

The goal is to extract all processing of the response in order to implement ETL-like post processing and streaming of HTTP requests.

Changes:

joelvh commented 6 years ago

@zverok this is mainly a placeholder PR to get some feedback on the direction of things. This is primarily meant to keep things backward-compatible (except for classes whose namespace changed, which should not affect users of the DSL).

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.1%) to 95.856% when pulling 01fbcc3e8e74f0447859d38cfb61d0f04b7e84db on joelvh:master into a4c747004852671dc60ab45880737703916b375b on molybdenum-99:master.

joelvh commented 6 years ago

@zverok your comments have been addressed except for 2 I left comments for.

The Processors namespace is mostly a placeholder right now.

Regarding your reasoning, I think it all makes sense. In terms of my goals, I would want to keep the current TLAW behavior and DSL intact. However, the underlying architecture would become more modular and focused on building pipelines (or middleware) to handle processing of HTTP responses. Ideally, the DSL would have process (post_process) helper methods to build up the ETL-like transformations, so out of the box, TLAW has most of what people want for simple situations.

The additions I'm envisioning are not far from what is already there. However, I would want to achieve:

I'm developing an "abstract" ETL pipeline project that can encapsulate the workflows and use existing gems under-the-hood to execute parts of the process (e.g. TLAW can be used to create an API client to stream data into an ETL pipeline)

So all that to say that the current TLAW behavior would ideally stay intact or if it makes sense, some default behavior may evolve.

zverok commented 6 years ago

However, the underlying architecture would become more modular and focused on building pipelines (or middleware) to handle processing of HTTP responses.

Yes, that's one of my goals (that I've never had time to fully embrace for this project). So, let's go with it :+1:

BTW, as you've mentioned it, I, in fact, developed Hm somewhat with TLAW in mind, so probably (on some further steps) the possibility could be investigated to do things in "Hm-like" style:

transform('items', :*, 'date', &Date.method(:parse))
# instead of
transform_items('items') do
  transform 'date', &Date.method(:parse)
end

(The former now looks "cleaner" for me, though it contradicts open_weather_map example, with WEATHER_PROCESSOR reused on several levels.)
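For illustration, the path-based form could be approximated on plain hashes like this (a rough sketch roughly in the spirit of Hm; the deep_transform name and the :* "each element" convention are my own assumptions, not Hm's actual API):

```ruby
# Rough approximation of transform('items', :*, 'date', &fn) on plain
# nested hashes/arrays; :* means "apply to each element of the array".
require 'date'

def deep_transform(data, path, &fn)
  key, *rest = path
  if key == :*
    data.map { |item| rest.empty? ? fn.call(item) : deep_transform(item, rest, &fn) }
  elsif rest.empty?
    data.merge(key => fn.call(data[key]))
  else
    data.merge(key => deep_transform(data[key], rest, &fn))
  end
end

doc = { 'items' => [{ 'date' => '2018-05-01' }, { 'date' => '2018-05-02' }] }
result = deep_transform(doc, ['items', :*, 'date'], &Date.method(:parse))
result['items'].first['date'] # a Date object
```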

But that's for future (also, transform_items maybe should be transform_each?.. Looks more Ruby, don't remember why I wanted _items suffix in the first place.)

PS: On a totally unrelated note, yesterday evening younger colleague asked for help with some Magento integration, and we struggled for a few hours with rough SOAP/Savon, and then I googled for ready solution, and suddenly found magento-rb and everything worked fine just out-of-the-box, and then I was like "Hey, I know this guy!" So, thanks, I guess :)

joelvh commented 6 years ago

@zverok Ha! What a funny coincidence that you guys found that Magento project. Glad it was of help!

Regarding multiple PRs: I tried to largely group updates into commits, so I would initially recommend looking through the commits as a means to understand the changes. If that isn't possible, I'll see about multiple PRs, if that's OK?

Regarding transformations: Yes, Hm was the project, and as a matter of fact, it looks like you just started it very recently. Solnic's Transproc is the other one, and both have lots of common transformations built in. My goal is to develop a set of gems that help solve various use cases and can all be used together. Primarily, the goal is to provide DSLs to construct ETL workflows, data migrations, report generation, and other things in an abstract way that easily lets you use the tools you want for things like transformations. (For instance, take the TLAW DSL for defining the workflow and transformations. The transformations could be performed by Hm, Transproc, or both.) A bit ambitious, I realize, but that's the goal. We can talk more about this if you're interested... for now I'm using my reporting gem as a workspace for these concepts, and will eventually abstract various aspects of that project into individual gems that compose for individual use cases.

Thanks for reviewing things!

joelvh commented 6 years ago

@zverok I should note that the initial abstraction in reporting builds on top of Kiba, but Kiba is not a direct dependency. The Pipeline DSL allows for constructing the transformations, and then the job can be run by Kiba (or another framework).

Some examples in the project to illustrate the ideas:

  1. Developing DSL to construct "phases" of a pipeline (heavily influenced by Kiba, but going to evolve from there to extend beyond Kiba's capabilities)
  2. A simple DSL for adding a transformation
  3. Example of some transforms that Hm or Transproc are better suited to provide the implementation for
  4. Example of what a job definition might look like (needs to be filled in)
  5. And finally, here's how the pipeline is actually run by Kiba

I like how you developed a DSL object model that wraps around the domain object model. I will explore that approach a bit more in order to not mix helper methods with the data model. (As you will see in some of the changes I made to the transforms in TLAW, I moved what I would consider helpers to the DSL namespace in this manner.)

Thanks!

zverok commented 6 years ago

Hey, sorry for being silent for a week, life and work took too much of my attention.

I generally like the notion of "pipelines", and lately, too, have come to think that TLAW should fit into some pipelines (therefore, requiring inheritance from it is probably not the best possible convention). My main use case (which led me to "inventing" the library) is the reality project (it is a work in progress; here is the latest presentation from RubyConf India). It is basically made of "describers" (high-level homogeneous wrappers) for external data services. A describer can also be seen as a pipeline (e.g. fetch Wikipedia page → parse it to wiki-AST → fetch some semantic information → wrap it into high-level value objects like "Temperature" or "Geographical coordinate").

TLAW usage there can be seen, for example, here. I have mixed feelings about how it fits in: on the one hand, it was really easy to describe the data source I wanted; on the other, I needed an "internal" class to descend from TLAW, while the ability to somehow "include" the DSL into the pipeline would probably be better.

joelvh commented 6 years ago

I was out of town and got a chance to look at your links now. I saw Reality when I looked through your projects a while back and thought it was very interesting. Your slides are also helpful to walk through your thinking. I have some thoughts around similar ideas of what can be built on top of TLAW and pipelines.

Reality is more designed around entities that can be used across data sources, which is neat, but also means that each data source needs to have various higher level translations to make them work together (presumably your "describers"). I will give more thought to this and how it can be done more generically and if it's in the realm of where I've been thinking of going. However, the way you are using TLAW as the API client and building on it is exactly what I like.

As I think about data sources, transformations and persistence (more related to ETL), I also think a little bit about making it more event-oriented and potentially making some parts async. However, I think that concerns higher-level pipelines than the ones we are currently talking about.

I will play more with TLAW concepts and how some of my goals in the previous comment fit together as I get my head around this again. It sounds like you are open to determining TLAW's place in a system of pipelines.

Let me know if you have any specific thoughts about the relationship between TLAW, pipelines and how to integrate multiple libraries together (e.g. TLAW for the API client, interchangeable Hm or transproc for pre-built transformations, etc).

joelvh commented 6 years ago

@zverok I've been looking more into ROM.rb and it looks like the HTTP adapter does what we are doing with request and response handling. There is a ROM HTTP adapter and I also found an example of a Faraday ROM HTTP adapter example.

ROM.rb is also very functional and pipeline-based. It seems like it could be a good foundation for ETL as I've been thinking about, and TLAW would be great for customizing the HTTP client/adapter. I'm going to give more thought about if/how ROM fits into things, but I wanted to see what your thoughts are.

joelvh commented 6 years ago

@zverok I've recreated commits for the remaining changes on this PR. It's two commits that encompass renaming the post-processor methods and extracting the response processor into a configurable class.

zverok commented 6 years ago

Let me know if you have any specific thoughts about the relationship between TLAW, pipelines and how to integrate multiple libraries together (e.g. TLAW for API client, interchangeable Hmor transproc for pre-built transformations, etc).

I have no clear idea, currently, mostly because I haven't experimented enough. The generic concerns are:

  1. How to make processing of data from Web APIs flexible. My initial guess (especially opinionated "flattening") was definitely too rigid; the "process this and that element" API is better, so your work to make it all more modular and reasonable is extremely valuable.
  2. I am not sure that a "separate TLAW class" is always the best approach; it again lacks flexibility. I recently thought about that exactly in the context of post-processing: with any non-trivial post-processing block, the nesting becomes ridiculous. It would probably be nice to have something like post_process :some_private_method_name.
  3. Also (in the context of Reality) I wondered whether some "TLAW inside, other interface outside" classes are possible, like
class SpecializedWeather
  def very_concrete_method
    search(city).weather(date) # that's private internal TLAW
      .yield_self { some very fancy wrapper }
  end

  private

  # all those methods become available to the instance yet private.
  tlaw do 
     ........
  end 
end

I am not sure it will be good, just an idea.
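A minimal sketch of how such a hypothetical tlaw class macro could work -- the block is evaluated in an anonymous module and the resulting methods are mixed in as private (the tlaw name and its block API are pure assumptions, not current TLAW):

```ruby
# Sketch: a class-level `tlaw` macro that evaluates its definition block
# in an anonymous module, then includes it with private visibility.
module TlawMacro
  def tlaw(&definition)
    mod = Module.new(&definition) # block is module_eval'd in the module
    include mod
    private(*mod.instance_methods(false))
  end
end

class SpecializedWeather
  extend TlawMacro

  def very_concrete_method
    search('Kharkiv') # private method defined via the tlaw block below
  end

  tlaw do
    def search(city)
      "results for #{city}" # stand-in for a real TLAW endpoint call
    end
  end
end

SpecializedWeather.new.very_concrete_method # => "results for Kharkiv"
# SpecializedWeather.new.search('x') would raise NoMethodError (private)
```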

As for ROM, I believe they approach the same problem here, but from a bit different angle. Theirs is mostly "apply the repository pattern to everything" (including HTTP, implementing something like a Repository-based ActiveResource); mine is mostly utilitarian, a small tool to be chained into bigger ones. Maybe the approaches can be joined somehow, IDK for sure :)

joelvh commented 6 years ago

Thanks for your thoughts @zverok - I think we are pretty aligned in our thinking. I am getting more into the ROM/Dry/Hanami world and will report back about where I see the fit.

The one thing that I have been thinking about a lot lately is the value of DSLs as a high-level interface to describe what you want your application to do, and making the implementation details configurable. I really like the Container/IoC patterns to make this even more flexible. Similar to how you describe specifying a method name for some custom transformation, that could very well refer to a key in a container -- because I definitely agree with you that post processing via nested blocks can become very unwieldy in non-trivial situations. To break it down, parts of it should probably be their own methods or classes that you compose together. Hence, I really am interested in looking at what can be done by developing a DSL to handle these complexities, which will then build the composition of the pipeline with your various implementation details behind the scenes.

I'll report back with what I find as I get more into those libraries and communities I mentioned.

In the meantime, what are your thoughts about merging this PR? (I had originally aliased methods to make the transformation methods backward-compatible. If you have concerns about the API changes, that is an option. My guess is that this could be a drop-in for you if you add back the aliases.)

zverok commented 6 years ago

Well, let's try this. I mean, if all the examples and docs show it clearly enough -- I don't mind getting more functional here :) I reviewed the examples where "mutating" post_process was used, and they can all probably be rewritten in a non-mutating style, probably with even more clarity:

# before
post_process { |e|
  e['coord'] = Geo::Coord.new(e['coord.lat'], e['coord.lon']) if e['coord.lat'] && e['coord.lon']
}
post_process('coord.lat') { nil }
post_process('coord.lon') { nil }

# after 
post_process { |response|
  next response unless response.key?('coord.lon') && response.key?('coord.lat')
  response.merge(
    'coord' => Geo::Coord.new(response.delete('coord.lat'), response.delete('coord.lon'))
  )
}
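For the record, the non-mutating version behaves as expected on a plain Hash. A runnable approximation (a Struct stands in for Geo::Coord, the response is an ordinary Hash rather than a TLAW response object, and a dup is added so the caller's hash stays untouched):

```ruby
Coord = Struct.new(:lat, :lng) # stand-in for Geo::Coord

merge_coord = lambda do |response|
  next response unless response.key?('coord.lon') && response.key?('coord.lat')
  response = response.dup # keep the caller's hash untouched
  response.merge(
    'coord' => Coord.new(response.delete('coord.lat'), response.delete('coord.lon'))
  )
end

input  = { 'coord.lat' => 49.98, 'coord.lon' => 36.25, 'name' => 'Kharkiv' }
result = merge_coord.call(input)
result['coord'].lat      # => 49.98
result.key?('coord.lat') # => false
input.key?('coord.lat')  # => true (original untouched)
```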

Then, probably the most readable API would be:

transform { |whole_response| should return next part }
transform(key) { |key, value| should return new key value }
transform_array(key) do
  transform { ... }
end
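The transform_array semantics could be as simple as this sketch (name taken from the proposal above; the standalone function shape is just for illustration):

```ruby
# Sketch of transform_array: apply a per-item transform to each element
# of the array under `key`, returning a new hash rather than mutating.
def transform_array(response, key, &per_item)
  response.merge(key => response.fetch(key, []).map(&per_item))
end

resp = { 'list' => [1, 2, 3], 'city' => 'Kharkiv' }
transform_array(resp, 'list') { |n| n * 10 }
# => { 'list' => [10, 20, 30], 'city' => 'Kharkiv' }
```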

About the last name: I'm still feeling a bit unsure about it, but neither transform_items, nor transform_each, nor transform_map seems to communicate "what it does" (what are "items"? "each" what? "map" to where?..). _array is dull yet straight.

I also thought about hm-style replacement:

transform('list', :*, 'sys.sunrise', &Time.method(:at))
transform('list', :*, 'weather.icon') { |i| "http://openweathermap.org/img/w/#{i}.png" }
...

...but on this "model" example (OpenWeatherMap) it doesn't bring any clarity at all (and "join coord.lon & coord.lat into coord" seems even more cumbersome).

Probably this part is subject to further investigations.

joelvh commented 6 years ago

Here are some initial thoughts I've been ruminating on, though I have not yet thought through all use cases.

If we consider a hierarchy of how the response is processed, maybe we can do something that I saw in a ROM video demo (but can't seem to find an example in their docs).

There was an example of two ways to define mappings. First was to use a block with the entity as an argument. The second was a block without an argument where you could call transformation methods that implicitly applied to the entity. The first was for more custom mapping (e.g. mutations or whatever you want) and returning a result. The second was to describe the transformations via the use of the transformation DSL. This is where I see either taking from ROM or implementing some aspect of ROM.

Anyways, maybe the DSL can be something like this, going with the implicit version we've been using in TLAW already:

# Hook into transformation process, responsible for returning the full result
transform do
  # This would map over whole response, assuming the whole response
  # is an array or just processes the whole response as a single item
  map do # item in the array is implicit
    rename old_key: :new_key
    # change value
    map(key) { |value| ... new value ... }
    # could have an option to convert to array if it's a single item
    map(key, ensure_array: true) { |value| ... new value ... }

    # Many more transformation helpers

  end
end

# Other custom approach
transform do |response, transformer|
  # do stuff to the response, maybe create a new object to copy resulting data to
  result = response # do stuff

  # use transformer to take advantage of transform DSL
  transformer.for result do
    # this is the same context as `transform` without arguments
  end
end
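For what it's worth, the implicit-context version could be prototyped with instance_eval collecting steps into a composed pipeline -- a rough sketch with only rename and map(key) stubbed in (all names here are placeholders, not a proposed final API):

```ruby
# Sketch: collect DSL calls into a list of hash -> hash steps, then
# apply them in order. Only `rename` and `map(key)` are implemented.
class Transformer
  def initialize(&block)
    @steps = []
    instance_eval(&block)
  end

  def rename(mapping)
    step = lambda do |h|
      mapping.reduce(h) do |acc, (old, new)|
        acc.key?(old) ? acc.dup.tap { |x| x[new] = x.delete(old) } : acc
      end
    end
    @steps << step
  end

  def map(key, &fn)
    @steps << ->(h) { h.key?(key) ? h.merge(key => fn.call(h[key])) : h }
  end

  def call(hash)
    @steps.reduce(hash) { |acc, step| step.call(acc) }
  end
end

t = Transformer.new do
  rename old_key: :new_key
  map(:new_key) { |v| v * 2 }
end

t.call(old_key: 21) # => { new_key: 42 }
```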

I realize this example isn't clear about what gets included or excluded in the resulting output (e.g. if keys you don't transform get added or get skipped in output, etc).

Thoughts on this approach, which describes the data hierarchy more?

joelvh commented 6 years ago

@zverok I added a second example with more customizability to the previous comment

zverok commented 6 years ago

Hey. Sorry for the late answer :( Mindful responding requires a lot of attention that I am currently short of. The thing is, I believe, that a good new or updated transformations DSL can hardly be "guessed". I committed some examples (I've had them for a long time in a gitignored folder) to examples/experimental, and with the preexisting examples/*.rb it is some material to experiment with.

What I can currently say is: in fact, a "rich DSL for hash transformations" is probably out of scope for TLAW: it is a road that can be walked very far, and nothing implies that "hash transformations" should be part of an "HTTP API wrapper" library. I started with an "opinionated" library which transformed everything, but we already discussed that this was probably a false move (unlike the endpoint definition API, which is flexible and doesn't assume anything about client code).

So, what I am thinking now in light of this idea is that there should probably be only one, ultimate interface:

endpoint ... do
   transform &anything
end

The only thing this part of the TLAW API defines is that you can specify some transformations -- just with any block-alike object. Now, this object can be produced by a transproc-alike library, or be some MyCustomTransformations class responding to to_proc, or just a big hairy proc doing everything.
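A toy illustration of the "custom transformations class" option (the class and its transformation logic are invented for the example; all `transform &` needs is an object responding to #to_proc):

```ruby
# `transform &obj` only requires obj to respond to #to_proc, since the
# & operator calls #to_proc to turn the object into a block.
class MyCustomTransformations
  def to_proc
    # symbolize keys and drop nil values -- arbitrary example logic
    ->(response) { response.transform_keys(&:to_sym).reject { |_, v| v.nil? } }
  end
end

# What `transform &MyCustomTransformations.new` would hand to TLAW:
block = MyCustomTransformations.new.to_proc
block.call('temp' => 280.3, 'pressure' => nil) # => { temp: 280.3 }
```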

For my own code, I am playing in my head with the idea of extending hm to produce proc objects, which would allow doing roughly this (OpenWeatherMap example):

TRANSFORM = 
  Hm()
    .transform_values('weather', &:first)
    .transform_values('dt', %w[sys sunrise], %w[sys sunset], &Time.method(:at))
    .except('dt_txt') # ... and so on; returns #to_proc-able object

# for singular value:
transform &TRANSFORM
# for values in a list: transform each of list items
transform &Hm().transform_values(['list', :*], &TRANSFORM)

# I can imagine different ways of data control in this approach, like...
Hm()
  .expect('weather', Array) # fails if there is no 'weather' key or pattern === value does not match

This way the next version of TLAW will lose those small nice opinionated "flattening" features, but I am almost sure that nobody has ever used or liked them, except my demo files :)
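Even without Hm, plain procs composed with Proc#>> (Ruby 2.6+) show the shape of it -- a rough approximation of "#to_proc-able steps", not Hm's actual API:

```ruby
# Two independent hash -> hash steps, composed left to right with >>.
first_weather = ->(h) { h.key?('weather') ? h.merge('weather' => h['weather'].first) : h }
drop_dt_txt   = ->(h) { h.reject { |k, _| k == 'dt_txt' } }

transform = first_weather >> drop_dt_txt
transform.call('weather' => ['clear'], 'dt_txt' => '2018-05-01 12:00')
# => { 'weather' => 'clear' }
```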

WDYT?..

joelvh commented 6 years ago

@zverok now we're talkin'! :-)

Yes, this is what I had in mind -- a generic way to "hook" into aspects of a pipeline. Specifically with TLAW, the transform { ... } block is a hook that can be defined by any kind of transformation library. It should be as generic as possible, to allow for implementing anything.

My thought is that you could specify a thing that responds to call and the result is passed to it (transform MyTransformer).

Alternatively, you can "register" transformations into some sort of registry and refer to them by name (transform :my_transformer or transform(with: :hm) { transform_values('weather', &:first) }).

There could be a way to register a transformer that can be instance_exec'd so that it becomes the context of the transform hook. It could be an interface we define for the transformers, or a wrapper, or something. But basically, I like the idea of some sort of block that has a specific behavior so you can call methods within some specific context.
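A registry along those lines might be as simple as a hash of callables resolved by name at request time (all names here are hypothetical, just to sketch the idea):

```ruby
# Hypothetical registry: named transformers stored as callables,
# resolved when an endpoint declares `transform :name`.
module TransformerRegistry
  REGISTRY = {}

  def self.register(name, callable = nil, &block)
    REGISTRY[name] = callable || block
  end

  def self.resolve(name)
    REGISTRY.fetch(name).to_proc # accepts procs or #to_proc-able objects
  end
end

TransformerRegistry.register(:symbolize) { |h| h.transform_keys(&:to_sym) }

# What `transform :symbolize` could resolve to internally:
TransformerRegistry.resolve(:symbolize).call('a' => 1) # => { a: 1 }
```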

This is very much what some aspects of ROM do, and ROM or DRY could be some components we use behind the scenes.

joelvh commented 6 years ago

Hey @zverok, wanna revisit getting this merged?

zverok commented 6 years ago

Hey @joelvh, sorry... I feel bad about this, honestly. At some point I somehow got distracted from this discussion. But from my perspective, "the ball is on your side here" (so, initially I thought you'd act somehow on the results of our discussion... and then I forgot to clarify it, sorry again). The PR and discussion have definitely got too big to merge "as is". WDYT about this way:

?