Open joelvh opened 6 years ago
@zverok this is mainly a placeholder PR to get some feedback on the direction of things. This is primarily meant to keep things backward-compatible (except for classes where the namespace was changed, but should not affect users of the DSL).
@zverok your comments have been addressed except for 2 I left comments for.
The `Processors` namespace is mostly a placeholder right now.
Regarding your reasoning, I think it all makes sense. In terms of my goals, I would want to keep the current TLAW behavior and DSL intact. However, the underlying architecture would become more modular and focused on building pipelines (or middleware) to handle processing of HTTP responses. Ideally, the DSL can have the `process`/`post_process` helper methods to build up the ETL-like transformations, so out-of-the-box, TLAW has most of what people want for simple situations.
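To make the pipeline idea concrete, here is a hedged sketch (not TLAW's actual internals; `FakeEndpoint` is an invented name) of how `post_process` steps like the ones discussed in this thread could accumulate into a pipeline, using the two call forms that appear in the examples later in this discussion:

```ruby
class FakeEndpoint
  def initialize
    @steps = []
  end

  # post_process { |whole_response| ... }  or  post_process('key') { |value| ... }
  def post_process(key = nil, &block)
    @steps <<
      if key
        # keyed form: replace only that key's value
        ->(h) { h.key?(key) ? h.merge(key => block.call(h[key])) : h }
      else
        # whole-response form: block returns the next version of the response
        ->(h) { block.call(h) }
      end
  end

  def call(response)
    @steps.reduce(response) { |h, step| step.call(h) }
  end
end

ep = FakeEndpoint.new
ep.post_process('dt', &Time.method(:at)) # parse a Unix timestamp into a Time
ep.call('dt' => 0, 'name' => 'Kharkiv')  # 'dt' becomes Time.at(0)
```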
The additions I'm envisioning are not far from what is already there. However, I would want to achieve:

- Using an `Enumerator` to stream the data to processors while downloading (this is a goal, but should not be imposed as part of the TLAW default response processor if deemed out-of-scope)
- I'm developing an "abstract" ETL pipeline project that can encapsulate the workflows and use existing gems under-the-hood to execute parts of the process (e.g. TLAW can be used to create an API client to stream data into an ETL pipeline)
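The streaming goal can be sketched with a plain `Enumerator` that yields parsed items as chunks arrive, instead of buffering the whole body. This is an illustration only (`stream_items` is invented, and the line-delimited JSON framing is an assumption for the sketch):

```ruby
require 'json'

# Yields parsed items lazily as raw body chunks arrive.
def stream_items(chunks)
  Enumerator.new do |yielder|
    buffer = +''
    chunks.each do |chunk|
      buffer << chunk
      # naive newline-delimited JSON framing, just for the sketch
      while (newline = buffer.index("\n"))
        line = buffer.slice!(0..newline).strip
        yielder << JSON.parse(line) unless line.empty?
      end
    end
  end
end

chunks = ["{\"id\":1}\n{\"id\":", "2}\n"] # item split across two chunks
stream_items(chunks).map { |item| item['id'] } # => [1, 2]
```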
So all that to say that the current TLAW behavior would ideally stay intact or if it makes sense, some default behavior may evolve.
However, the underlying architecture would become more modular and focused on building pipelines (or middleware) to handle processing of HTTP responses.
Yes, that's one of my goals (that I've never had time to fully embrace for this project). So, let's go with it :+1:
BTW, as you've mentioned it, I, in fact, developed Hm somewhat with TLAW in mind, so probably (on some further steps) the possibility could be investigated to do things in "Hm-like" style:
```ruby
transform('items', :*, 'date', &Date.method(:parse))
# instead of
transform_items('items') do
  transform 'date', &Date.method(:parse)
end
```
(The former now looks "cleaner" to me, though it contradicts the open_weather_map example, with `WEATHER_PROCESSOR` reused on several levels.)
But that's for the future. (Also, `transform_items` maybe should be `transform_each`?.. Looks more Ruby; I don't remember why I wanted the `_items` suffix in the first place.)
PS: On a totally unrelated note, yesterday evening a younger colleague asked for help with some Magento integration, and we struggled for a few hours with rough SOAP/Savon, and then I googled for a ready solution and suddenly found magento-rb, and everything worked fine just out-of-the-box, and then I was like "Hey, I know this guy!" So, thanks, I guess :)
@zverok Ha! What a funny coincidence that you guys found that Magento project. Glad it was of help!
Regarding multiple PRs: I tried to group updates into commits in a meaningful way, so I would initially recommend looking through the commits as a means to understand the changes. If that isn't possible, I'll look into splitting this into multiple PRs, if that's OK.
Regarding transformations: Yes, Hm was the project, and as a matter of fact, it looks like you just started that very recently. Solnic's Transproc is the other one, and both of these have lots of common transformations built-in. My goal is to develop a set of gems that can help with solving various use cases, that can all be used together. Primarily, the goal is to provide DSLs to construct ETL workflows or data migrations, report generation, and other things in an abstract way that easily lets you use the tools you want to use for things like transformations. (For instance, take the TLAW DSL for defining the workflow and transformations. The transformations could be performed by Hm, Transproc, or both.) A bit ambitious, I realize, but that's the goal. We can talk more about this if you're interested... for now I'm using my reporting gem as a workspace for these concepts, and will abstract out various aspects of that project into individual gems for composition for individual use cases (eventually).
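The "interchangeable transformations" goal above can be sketched simply: if a pipeline only requires call-ables, its steps can come from Hm, Transproc, or plain procs interchangeably. `Pipeline` here is an invented name for illustration, not an API from any of the gems mentioned:

```ruby
class Pipeline
  def initialize(*steps)
    @steps = steps
  end

  # Thread the input through each step in order.
  def call(input)
    @steps.reduce(input) { |data, step| step.call(data) }
  end
end

symbolize = ->(h) { h.transform_keys(&:to_sym) }  # could come from transproc
stringify = ->(h) { h.transform_values(&:to_s) }  # could come from Hm

Pipeline.new(symbolize, stringify).call('id' => 1) # => {:id=>"1"}
```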
Thanks for reviewing things!
@zverok I should note that the initial abstraction in reporting builds on top of Kiba, but Kiba is not a direct dependency. The Pipeline DSL allows for constructing the transformations, and then the job can be run by Kiba (or another framework).
Some examples in the project to illustrate the ideas:
I like how you developed a DSL object model that wraps around the domain object model. I will explore that approach a bit more in order to not mix helper methods with the data model. (As you will see in some of the changes I made to the transforms in TLAW, I moved what I would consider helpers to the DSL namespace in this manner.)
Thanks!
Hey, sorry for being silent for a week, life and work took too much of my attention.
I generally like the notion of "pipelines", and lately I, too, have come to think that TLAW should fit into some pipelines (therefore, requiring to inherit from it is probably not the best possible convention). My main use case (which led me to "inventing" the library) is the reality project (it is work-in-progress; here is the latest presentation from RubyConf India). It is basically made of "describers" (high-level homogenous wrappers) for external data services. A describer can also be seen as a pipeline (e.g. fetch Wikipedia page → parse it to wiki-AST → fetch some semantic information → wrap it into high-level value objects like "Temperature" or "Geographical coordinate").
TLAW usage there can be seen, for example, here. I have mixed feelings about how it fits in: on the one hand, it was really easy to describe the data source I wanted; on the other, I needed an "internal" class to descend from TLAW, while the ability to somehow "include" the DSL into the pipeline would probably be better.
I was out of town and got a chance to look at your links now. I saw Reality when I looked through your projects a while back and thought it was very interesting. Your slides are also helpful to walk through your thinking. I have some thoughts around similar ideas of what can be built on top of TLAW and pipelines.
Reality is designed more around entities that can be used across data sources, which is neat, but also means that each data source needs various higher-level translations to make them work together (presumably your "describers"). I will give more thought to this, to how it can be done more generically, and to whether it's in the realm of where I've been thinking of going. However, the way you are using TLAW as the API client and building on it is exactly what I like.
As I think about data sources, transformations and persistence (more related to ETL), I also think a little bit about making it more event-oriented and potentially making some parts async. However, I think those are higher-level pipelines than we are currently talking about.
I will play more with TLAW concepts and how some of my goals in the previous comment fit together as I get my head around this again. It sounds like you are open to determining TLAW's place in a system of pipelines.
Let me know if you have any specific thoughts about the relationship between TLAW, pipelines, and how to integrate multiple libraries together (e.g. TLAW for the API client, interchangeable Hm or transproc for pre-built transformations, etc).
@zverok I've been looking more into ROM.rb, and it looks like its HTTP adapter does what we are doing with request and response handling. There is a ROM HTTP adapter, and I also found a Faraday-based ROM HTTP adapter example.
ROM.rb is also very functional and pipeline-based. It seems like it could be a good foundation for ETL as I've been thinking about it, and TLAW would be great for customizing the HTTP client/adapter. I'm going to give more thought to if/how ROM fits into things, but I wanted to see what your thoughts are.
@zverok I've recreated commits for the remaining changes on this PR. It's two commits that encompass renaming the post-processor methods and extracting the response processor into a configurable class.
Let me know if you have any specific thoughts about the relationship between TLAW, pipelines and how to integrate multiple libraries together (e.g. TLAW for API client, interchangeable Hm or transproc for pre-built transformations, etc).
I have no clear idea currently, mostly because I haven't experimented enough. The generic concerns are:

```ruby
post_process :some_private_method_name
```

```ruby
class SpecializedWeather
  def very_concrete_method
    search(city).weather(date) # that's private internal TLAW
      .yield_self { some very fancy wrapper }
  end

  private

  # all those methods become available to the instance, yet private
  tlaw do
    ........
  end
end
```
I am not sure it will be good, just an idea.
As for ROM, I believe they approach the same problem, but from a bit different angle. Theirs is mostly "apply the repository pattern to everything" (including HTTP, implementing something like a Repository-based ActiveResource); mine is mostly utilitarian: a small tool to be chainable into bigger ones. Maybe the approaches can be joined somehow, IDK for sure :)
Thanks for your thoughts @zverok - I think we are pretty aligned in our thinking. I am getting more into the ROM/Dry/Hanami world and will report back about what I see the fit as.
The one thing that I have been thinking about a lot lately is the value of DSLs as a high-level interface to describe what you want your application to do, and making the implementation details configurable. I really like the Container/IoC patterns to make this even more flexible. Similar to how you describe specifying a method name for some custom transformation, that could very well refer to a key in a container -- because I definitely agree with you that post processing via nested blocks can become very unwieldy in non-trivial situations. To break it down, parts of it should probably be their own methods or classes that you compose together. Hence, I really am interested in looking at what can be done by developing a DSL to handle these complexities, which will then build the composition of the pipeline with your various implementation details behind the scenes.
I'll report back with what I find as I get more into those libraries and communities I mentioned.
In the meantime, what are your thoughts about merging this PR? (I had originally aliased methods to make the transformation methods backward-compatible. If you have concerns about the API changes, that is an option. My guess is that this could be a drop-in for you if you add back the aliases.)
Well, let's try this. I mean, if all the examples and docs show it clearly enough -- I don't mind getting more functional here :) I reviewed the examples where "mutating" `post_process` was used, and they all can probably be rewritten in a non-mutating style, probably with even more clarity:
```ruby
# before
post_process { |e|
  e['coord'] = Geo::Coord.new(e['coord.lat'], e['coord.lon']) if e['coord.lat'] && e['coord.lon']
}
post_process('coord.lat') { nil }
post_process('coord.lon') { nil }

# after
post_process { |response|
  next response unless response.key?('coord.lon') && response.key?('coord.lat')
  response.merge(
    'coord' => Geo::Coord.new(response.delete('coord.lat'), response.delete('coord.lon'))
  )
}
```
Then, probably the most readable API would be:

```ruby
transform { |whole_response| ... }  # should return the next part
transform(key) { |value| ... }      # should return the new key value
transform_array(key) do
  transform { ... }
end
```
About the last name: I'm still feeling a bit unsure about it, but neither `transform_items`, nor `transform_each`, nor `transform_map` seems to communicate "what it does" (what are "items"? "each" what? "map" to where?..). `_array` is dull yet straight.
I also thought about an `hm`-style replacement:

```ruby
transform('list', :*, 'sys.sunrise', &Time.method(:at))
transform('list', :*, 'weather.icon') { |i| "http://openweathermap.org/img/w/#{i}.png" }
...
```
...but on this "model" example (OpenWeatherMap) it doesn't bring any clarity at all (and "join coord.lon & coord.lat into coord" seems even more cumbersome).
Probably this part is subject to further investigations.
Here are some initial thoughts I had ruminating, but have not yet thought through all use cases.
If we consider a hierarchy of how the response is processed, maybe we can do something that I saw in a ROM video demo (but can't seem to find an example of in their docs).
There was an example of two ways to define mappings. The first was to use a block with the entity as an argument; the second was a block without an argument where you could call transformation methods that implicitly applied to the entity. The first was for more custom mapping (e.g. mutations or whatever you want), returning a result. The second was to describe the transformations via the transformation DSL. This is where I see either taking from ROM or implementing some aspect of ROM.
Anyway, maybe the DSL can be something like this, going with the implicit version we've been using in TLAW already:

```ruby
# Hook into transformation process, responsible for returning the full result
transform do
  # This would map over the whole response, assuming the whole response
  # is an array, or just process the whole response as a single item
  map do # item in the array is implicit
    rename old_key: :new_key
    # change value
    map(key) { |value| ... new value ... }
    # could have an option to convert to array if it's a single item
    map(key, ensure_array: true) { |value| ... new value ... }
    # Many more transformation helpers
  end
end

# Other custom approach
transform do |response, transformer|
  # do stuff to the response, maybe create a new object to copy resulting data to
  result = response # do stuff
  # use transformer to take advantage of the transform DSL
  transformer.for result do
    # this is the same context as `transform` without arguments
  end
end
```
I realize this example isn't clear about what gets included or excluded in the resulting output (e.g. if keys you don't transform get added or get skipped in output, etc).
Thoughts on this approach, which describes the data hierarchy more?
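As a hedged sketch, the argument-less block style described above could be wired up by `instance_exec`'ing the block against a collector object. All names here (`TransformDSL`, `rename`, `map`) are illustrative, not TLAW's or ROM's actual API:

```ruby
class TransformDSL
  def initialize
    @steps = []
  end

  # rename old_key: :new_key (non-mutating)
  def rename(mapping)
    @steps << lambda do |h|
      mapping.reduce(h) do |acc, (old_key, new_key)|
        next acc unless acc.key?(old_key)
        rest = acc.dup
        rest[new_key] = rest.delete(old_key)
        rest
      end
    end
  end

  # map(key) { |value| ...new value... }
  def map(key, &block)
    @steps << ->(h) { h.key?(key) ? h.merge(key => block.call(h[key])) : h }
  end

  def call(input)
    @steps.reduce(input) { |data, step| step.call(data) }
  end
end

def transform(&block)
  dsl = TransformDSL.new
  dsl.instance_exec(&block) # transformation methods apply implicitly
  dsl
end

pipeline = transform do
  rename old_key: :new_key
  map(:new_key) { |v| v * 2 }
end
pipeline.call(old_key: 21) # => {:new_key=>42}
```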
@zverok I added a second example with more customizability to the previous comment
Hey. Sorry for the late answer :( Mindful responding requires a lot of attention that I am currently short of. The thing is, I believe, that a good new or updated transformations DSL can hardly be "guessed". I committed some examples (I've had them for a long time in a gitignored folder) to `examples/experimental`, and together with the preexisting `examples/*.rb` it is some material to experiment with.
What I can currently say is this: a "rich DSL for hash transformations" is probably out of the scope of TLAW. It is a road that can be walked very far, and in fact, nothing implies that "hash transformations" should be part of an "HTTP API wrapper" library. I started with an "opinionated" library which transformed everything, but we already discussed that it was probably a false move (unlike the endpoint definition API, which is flexible and does not assume anything about client code).
So, what I am thinking now, in light of this idea, is that there should probably be only one ultimate interface:

```ruby
endpoint ... do
  transform &anything
end
```
The only thing this part of the TLAW API defines is that you can specify some transformations -- just with any block-alike object. Now, this object can be produced by a `transproc`-alike library, or be some `MyCustomTransformations` class responding to `to_proc`, or just a big hairy proc doing everything.
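The "any block-alike object" contract can be illustrated like this (names invented for the sketch; the `transform` method is a stand-in for the endpoint DSL): anything responding to `#to_proc` works, whether hand-rolled or produced by a transproc-like library.

```ruby
class MyCustomTransformations
  def to_proc
    # the whole transformation, as one proc over the response hash
    ->(response) { response.merge('processed' => true) }
  end
end

# stand-in for the endpoint DSL method: just captures the block
def transform(&block)
  block
end

# &obj invokes obj.to_proc, so a custom class plugs in transparently
transform(&MyCustomTransformations.new).call('id' => 1)
# => {"id"=>1, "processed"=>true}
```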
For my own code, I am playing in my head with the idea of extending `hm` to produce proc objects; then it would allow doing roughly this (`OpenWeatherMap` example):
```ruby
TRANSFORM =
  Hm()
    .transform_values('weather', &:first)
    .transform_values('dt', %w[sys sunrise], %w[sys sunset], &Time.method(:at))
    .except('dt_txt') # ... and so on; returns a #to_proc-able object

# for a singular value:
transform &TRANSFORM

# for values in a list: transform each of the list items
transform &Hm().transform_values(['list', :*], &TRANSFORM)

# I can imagine different ways of data control in this approach, like...
Hm()
  .expect('weather', Array) # fails if there is no 'weather' key, or pattern === value does not match
```
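A hedged sketch of that "chainable, `#to_proc`-able hm" idea: each builder call returns `self`, accumulating steps, and `#to_proc` folds them into one proc. `HmLike` and its method signatures are invented for illustration and differ from the real hm gem:

```ruby
class HmLike
  def initialize
    @steps = []
  end

  def transform_values(*keys, &block)
    @steps << ->(h) { keys.reduce(h) { |acc, k| acc.key?(k) ? acc.merge(k => block.call(acc[k])) : acc } }
    self
  end

  def except(*keys)
    @steps << ->(h) { h.reject { |k, _| keys.include?(k) } }
    self
  end

  # data control: fail unless the key exists and matches the pattern
  def expect(key, pattern)
    @steps << lambda do |h|
      raise KeyError, "#{key} missing or invalid" unless h.key?(key) && pattern === h[key]
      h
    end
    self
  end

  def to_proc
    ->(h) { @steps.reduce(h) { |acc, step| step.call(acc) } }
  end
end

t = HmLike.new.transform_values('dt', &Time.method(:at)).except('dt_txt')
t.to_proc.call('dt' => 0, 'dt_txt' => 'x') # 'dt' parsed into a Time, 'dt_txt' dropped
```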
This way, the next version of TLAW will lose those small, nice, opinionated "flattening" features, but I am almost sure that nobody has ever used or liked them, except my demo files :)
WDYT?..
@zverok now we're talkin'! :-)
Yes, this is what I had in mind -- a generic way to "hook" into aspects of a pipeline. Specifically with TLAW, the `transform { ... }` block is a hook that can be defined by any kind of transformation library. It should be as generic as possible, to allow for implementing anything.
My thought is that you could specify a thing that responds to `call` and the result is passed to it (`transform MyTransformer`).
Alternatively, you could "register" transformations into some sort of registry and refer to them by name (`transform :my_transformer` or `transform(with: :hm) { transform_values('weather', &:first) }`).
There could be a way to register a transformer that can be `instance_exec`'d so that it becomes the context of the `transform` hook. It could be an interface we define for the transformers, or a wrapper, or something. But basically, I like the idea of some sort of block that has a specific behavior so you can call methods within some specific context.
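The registry idea could be sketched like this (all names hypothetical, and the registry is a bare hash for illustration): transformations registered under a symbol and looked up by `transform :name`, while call-ables still pass through unchanged.

```ruby
TRANSFORMERS = {}

def register_transformer(name, &block)
  TRANSFORMERS[name] = block
end

def transform(name_or_callable)
  # in the real DSL this would be appended to the endpoint's pipeline
  name_or_callable.is_a?(Symbol) ? TRANSFORMERS.fetch(name_or_callable) : name_or_callable
end

register_transformer(:upcase_keys) { |h| h.transform_keys(&:upcase) }

transform(:upcase_keys).call('id' => 1) # => {"ID"=>1}
```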
This is very much what some aspects of ROM do, and ROM or DRY could be some components we use behind the scenes.
Hey @zverok, wanna revisit getting this merged?
Hey @joelvh, sorry... I feel bad about this, honestly. At some point I somehow got distracted from this discussion. But from my perspective, "the ball is on your side" here (so, initially I thought you'd act somehow on the result of our discussion... and then I forgot to clarify it, sorry again). The PR and discussion have definitely gotten too big to merge "as is". WDYT about this way: extract the `transform` method & optional Processors (flattener, data tabler) to see how they could be chained into the `transform`?
The goal is to extract all processing of the response in order to implement ETL-like post processing and streaming of HTTP requests.
Changes:

- `Response` object
- `TLAW::Processors::Base` with basic functionality
- `TLAW::Processors::ResponseProcessor` with response body parsing into JSON or XML without `DataTable` flattening
- `TLAW::Processors::DataTableResponseProcessor`, which builds on `ResponseProcessor` and flattens data into `DataTable`
- `response_processor`
- `DSL::Transforms` namespace
- Renamed `process*` methods to `transform*`, with method aliases for backwards-compatibility
- Replaced `process_replace` with `transform(replace: true)`
- `TLAW::Params` namespace
- `Object#yield_self` for older versions of Ruby (in place of `Object#derp`)