thbar / kiba

Data processing & ETL framework for Ruby
https://www.kiba-etl.org

Documentation rewrite #77

Closed thbar closed 4 years ago

thbar commented 5 years ago

(If you read this: please do not work on this, as it is already in the works. This issue is here to communicate with Kiba users.)

The patterns of use & recommended implementation guidelines have evolved quite a bit since Kiba v1 was released 4 years ago.

I'm rewriting the documentation from scratch to ensure we have a better newcomer experience and we can encourage the patterns that I've seen work in production.

ttilberg commented 5 years ago

@thbar When you work on this, would you mind adding some examples of how to leverage JRuby or Truffle? I've seen some of your talks/writings, and you and others seem to bring up JRuby at times. Personally, I don't have experience with it, or Java in general, and am hesitant to even bother. I imagine it can really boost performance in workloads that you can partition and process in parallel. I bet the community would really appreciate someone with experience talking about when to choose such a tool, and what it takes to use -- and in particular, what it could unlock for ETL processes. I know I would.

thbar commented 5 years ago

@ttilberg very quick response: at the moment, I use neither JRuby nor TruffleRuby on current Kiba projects. I have used JRuby recently for interop work (e.g. a Kiba job which must tap into a Java API). I have just tried TruffleRuby (and provided feedback and a bug report), but at the moment it is much, much slower (5x slower) than MRI for the simple tests I tried it on (although I expect it will improve!). So MRI is really my preferred Kiba platform at the moment!

Also, for a lot of workloads, I've seen that there is often an ordering constraint in the target, or a constraint in the sources (e.g. pagination with cursors), which makes parallelisation hard to achieve one way or another (not because of Ruby, but because of constraints on the sources/destinations).

This explains why I rarely refer to or document partitioning/parallel processing at this point!
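
To make the cursor constraint concrete, here is a minimal sketch in plain Ruby. `FakePagedApi`, its `fetch` method, and the page data are hypothetical stand-ins, not a real API; `CursorSource` only assumes the Kiba source convention of a class whose `each` yields rows. Each request needs the cursor returned by the previous one, so the pages cannot be fetched in parallel:

```ruby
# Hypothetical stand-in for a remote API that paginates with cursors.
# Each response carries the cursor needed to request the next page.
class FakePagedApi
  PAGES = {
    nil  => { rows: [{ id: 1 }, { id: 2 }], next_cursor: "c1" },
    "c1" => { rows: [{ id: 3 }, { id: 4 }], next_cursor: "c2" },
    "c2" => { rows: [{ id: 5 }],            next_cursor: nil }
  }.freeze

  def fetch(cursor)
    PAGES.fetch(cursor)
  end
end

# Follows the Kiba source convention: a class whose #each yields rows.
class CursorSource
  def initialize(api:)
    @api = api
  end

  def each
    cursor = nil
    loop do
      # The next page is unknown until the current response arrives,
      # so the fetches are inherently sequential.
      page = @api.fetch(cursor)
      page[:rows].each { |row| yield row }
      cursor = page[:next_cursor]
      break if cursor.nil?
    end
  end
end

rows = []
CursorSource.new(api: FakePagedApi.new).each { |row| rows << row }
```

The sequential loop here is imposed by the source, not by the Ruby runtime, so switching interpreters would not remove it.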

That said:

ttilberg commented 5 years ago

Thanks for the note. I have no concrete case; I just keep seeing people talk about JRuby for performance, and have a bit of FOMO. But it sounds like I shouldn't. Thanks!

thbar commented 5 years ago

@ttilberg unless you're hitting real problems, I would not worry too much - and there are solutions with MRI too!

matt17r commented 5 years ago

If you’re after someone to proofread/beta-test the new documentation, I’d be happy to help. I’ve only used it once in a small project, but I loved the simplicity and power it offered, and I can help bring more of a newcomer’s perspective if that would be helpful.

ttilberg commented 5 years ago

@matt17r Funny, I was just thinking something similar. Related to your comment, just today I was browsing /r/etl and saw a post that made me think that I should do a better job of advocating for Ruby and Kiba in these areas. Inspired by the /r/etl post (but not really related to it), I wrote a somewhat disjointed blog post detailing my favorite Kiba moves, which led me to think: "I should really do a better job of collecting these thoughts and try to create materials for new people."

I think the simplicity of Kiba pipelines is surprising and approachable even for people who don't use Ruby. Some folks I work with who are not fluent with Ruby have used it and loved it for some data deliveries. I'd love to see Ruby and Kiba gain more traction as a viable tool.

thbar commented 5 years ago

@matt17r thanks for your offer! I will definitely let you know (here) when I work on this. The newcomer's perspective is definitely important!

@ttilberg I read /r/etl regularly and often think the same. Ruby + Kiba is definitely gaining traction, and I'm currently making significant changes to my business (including "making time") to ensure I'll be able to promote it more and work on new releases for both the OSS and commercial parts.

Thanks for your support!

thbar commented 4 years ago

v3.0.0 is out, and I rewrote the wiki mostly from scratch to provide updated recommendations. Closing.