thbar closed this issue 4 years ago
@thbar When you work on this, would you mind adding some examples of how to leverage JRuby or Truffle? I've seen some of your talks/writings, and you and others seem to bring up JRuby at times. Personally, I don't have experience with it, or Java in general, and am hesitant to even bother. I imagine it can really boost performance in workloads that you can partition and process in parallel. I bet the community would really appreciate someone with experience talking about when to choose such a tool, and what it takes to use -- and in particular, what it could unlock for ETL processes. I know I would.
@ttilberg very quick response: at the moment, I do not use JRuby or TruffleRuby on current Kiba projects. I have used JRuby recently for interop stuff (e.g. a Kiba job which must tap into a Java API). I have just tried TruffleRuby (and provided feedback and a bug report), but it is at the moment much slower (5x slower) than MRI for the simple tests I tried it on (although I expect it will improve!). So MRI is really my preferred Kiba platform at the moment!
Also, for a lot of workloads, I've seen that there is often an ordering constraint in the target, or a constraint in the sources (e.g. pagination with cursors), which makes parallelisation hard to achieve one way or another (not because of Ruby, just because of constraints on sources/destinations).
This explains why I rarely refer to or document partitioning/processing in parallel at this point!
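To illustrate the cursor constraint mentioned above, here is a minimal sketch in plain Ruby. Everything in it is hypothetical (the in-memory `PAGES` data, `fetch_page`, `CursorSource`); it only relies on the fact that a Kiba source is a class with an `each` method that yields rows. The point is that each page fetch needs the cursor returned by the previous one, so the fetches cannot simply be split across workers.

```ruby
# Hypothetical in-memory stand-in for a paginated API:
# each page carries the cursor needed to request the next one.
PAGES = {
  nil => { rows: [1, 2], next_cursor: "a" },
  "a" => { rows: [3, 4], next_cursor: "b" },
  "b" => { rows: [5],    next_cursor: nil }
}.freeze

# Hypothetical fetch call; a real one would hit a remote API.
def fetch_page(cursor)
  PAGES.fetch(cursor)
end

# Kiba-style source: a class with an `each` method that yields rows.
class CursorSource
  def each
    cursor = nil
    loop do
      page = fetch_page(cursor)
      page[:rows].each { |row| yield row }
      # The next request cannot start before this cursor is known.
      cursor = page[:next_cursor]
      break if cursor.nil?
    end
  end
end

rows = []
CursorSource.new.each { |row| rows << row }
```

Here the rows come out strictly in page order, one request after another, whatever Ruby implementation runs the loop.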
That said:
parallel_transform, which is useful not for source/destination parallelisation, but for running slow IO in parallel, such as 1 HTTP request per row, etc.
Thanks for the note. I have no concrete case, I just keep seeing people talk about JRuby for performance, and have a bit of FOMO. But it sounds like I shouldn't. Thanks!
@ttilberg unless you're hitting real problems, I would not worry too much - and there are solutions with MRI too!
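As one illustration of what can be done on MRI for slow per-row IO, here is a minimal sketch using plain Ruby threads. It is not Kiba's `parallel_transform` and says nothing about how that is implemented; `slow_lookup` and `parallel_map` are hypothetical names, and `sleep` stands in for a slow call such as an HTTP request. It works on MRI because blocking calls like `sleep` and network IO let other threads run in the meantime.

```ruby
# Hypothetical stand-in for a slow call such as one HTTP request per row.
def slow_lookup(row)
  sleep 0.05
  row.merge(enriched: row[:id] * 10)
end

# Hypothetical helper: applies the block to each row using a small
# pool of threads, and returns the results in the original row order.
def parallel_map(rows, max_threads: 4, &block)
  queue = Queue.new
  rows.each_with_index { |row, index| queue << [row, index] }
  queue.close # once empty, a closed queue returns nil from pop

  results = Array.new(rows.size)
  workers = Array.new(max_threads) do
    Thread.new do
      while (item = queue.pop)
        row, index = item
        # Each thread writes to its own slot, so order is preserved.
        results[index] = block.call(row)
      end
    end
  end
  workers.each(&:join)
  results
end

rows = [{ id: 1 }, { id: 2 }, { id: 3 }]
enriched = parallel_map(rows, max_threads: 2) { |row| slow_lookup(row) }
```

Note that this only helps when the per-row work is IO-bound and the whole batch of rows can be held in memory; it does not lift the source/destination constraints discussed earlier.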
If you’re after someone to proofread/beta-test the new documentation I’d be happy to help. I’ve only used Kiba once in a small project but I loved the simplicity and power it offered, and I can help bring more of a newcomer's perspective if that would be helpful.
@matt17r Funny, I was just thinking something similar. Related to your comment, just today I was browsing /r/etl and saw a post that made me think that I should do a better job at advocating for Ruby and Kiba in these areas. Inspired by the /r/etl post (but not really related to it), I wrote a somewhat disjointed blog post detailing my favorite Kiba moves that led me to thinking: "I should really do a better job at collecting these thoughts and try to create materials for new people."
I think the simplicity of Kiba pipelines is surprising and approachable even for people who don't use Ruby. Some folks I work with who are not fluent with Ruby have used it and loved it for some data deliveries. I'd love to see Ruby and Kiba gain more traction as a viable tool.
@matt17r thanks for your offer! I will definitely let you know (here) when I work on this. The newcomer's perspective is definitely important!
@ttilberg I read /r/etl regularly and often think the same. Ruby + Kiba is definitely gaining traction, and I'm currently making significant changes to how my business is organised (including "making time") to ensure I'll be able to promote it more & work on new releases for both the OSS & commercial parts.
Thanks for your support!
(If you read this: please do not work on this, it is already in the works; this issue is here to communicate with Kiba users.)
The patterns of use & recommended implementation guidelines have evolved quite a bit since Kiba v1 was released 4 years ago.
I'm rewriting the documentation from scratch to ensure we have a better newcomer experience and we can encourage the patterns that I've seen work in production.