Open albertols opened 1 year ago
IMDG = ?
BaseAsyncLookupDoFn
has a type parameter for some cache implementation, would that suffice?
IMDG = ?
BaseAsyncLookupDoFn
has a type parameter for some cache implementation, would that suffice?
Thanks @kellen, I was checking it out too ;) , really interesting indeed, what I meant by IMDG, it is that we are using a Hazelcast cluster on GKE (with some mircoservices), and now prototyping in SCIO a similar data processing app.
I am going through https://github.com/spotify/scio/blob/0eece133a773e7ff85fc9e7fdcad1bfc3593fc1d/scio-test/src/test/scala/com/spotify/scio/transforms/AsyncLookupDoFnTest.scala#L273C33-L273C46, based on BaseAsyncLookupDoFn
I am trying to figure out an implementation: I do not see many examples but https://github.com/spotify/scio/blob/0eece133a773e7ff85fc9e7fdcad1bfc3593fc1d/scio-test/src/test/scala/com/spotify/scio/testing/JobTestTest.scala#L1122, the only concern here would be the sacalabilty, I guess the DataFlow Vertical autoscaling might help.
I was also wondering TTL for each Element, (there would be some uses cases when eviction comes about could be released, e.g: PubSub; with Timers and @OnTime we could have some interesting control / output for each Element.
Cheers
I still don't think I know what "IMDG" means here.
You are correct that BaseAsyncLookupDoFn
could use a better example. You can plug in whatever cache supplier you want, e.g. a com.google.common.cache .Cache
, and have that handle TTL etc
IMDG stands for In-Memory Data Grid
Hi Guys!
I was able to have a functional State & Timer, for duplicates control; we could have "in memory/or with a low latency" duplicate checks of +1M events (just when using GlobalWindows for PubSub events, otherwise I have not been able to keep state among calls with FixedWindows and no Triggers):
thus, we would have not to deal with a cache supplier (+1M it could lead to memory issues) and it could be easier to scale (Vertical autoscale it is only available in DataFlow Prime)
Thanks for reviewing!
P.S.1: OK duplicates flow
P.S.2: I am curently testing in DataFlow with multiple workers
P.S.3: I am facing some race conditions when records are mocked at the same time (e.g: scio-8000000246 and beam-8000000246 ):
it could be elminated with distinct here distinctBykeyUrl but Trigger must be applied: https://github.com/albertols/scio-db/blob/develop/src/main/scala/com/db/myproject/async/SCIOAsyncService.scala#L30
regarding this potential enhancement S & T for BaseAsyncDoFn, this has been published last week https://medium.com/@serna.alberto.eng/avoid-http-requests-duplicates-in-apache-beam-with-scio-a-custom-baseasyncdofn-and-state-and-2c7d63059ab3
hope it helps @kellen , @RustedBones
Feature Request, motivated by BaseAsyncDoFn and KV lookups to avoid duplicates (instead of using Redis, BigTable, Hazelcast IMDG, etc) https://github.com/albertols/scio-db/blob/develop/src/main/scala/com/db/myproject/async/http/state/StateBaseAsyncDoFn.java
protected abstract boolean alreadySent(@StateId("buffer") MapState<InputT, OutputT> buffer, InputT element);
and this should be also abstract: https://github.com/albertols/scio-db/blob/develop/src/main/scala/com/db/myproject/async/http/state/StateBaseAsyncDoFn.java#L166
protected abstract void addIdempotentElementInBuffer(MapState<InputT, OutputT> buffer, InputT input, OutputT output);
NOTE: GlobalWindow is needed (before applying
applyTransform(ParDo.of(new StateAsyncParDoWithAkka(mediationConfig))).map { m =>)
, in order to keepthe state amomng processElementsHappy to hear other alternatives, we are trying to cut IMDG and extenral idempotent lookups? it would be really cool 👍🏼
Thanks SCio Team!!