Repro project
Thank you for reporting this!
I've managed to reduce it further.
```
(m/? (m/reduce conj
       (m/ap (m/?> (val (m/?> 2 (->> (m/seed (range))
                                     (m/eduction (map (fn [x] (assert (< x 2)) x)))
                                     (m/group-by identity))))))))
```
This doesn't terminate; it should fail immediately. This is a bug in either `group-by` or `ap`, related to input failure.
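To see that the input itself crashes right away, here is the input flow in isolation (a minimal sketch using the same eduction as the repro above):

```
(require '[missionary.core :as m])

;; The input alone crashes as soon as it produces 2, so the full
;; pipeline above should fail just as promptly.
(m/? (m/reduce conj (->> (m/seed (range))
                         (m/eduction (map (fn [x] (assert (< x 2)) x))))))
;; => throws AssertionError: Assert failed: (< x 2)
```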
You're very welcome, Leo. Thank you for looking into this so quickly.
Fantastic news that you were able to isolate the issue.
OK, I understand the problem completely now.
First, here is a workaround for your use case - increase maximum parallelism to 3:
```
(defn fetch-missing-items [cache-response-flow]
  (m/ap (let [[hit? flow] (m/?> 3 (m/group-by #(contains? % :result) cache-response-flow))]
          (m/?> (if hit?
                  flow
                  (->> flow
                       (m/eduction (map :key) (partition-all 100))
                       fetch-item-details
                       (m/eduction cat (partition-all 25))
                       store-item-details
                       (m/eduction cat)))))))
```
Now, the issue. Currently, when `group-by`'s input crashes, the error is redirected to the output. When the error is consumed, the active group consumers are terminated along with the main process. This ordering of events is problematic when consumer backpressure prevents transfers (in this case, `?>` when maximum parallelism has been reached), because the group consumers are then kept alive and never terminate, which stalls the rest of the pipeline indefinitely.
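Applied to the reduced repro, the slot accounting looks like this (a sketch, assuming the pre-fix behavior described above): the group consumers for keys 0 and 1 occupy both concurrency slots, so the error never gets a slot to transfer on; with one extra slot it transfers immediately.

```
(require '[missionary.core :as m])

;; With (m/?> 2 ...) both slots are held by the consumers of groups 0 and 1,
;; so the error redirected by group-by can never be transferred -> stall.
;; With (m/?> 3 ...) one slot stays free for the error:
(m/? (m/reduce conj
       (m/ap (m/?> (val (m/?> 3 (->> (m/seed (range))
                                     (m/eduction (map (fn [x] (assert (< x 2)) x)))
                                     (m/group-by identity))))))))
;; => throws the AssertionError promptly instead of hanging
```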
Proposed fix: when `group-by`'s input crashes, propagate the error on the output and terminate all active group consumers immediately (i.e. don't wait for the main consumer to transfer the error).
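With that fix in place, the original repro (still at maximum parallelism 2) should fail fast. A minimal regression check (a sketch, assuming the fixed semantics):

```
(require '[missionary.core :as m])

;; On a fixed build this returns promptly with the AssertionError raised
;; by the input; pre-fix it hangs forever.
(try
  (m/? (m/reduce conj
         (m/ap (m/?> (val (m/?> 2 (->> (m/seed (range))
                                       (m/eduction (map (fn [x] (assert (< x 2)) x)))
                                       (m/group-by identity))))))))
  (catch Throwable e
    (println "failed fast:" (type e))))
```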
Thanks for the explanation and workaround - I've hammered the loop for the last 30 minutes, and it's holding up perfectly.
Fixed in b.28
I have a bunch of nested flows in a mock setup to simulate a data retrieval process.
The implementation I currently have is an expansion of the discussion on Slack.
In essence, what I'm trying to achieve is the following:
The tasks will be calling web services, so I've used `m/via` with `m/blk`. Calls to the services are made in parallel using `(m/?> par flow)`.
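For context, the pattern in question looks roughly like this (a minimal sketch; `fetch!`, `par`, and the input flow are placeholders, not the actual project code):

```
(require '[missionary.core :as m])

(defn fetch! [x]             ; placeholder for a blocking web-service call
  (Thread/sleep 100)
  {:key x :result (* x x)})

(def par 8)                  ; maximum number of in-flight calls

(defn call-services [input-flow]
  ;; (m/?> par flow) forks the ap body for up to `par` inputs at a time;
  ;; (m/via m/blk ...) runs the blocking call on the blocking thread pool
  ;; so it doesn't starve missionary's compute pool.
  (m/ap (let [x (m/?> par input-flow)]
          (m/? (m/via m/blk (fetch! x))))))

(comment
  ;; run 20 simulated calls, at most 8 concurrently
  (m/? (m/reduce conj (call-services (m/seed (range 20))))))
```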
The code consistently locks up in the same way; the thread dump indicates this location (full thread dump below).
Thread stack:
```
2023-02-19 10:06:09
Full thread dump OpenJDK 64-Bit Server VM (11.0.18+10-LTS mixed mode):

Threads class SMR info:
_java_thread_list=0x0000600002120120, length=23, elements={
0x000000013201a800, 0x00000001421c9800, 0x00000001421cc800, 0x00000001421e8000,
0x00000001421eb000, 0x00000001421ec000, 0x0000000133008800, 0x0000000133009000,
0x00000001320d9800, 0x0000000142788800, 0x000000014279f000, 0x0000000142797000,
0x0000000142797800, 0x00000001427a0000, 0x000000014279e000, 0x000000013251b800,
0x000000013103b800, 0x000000013300b800, 0x0000000132a49000, 0x0000000127838000,
0x000000013252a800, 0x0000000130010800, 0x0000000130830000
}

"main" #1 prio=5 os_prio=31 cpu=731.99ms elapsed=35.42s tid=0x000000013201a800 nid=0x1303 waiting on condition [0x000000016bb88000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.18/Native Method)
	- parking to wait for <0x00000003fe417208> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.18/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.18/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.18/AbstractQueuedSynchronizer.java:1039)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.18/AbstractQueuedSynchronizer.java:1345)
	at java.util.concurrent.CountDownLatch.await(java.base@11.0.18/CountDownLatch.java:232)
	at missionary.impl.Fiber$1$1.
```
I note that the mocking I'm doing heavily uses `m/sleep`, and there was this issue. The stack trace seems quite different, however.
I'll share a repro project very shortly