spl opened this issue 9 years ago
@spl Thanks for the report! Also, I see your point about nested `mapTasks`, although I'm not sure what to do there. async-pool is creating all the tasks immediately, but only spawning threads for a few jobs. But as you say, within those jobs is another `mapTasks`, which will indeed deadlock waiting for a free slot.
How is this for a solution: each time we spawn a thread within a task group, we remember the `ThreadId` used for that job. Then, any time a new job is enqueued for the same task group, and we see that it's being enqueued from within that `ThreadId` and that we will be waiting on the result, we allow those jobs to additionally execute within that thread (at worst serially, but if slots become free they can run concurrently again).
I'm not sure exactly how this will work in the code, but we do need some way for the blocked worker thread to make progress in these situations. I'll create a test to reproduce the deadlock and then see what I can do.
Thanks for trying async-pool! If you think it's ready for prime time, I was thinking of announcing it on Haskell-Cafe this weekend.
As for `mapTasks_`: the main difference here is that it does not wait before returning. Perhaps this is making it appear as though the tasks aren't running, because they haven't had a chance to yet? `mapTasks` does wait, which is perhaps too different a semantic to justify simply dropping the `_`.
To further clarify the deadlock-resolving solution: when calling `wait` on a job, and that job was enqueued within the current `ThreadId`, replace the job in the task group with a thunk waiting on a `TMVar`, and immediately execute the corresponding `IO` action in that thread, putting to the `TMVar` on completion. This way we can commandeer execution to avoid the delay, while not upsetting anyone depending on the result via the task group.
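A rough sketch of that idea, with an invented `Pending` slot type standing in for async-pool's real internals (`waitCommandeer` and its `normalWait` fallback are hypothetical; and as noted further down in this thread, the approach turned out not to be implementable, because only STM is available at the point of deadlock):

```haskell
import Control.Concurrent (ThreadId, myThreadId)
import Control.Concurrent.STM

-- Hypothetical pending job: the thread that enqueued it plus its action.
data Pending a = Pending ThreadId (IO a)

-- Wait on a job slot. If the job was enqueued from the current thread,
-- commandeer it: swap in a thunk that blocks on a TMVar, run the IO
-- action right here, and publish the result for any other waiters.
waitCommandeer :: IO a -> TVar (Pending a) -> IO a
waitCommandeer normalWait slot = do
  me  <- myThreadId
  var <- newEmptyTMVarIO
  mAction <- atomically $ do
    Pending owner action <- readTVar slot
    if owner == me
      then do
        -- Others depending on this job via the task group now block
        -- on the TMVar instead of running the original action.
        writeTVar slot (Pending owner (atomically (takeTMVar var)))
        pure (Just action)
      else pure Nothing
  case mAction of
    Just action -> do
      r <- action                   -- make progress in this thread
      atomically (putTMVar var r)   -- hand the result to other waiters
      pure r
    Nothing -> normalWait           -- ordinary case: block as usual
```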
@jwiegley Thanks for the response. I don't have time to answer fully right now, but there are a couple of things I wanted to mention.
I did see the code containing `mapTasks_` complete, so it makes sense that it completed because it doesn't wait. Each of the tasks forked by `mapTasks_` performed database updates, and they were not waiting on anything other than free slots. I never saw those database updates. I'm not sure if those threads were somehow killed or what.
I ran the new `mapTasks`-based code on a production migration last night. Unfortunately, something made it get stuck in the middle. No idea what it was.
@spl Ok, I really want to get to the bottom of this, so anything I can do to help, let me know.
I tried implementing the deadlock avoidance that I described above, but it is not possible: At the point where the deadlock happens, I don't have access to IO, only STM.
So what about this for an alternate solution: Waiting, within an active task, on a task spawned from that thread throws an exception, with a suggestion to use a secondary task group?
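A hypothetical shape for that check, with an invented `Task` record (async-pool's actual representation differs); the point is simply to compare the waiting thread against the thread that enqueued the task before blocking:

```haskell
import Control.Concurrent (ThreadId, myThreadId)
import Control.Exception (throwIO)
import Control.Monad (when)

-- Invented record: what the check needs to know about a task.
data Task a = Task
  { taskOwner   :: ThreadId  -- thread that enqueued the task
  , taskPending :: IO Bool   -- still waiting for a free slot?
  , taskWait    :: IO a      -- the normal blocking wait
  }

waitChecked :: Task a -> IO a
waitChecked t = do
  me      <- myThreadId
  pending <- taskPending t
  -- Waiting on a still-unscheduled task that this very thread enqueued
  -- can never succeed while this thread holds the slot, so fail loudly.
  when (pending && taskOwner t == me) $
    throwIO (userError "wait: task was enqueued from this worker \
                       \thread; use a secondary task group instead")
  taskWait t
```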
I feel I'm getting a bit lost. There are two independent issues, right?

1. `mapTasks_` semantics and implementation
2. Deadlock from nested tasks in the same `TaskGroup`

To keep the discussion focused, I suggest creating a new GitHub issue for the second one. I think it's much more difficult to resolve. At this point, I would add documentation that functions such as `mapTasks` create all tasks immediately, so you do not want to have one of those tasks using the same `TaskGroup`.
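For concreteness, a hypothetical sketch of the pattern being warned against (assuming async-pool's `withTaskGroup` and a `mapTasks :: TaskGroup -> [IO a] -> IO [a]` shape): every outer task blocks in the inner `mapTasks`, and once all slots are held by outer tasks, the inner ones can never be scheduled.

```haskell
import Control.Concurrent.Async.Pool (withTaskGroup, mapTasks)

-- DON'T: the inner mapTasks enqueues on the same TaskGroup whose slots
-- the outer tasks already occupy. With both slots held by outer tasks,
-- the inner tasks never get a slot, and the outer tasks never return.
main :: IO ()
main = withTaskGroup 2 $ \g -> do
  _ <- mapTasks g
         [ mapTasks g [ pure (i * j) | j <- [1 .. 10 :: Int] ]
         | i <- [1 .. 10 :: Int]
         ]
  pure ()
```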
As for the first issue, were you able to reproduce my problem? If not, I'll try to make a small test case. It should be easy to do using my app+database setup, but I can try to do it without the database, perhaps with a `threadDelay` or something.
That said, time to respond to some of your comments:
> As for `mapTasks_`: the main difference here is that it does not wait before returning. [...] `mapTasks` does wait, which is perhaps too different a semantic to justify simply dropping the `_`.
I do think it is unexpected to have the `_` version of a function be so radically different from the non-`_` version. I glanced at the implementation of `mapTasks_` when I ran into the problem. I was, admittedly, surprised to see that it wasn't `mapTasks` with `void`, but I didn't dig into it further. If it is the case that `mapTasks_` will be so different, it should be documented in bold.
> So what about this for an alternate solution: Waiting, within an active task, on a task spawned from that thread throws an exception, with a suggestion to use a secondary task group?
This sounds like some form of deadlock detection. How certain can you be that the exception doesn't get thrown for a “legitimate” wait?
My apologies for confusing this ticket. I'll continue discussing `mapTasks_`, but will open a new issue for the deadlock problem.
As for `mapTasks_`, a reproducible test case would be a great help. As for the semantic difference, I think we should just underscore it in bold in the documentation for now.
@spl After thinking about it a lot more, I think that `mapTasks_ f xs` should be equivalent to `void $ mapTasks f xs`. If one wishes to start a group of tasks without waiting for any result, they can use `mapM_ async xs`.
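In code, the proposed semantics would look something like this (a sketch assuming `mapTasks :: TaskGroup -> [IO a] -> IO [a]` and async-pool's `async :: TaskGroup -> IO a -> IO (Async a)`; note the fire-and-forget variant needs the group passed to `async`):

```haskell
import Control.Monad (void)
import Control.Concurrent.Async.Pool (TaskGroup, async, mapTasks)

-- Proposed: mapTasks_ submits the tasks, waits for them all, and
-- discards the results.
mapTasksProposed_ :: TaskGroup -> [IO a] -> IO ()
mapTasksProposed_ g xs = void (mapTasks g xs)

-- Fire-and-forget: start the tasks without waiting on any result.
fireAndForget :: TaskGroup -> [IO a] -> IO ()
fireAndForget g xs = mapM_ (async g) xs
```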
That makes sense, and it would be good to document the `async` alternative.
Sorry I haven't gotten back to you with the test case. I've been rather busy lately. I've got a deadline next weekend, so I'm busy this weekend as well.
Do you still want the test case for `mapTasks_`?
@spl Whenever you have the time, I'd really like to know what's going on. Thanks!
This is a great library; I'd love it if this one got fixed. Maybe I can help somehow?
@alvare That would be great! If you could write some tests to demonstrate the exact problem, it shouldn't be too hard to find the issue and fix it.
Unrelated, but Simon Marlow has pointed out a potential race condition that simply can't be fixed -- but we should have tests to demonstrate it, at the very least as documentation. For example, if you have a task group with 1 slot, and a task A that manually depends on B (using an `MVar`, instead of the dependency logic in async-pool), and A gets started first by the scheduler, we'll deadlock without detection.
The answer in this case is to make use of the dependency graph in async-pool itself, but it does raise the point that manually using blocking constructions can lead to undetected deadlocks. This doesn't happen with regular async, because every thread is allowed to start right away.
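A sketch of that scenario (a hypothetical program using async-pool's `withTaskGroup`, `async`, and `wait`; whether it deadlocks depends on which task the scheduler starts first):

```haskell
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.Async.Pool (withTaskGroup, async, wait)

-- One slot: if A is started first, it holds the slot while blocking on
-- the MVar that only B would fill, and B never gets a slot. The MVar
-- dependency is invisible to async-pool, so nothing detects the cycle.
main :: IO ()
main = withTaskGroup 1 $ \g -> do
  done <- newEmptyMVar
  a <- async g (takeMVar done)    -- task A: manually depends on B
  b <- async g (putMVar done ())  -- task B: would unblock A
  mapM_ wait [a, b]
```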
@jwiegley well that's easy:
https://gist.github.com/alvare/67c5717affe6bba891f3
Replace `mapTasks_` with `mapTasks` and it prints the numbers. `mapTasks_` is just not waiting for its tasks to finish; add a simple `threadDelay` and you see the numbers get printed.
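The behavior alvare describes boils down to something like this sketch (along the lines of the gist, not its exact contents; `mapTasks_ :: TaskGroup -> [IO a] -> IO ()` is assumed):

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async.Pool (withTaskGroup, mapTasks_)

main :: IO ()
main = withTaskGroup 4 $ \g -> do
  mapTasks_ g [ print i | i <- [1 .. 10 :: Int] ]
  -- With the buggy mapTasks_ nothing is printed: main returns before
  -- the tasks get a chance to run. Sleeping here gives them time to
  -- finish, and the numbers appear.
  threadDelay 1000000
```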
@alvare Thanks for doing that.
@jwiegley I'm sorry I didn't get around to writing something up. I had moved on to other problems and never made time for this one.
Is this problem with `mapConcurrently` the same as the one being discussed?

```
λ> withTaskGroup 2 $ \g -> mapConcurrently g (const $ getCurrentTime) [1, 2, 3]
*** Exception: thread blocked indefinitely in an STM transaction
```
It looks like the latest version on GitHub doesn't have this bug anymore (at least the example above definitely works), but I'm not sure that other functions aren't buggy.
`mapConcurrently` works for me too on the current `HEAD`, but not with 0.9.0.2 on Hackage, using a more involved usage pattern. Perhaps worth a version bump?
What's the status of the `mapTasks_` issue? I have a test program (but not a 'test'): https://github.com/jchia/async-pool-test
The expected output is `12`, but the actual output is `2`. It repros every time for me.
Sorry about the vague title. I haven't looked into the cause, but I thought it was significant enough to report.
I have the following code:
This was the first time I tried async-pool, so when I ran it, I was quite surprised that the `tasks` did not run. I swapped in `mapTasks` for `mapTasks_`:
And the `tasks` ran just fine.
Since the problem seems to be pretty clearly due to `mapTasks_`, I hope it's an easy fix for you. Otherwise, if you need a test case or something, let me know.