Closed newhoggy closed 1 year ago
I'm not sure, but I at least think it will fail the same way as before, but with logging. Once I know the exception, I will catch that exception explicitly and handle it. I'll also ensure failDownload
is always called when an exception is called in a future PR, but for now I'd like it to fail as before so it's easy to spot which build is failing.
Don't think it will fail as before, because we fixed the STM bug.
Your testing was without https://github.com/haskell-works/cabal-cache/pull/192?
Your testing was without https://github.com/haskell-works/cabal-cache/pull/192?
Including.
The crashes were all due to STM deadlock. Since that is gone, I haven't observed it yet.
I think the code will just read the "done" tvar and conclude that there were artifacts that couldn't be downloaded and return zero exit code.
I got quite a number of failures with SerializeError
:
https://github.com/haskell-works/cabal-cache/actions/runs/3747444801
Looks like something similar to https://github.com/haskell-works/cabal-cache/pull/192 is needed.
I'm also curious why https://github.com/haskell-works/cabal-cache/pull/196 wasn't sufficient to prevent the STM deadlock. It seems to me that it should have. So I'm still missing something.
I think then the matter is a little more complicated and might be in https://github.com/haskell-works/cabal-cache/blob/newhoggy/handle-all-exceptions-from-download-reproducer/src/HaskellWorks/CabalCache/Concurrent/DownloadQueue.hs
Here, multiple threads read and write to the DownloadQueue
data. If one thread is writing atomically while another is trying to read, that reading thread will have to retry. But if the writing thread already got GCed once the reading thread is able to actually retry, the STM will consider this a deadlock. At least this is what I understood from those discussions:
One workaround they proposed is using a foreign pointer to avoid GC. The other is to... well, catch the deadlock exception.
I think this is why. See:
There is an STM.retry
. The deadlock happens when all threads end up calling the STM.retry
.
Threads end up in STM.retry
when (R.ran dependencies \\ R.dom dependencies \\ failures)
is non-empty.
Let's unpack this.
R.ran dependencies \\ R.dom dependencies
are the packages that should be downloaded.
R.ran dependencies \\ R.dom dependencies \\ failures
are the packages that should be downloaded by haven't failed.
If I fail to mark packages I've failed to download, then this will be non-empty, so we will get a deadlock.
This STM.retry
is important because sometimes downloads are in flight. So when that set is non-empty, there should still be a free thread somewhere doing a download. It will either succeed which will remove that package from R.ran dependencies \\ R.dom dependencies
or fail which will add that package to failures
. Either way the set will become empty.
But if I get an exception, but I don't mark the download as failed then the set cannot become empty and we deadlock.
That makes sense. So catching and returning DownloadFailure
should be the right thing.
I put up a new PR and added an explanation: https://github.com/haskell-works/cabal-cache/pull/192
STM deadlock happens in the download queue itself and not in gate that waits for all threads to exit.
But I'm happy that the bug in the thread gate got fixed too.
This STM.retry is important because sometimes downloads are in flight. So when that set is non-empty, there should still be a free thread somewhere doing a download. It will either succeed which will remove that package from R.ran dependencies \ R.dom dependencies or fail which will add that package to failures. Either way the set will become empty.
I'm not sure I understand the retry.
Thread A enters this retry when it has nothing to download. Thread B may have an on-going download. Whether thread B suceeds to download or fails has no impact on available things to download. Thread A is basically just waiting to return Nothing
. It could do that already, without retrying?
I have an example:
Suppose you have you have package A
and packages B
, C
& D
depend on A
and you have 3
threads.
Thread 1 starts downloading and installing A
, threads 2 & 3 have to retry
because they need to wait until A
is downloaded and installed first. That is because if I install B
, C
or D
before A
is installed (let's say download of A
fails due to timeout) then the cabal store becomes corrupt because B
, C
or D
exists, but they depend on A
which doesn't.
Once A
is downloaded and installed, it is safe to now download B
, C
and D
, so these become "ready".
Thread 1 spins around and takes B
. The other two threads that were previously blocked wake up and download C
and D
respectively. B
, C
and D
download and install concurrently.
The invariant I need to preserve is that no package that I've installed in the cabal store can depend on a package that is not yet installed.
The invariant I need to preserve is that no package that I've installed in the cabal store can depend on a package that is not yet installed.
Yeah, I think this is what in fact broke with the STM bug. I've had "could not satisfy package-id" errors.
Will it propagate to the main thread?