haskell / cabal-cache

CI assistant
BSD 3-Clause "New" or "Revised" License
45 stars 10 forks source link

Catch log and rethrow exceptions during download. #198

Closed newhoggy closed 1 year ago

hasufell commented 1 year ago

Will it propagate to the main thread?

newhoggy commented 1 year ago

I'm not sure, but I at least think it will fail the same way as before, but with logging. Once I know the exception, I will catch that exception explicitly and handle it. I'll also ensure failDownload is always called when an exception is called in a future PR, but for now I'd like it to fail as before so it's easy to spot which build is failing.

hasufell commented 1 year ago

Don't think it will fail as before, because we fixed the STM bug.

newhoggy commented 1 year ago

Your testing was without https://github.com/haskell-works/cabal-cache/pull/192?

hasufell commented 1 year ago

Your testing was without https://github.com/haskell-works/cabal-cache/pull/192?

Including.

The crashes were all due to STM deadlock. Since that is gone, I haven't observed it yet.

I think the code will just read the "done" tvar and conclude that there were artifacts that couldn't be downloaded and return zero exit code.

newhoggy commented 1 year ago

I got quite a number of failures with SerializeError:

https://github.com/haskell-works/cabal-cache/actions/runs/3747444801

Looks like something similar to https://github.com/haskell-works/cabal-cache/pull/192 is needed.

newhoggy commented 1 year ago

I'm also curious why https://github.com/haskell-works/cabal-cache/pull/196 wasn't sufficient to prevent the STM deadlock. It seems to me that it should have. So I'm still missing something.

hasufell commented 1 year ago

I think then the matter is a little more complicated and might be in https://github.com/haskell-works/cabal-cache/blob/newhoggy/handle-all-exceptions-from-download-reproducer/src/HaskellWorks/CabalCache/Concurrent/DownloadQueue.hs

Here, multiple threads read and write to the DownloadQueue data. If one thread is writing atomically while another is trying to read, that reading thread will have to retry. But if the writing thread already got GCed once the reading thread is able to actually retry, the STM will consider this a deadlock. At least this is what I understood from those discussions:

One workaround they proposed is using a foreign pointer to avoid GC. The other is to... well, catch the deadlock exception.

newhoggy commented 1 year ago

I think this is why. See:

https://github.com/haskell-works/cabal-cache/blob/newhoggy/handle-all-exceptions-from-download/src/HaskellWorks/CabalCache/Concurrent/DownloadQueue.hs#L43-L51

There is an STM.retry. The deadlock happens when all threads end up calling the STM.retry.

Threads end up in STM.retry when (R.ran dependencies \\ R.dom dependencies \\ failures) is non-empty.

Let's unpack this.

R.ran dependencies \\ R.dom dependencies are the packages that should be downloaded.

R.ran dependencies \\ R.dom dependencies \\ failures are the packages that should be downloaded by haven't failed.

If I fail to mark packages I've failed to download, then this will be non-empty, so we will get a deadlock.

This STM.retry is important because sometimes downloads are in flight. So when that set is non-empty, there should still be a free thread somewhere doing a download. It will either succeed which will remove that package from R.ran dependencies \\ R.dom dependencies or fail which will add that package to failures. Either way the set will become empty.

But if I get an exception, but I don't mark the download as failed then the set cannot become empty and we deadlock.

hasufell commented 1 year ago

That makes sense. So catching and returning DownloadFailure should be the right thing.

newhoggy commented 1 year ago

I put up a new PR and added an explanation: https://github.com/haskell-works/cabal-cache/pull/192

STM deadlock happens in the download queue itself and not in gate that waits for all threads to exit.

But I'm happy that the bug in the thread gate got fixed too.

hasufell commented 1 year ago

This STM.retry is important because sometimes downloads are in flight. So when that set is non-empty, there should still be a free thread somewhere doing a download. It will either succeed which will remove that package from R.ran dependencies \ R.dom dependencies or fail which will add that package to failures. Either way the set will become empty.

I'm not sure I understand the retry.

Thread A enters this retry when it has nothing to download. Thread B may have an on-going download. Whether thread B suceeds to download or fails has no impact on available things to download. Thread A is basically just waiting to return Nothing. It could do that already, without retrying?

newhoggy commented 1 year ago

I have an example:

Suppose you have you have package A and packages B, C & D depend on A and you have 3 threads.

Thread 1 starts downloading and installing A, threads 2 & 3 have to retry because they need to wait until A is downloaded and installed first. That is because if I install B, C or D before A is installed (let's say download of A fails due to timeout) then the cabal store becomes corrupt because B, C or D exists, but they depend on A which doesn't.

Once A is downloaded and installed, it is safe to now download B, C and D, so these become "ready".

Thread 1 spins around and takes B. The other two threads that were previously blocked wake up and download C and D respectively. B, C and D download and install concurrently.

newhoggy commented 1 year ago

The invariant I need to preserve is that no package that I've installed in the cabal store can depend on a package that is not yet installed.

hasufell commented 1 year ago

The invariant I need to preserve is that no package that I've installed in the cabal store can depend on a package that is not yet installed.

Yeah, I think this is what in fact broke with the STM bug. I've had "could not satisfy package-id" errors.