Index out of range in `ghc-events show` and dependencies of `ghc-events`

MangoIV commented 3 months ago

Hi! I have had a couple of problems with eventlog2html and hs-speedscope recently and they seem to be a problem either with the library or the eventlog that the ghc RTS emits. I can more or less (sometimes it doesn't happen) reliably reproduce this error with the following program:

module Main where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Foldable (for_)
import Data.Monoid (Endo (..))

largeEndo :: Endo ByteString
largeEndo = Endo \x -> BS.replicate 100 97 <> x

manyEndos :: Endo ByteString -> Endo ByteString
manyEndos = mconcat . replicate 10_000

main :: IO ()
main = for_ [1 .. 30] \_ ->
  BS.putStr $ appEndo (manyEndos largeEndo) mempty

compile with ghc -rtsopts -prof -fprof-late -O0 ./bla.hs run with ./bla +RTS -hc -p -l-au

I have not tried to further reduce the example.

This happens on ghc 9.6, 9.8 and 9.10 at least, according to @TeofilC on ghc-events HEAD, and then at least on ghc-events 0.19.0.1.

It affects not only ghc-events but also hs-speedscope and eventlog2html.

related issue on eventlog2html issue tracker: https://github.com/mpickering/eventlog2html/issues/136

MangoIV commented 3 months ago

I think we're failing in GHC.RTS.Events.Binary.getEvent :: EventParsers -> Get (Maybe Event) where we try to index into the parsers and instead of getting an EventTypeNum within range, we get a really big one.

TeofilC commented 3 months ago

Running ghc-events with an eventlog from the reproducer triggers an assertion failure at this line: https://github.com/haskell/ghc-events/blob/2168f610bf3580a3a4dca7b0582c6384d5433413/src/GHC/RTS/Events/Binary.hs#L758-L765 It seems like the payloadLen is wrong for certain EVENT_PROF_SAMPLE_COST_CENTRE events

MangoIV commented 3 months ago

jup, the last four parses before it blows up look like this:

evSpec: HeapProfSampleCostCentre {heapProfId = 0, heapProfResidency = 912, heapProfStackDepth = 1, heapProfStack = [78]}, etRef: 163
evSpec: ProfSampleCostCentre {profCapset = 369098752, profTicks = 4333765376, profStackDepth = 0, profCcsStack = []}, etRef: 167
evSpec: HeapProfSampleString {heapProfId = 0, heapProfResidency = 2029359963648, heapProfLabel = "\SO"}, etRef: 164
evSpec: CreateThread {thread = 22784}, etRef: 0

It looks like the parse of ProfSampleCostCentre introducest the corruption.

MangoIV commented 3 months ago

@TeofilC is it possible that this is the Ccs stack being too deep for the profStackDepth Word8? And then it overflows and we don't read the Vector to end?

MangoIV commented 3 months ago

That doesn't seem to be it, you're right, the payloadLen before that is already completely off.

TeofilC commented 3 months ago

Yes on the GHC side we truncate to 255, which should be fine

MangoIV commented 3 months ago

So afaiu this means that either the parsing of the header on the ghc-events side or the writing of the header on ghc RTS side goes wrong. (At least the previous events look fine, so I'm not assuming that they're already introducing the corruption)

TeofilC commented 3 months ago

Strangely enough I can't seem to reproduce this with -threaded. I get the impression that an EVENT_PROF_SAMPLE_COST_CENTRE and an EVENT_HEAP_PROF_SAMPLE_COST_CENTRE event being written at the same time is the cause of this. Yet, I'm not sure how that would be possible with the non-threaded runtime!

MangoIV commented 3 months ago

Also don’t they acquire the global eventBuf lock before writing?

TeofilC commented 3 months ago

I also can't reproduce if I only do heap profiling or time profiling with the non-threaded RTS.

So it seems like we need to be doing both with the non-threaded RTS.

I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog

TeofilC commented 3 months ago

Also don’t they acquire the global eventBuf lock before writing?

Ah but that only exists for the threaded RTS

MangoIV commented 3 months ago

I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog

and the -threaded safe guards against that because it does proper locking while the non-threaded RTS doesn’t but is still somehow concurrent? That’s weird

TeofilC commented 3 months ago

What seems to be happening is:

we are running a heap profile
we write half of a heap sample event
the time profile timer triggers
we pause what we are doing, and write a time sample event in the middle of our heap event
we unpause and finish writing our heap event

So we end up with something garbled.

This story is backed up by putting a bunch of traces inside the eventlog printing functions in the RTS. This is the order of events they suggest

MangoIV commented 3 months ago

So there must be a context switch somewhere in dumpCensus, even with the non-threaded RTS?

MangoIV commented 3 months ago

Is it possible that this is happening because the time profile is running asynchronously? (see initTimer -> initTicker -> createAttachedOSThread with handle_tick -> handleProfTick -> traceProfSampleCostCentre) maybe there should be a lock for writing to the eventlog?

MangoIV commented 3 months ago

so maybe it would work if we'd just keep the ACQUIRE_LOCK stuff in existence, even in the non-threaded runtime? :eyes:

MangoIV commented 3 months ago

@TeofilC I don't think that you observation is generally right - I was going to try if I can completely circumvent the issue by using -threaded or using only one of the two traces but I cannot - with a more elaborate example, which is probably a bit too bulky to share, I still really often (~30% of the time) get this problem.

mpickering commented 3 months ago

Is there perhaps some similar issue where the initialisation events (cost centre definitions) are being posted to the output, and that is interupted by a heap profile event before all of the definitions are dumped.

Does it happen if you are not using a profiled executable? (ie, don't compile with -prof and using -hT -l)

mpickering commented 3 months ago

Another thing to try is a longer profiling interval (-i10), does that fix the issue?

TeofilC commented 3 months ago

Interesting @MangoIV . It sounds like there's potentially multiple bugs. In your larger example, maybe you could try to find the last few events before the eventlog gets corrupted. That might help suggest which event is going wrong

TeofilC commented 3 months ago

I've written up the bug we found here: https://gitlab.haskell.org/ghc/ghc/-/issues/25165 We should keep looking for the other issues, but I wanted to make sure we didn't forget about this one

MangoIV commented 3 months ago

@mpickering

i10 doesn't fix the issue, this made it so that I got one of the eventlogs through, so I guess it decreases the probability for failure but most still fail
another thing to note is that I'm killing the application in question with keyboard signal, so that may be something to consider, as well
profiling-detail is late
it seems to happen at least a lot less often, if not not at all with -hT -l

mpickering commented 3 months ago

This seems to confirm what @TeofilC suggested about time profiling events interrupting the writing of other events and leading to corruption

MangoIV commented 3 months ago

it also happens when only using -hy, no time profiling. Is time profiling still doing things when we only request heap profiling with -l?

haskell / ghc-events

Index out of range in `ghc-events show` and dependencies of `ghc-events` #109