Open MangoIV opened 3 months ago
I think we're failing in GHC.RTS.Events.Binary.getEvent :: EventParsers -> Get (Maybe Event)
where we try to index into the parsers
and instead of getting an EventTypeNum
within range, we get a really big one.
Running ghc-events with an eventlog from the reproducer triggers an assertion failure at this line: https://github.com/haskell/ghc-events/blob/2168f610bf3580a3a4dca7b0582c6384d5433413/src/GHC/RTS/Events/Binary.hs#L758-L765 It seems like the payloadLen is wrong for certain EVENT_PROF_SAMPLE_COST_CENTRE events
jup, the last four parses before it blows up look like this:
evSpec: HeapProfSampleCostCentre {heapProfId = 0, heapProfResidency = 912, heapProfStackDepth = 1, heapProfStack = [78]}, etRef: 163
evSpec: ProfSampleCostCentre {profCapset = 369098752, profTicks = 4333765376, profStackDepth = 0, profCcsStack = []}, etRef: 167
evSpec: HeapProfSampleString {heapProfId = 0, heapProfResidency = 2029359963648, heapProfLabel = "\SO"}, etRef: 164
evSpec: CreateThread {thread = 22784}, etRef: 0
It looks like the parse of ProfSampleCostCentre
introducest the corruption.
@TeofilC is it possible that this is the Ccs stack being too deep for the profStackDepth Word8? And then it overflows and we don't read the Vector
to end?
That doesn't seem to be it, you're right, the payloadLen
before that is already completely off.
Yes on the GHC side we truncate to 255, which should be fine
So afaiu this means that either the parsing of the header on the ghc-events
side or the writing of the header on ghc
RTS side goes wrong. (At least the previous events look fine, so I'm not assuming that they're already introducing the corruption)
Strangely enough I can't seem to reproduce this with -threaded
. I get the impression that an EVENT_PROF_SAMPLE_COST_CENTRE and an EVENT_HEAP_PROF_SAMPLE_COST_CENTRE event being written at the same time is the cause of this. Yet, I'm not sure how that would be possible with the non-threaded runtime!
Also don’t they acquire the global eventBuf lock before writing?
I also can't reproduce if I only do heap profiling or time profiling with the non-threaded RTS.
So it seems like we need to be doing both with the non-threaded RTS.
I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog
Also don’t they acquire the global eventBuf lock before writing?
Ah but that only exists for the threaded RTS
I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog
and the -threaded safe guards against that because it does proper locking while the non-threaded RTS doesn’t but is still somehow concurrent? That’s weird
What seems to be happening is:
So we end up with something garbled.
This story is backed up by putting a bunch of traces inside the eventlog printing functions in the RTS. This is the order of events they suggest
So there must be a context switch somewhere in dumpCensus
, even with the non-threaded RTS?
Is it possible that this is happening because the time profile is running asynchronously? (see initTimer
-> initTicker
-> createAttachedOSThread
with handle_tick
-> handleProfTick
-> traceProfSampleCostCentre
) maybe there should be a lock for writing to the eventlog?
so maybe it would work if we'd just keep the ACQUIRE_LOCK
stuff in existence, even in the non-threaded runtime? :eyes:
@TeofilC I don't think that you observation is generally right - I was going to try if I can completely circumvent the issue by using -threaded
or using only one of the two traces but I cannot - with a more elaborate example, which is probably a bit too bulky to share, I still really often (~30% of the time) get this problem.
Is there perhaps some similar issue where the initialisation events (cost centre definitions) are being posted to the output, and that is interupted by a heap profile event before all of the definitions are dumped.
Does it happen if you are not using a profiled executable? (ie, don't compile with -prof
and using -hT -l
)
Another thing to try is a longer profiling interval (-i10
), does that fix the issue?
Interesting @MangoIV . It sounds like there's potentially multiple bugs. In your larger example, maybe you could try to find the last few events before the eventlog gets corrupted. That might help suggest which event is going wrong
I've written up the bug we found here: https://gitlab.haskell.org/ghc/ghc/-/issues/25165 We should keep looking for the other issues, but I wanted to make sure we didn't forget about this one
@mpickering
i10
doesn't fix the issue, this made it so that I got one of the eventlogs through, so I guess it decreases the probability for failure but most still failprofiling-detail
is late
-hT -l
This seems to confirm what @TeofilC suggested about time profiling events interrupting the writing of other events and leading to corruption
it also happens when only using -hy
, no time profiling. Is time profiling still doing things when we only request heap profiling with -l
?
Hi! I have had a couple of problems with
eventlog2html
andhs-speedscope
recently and they seem to be a problem either with the library or the eventlog that the ghc RTS emits. I can more or less (sometimes it doesn't happen) reliably reproduce this error with the following program:compile with
ghc -rtsopts -prof -fprof-late -O0 ./bla.hs
run with./bla +RTS -hc -p -l-au
I have not tried to further reduce the example.
This happens on
ghc
9.6, 9.8 and 9.10 at least, according to @TeofilC onghc-events
HEAD, and then at least onghc-events
0.19.0.1.It affects not only
ghc-events
but alsohs-speedscope
andeventlog2html
.related issue on eventlog2html issue tracker: https://github.com/mpickering/eventlog2html/issues/136