haskell / ghc-events

Library and tool for parsing .eventlog files from GHC
http://www.haskell.org/haskellwiki/ThreadScope
Other
33 stars 34 forks source link

Testsuite write-merge failing #14

Open jberthold opened 8 years ago

jberthold commented 8 years ago

The new test suite write-merge fails on master (f5ac7a0) with an index-out-of-bounds error:

Test suite write-merge: RUNNING...
write-merge: Ix{Int}.index: Index (39) out of range ((0,19))
Test suite write-merge: FAIL

(This is with GHC-7.8.4) but does not seem to relate to the GHC version used to compile the library. A quick first analysis indicates that serialiseEventLog is causing the failure for older trace files. It seems as though some assumptions about the admissible events are baked into the code?

GHCI session:

*WriteMerge> testFile "test/queens-ghc-7.0.2.eventlog"
Loading package array-0.5.0.0 ... linking ... done.
Loading package deepseq-1.3.0.2 ... linking ... done.
Loading package bytestring-0.10.4.0 ... linking ... done.
Loading package containers-0.5.5.1 ... linking ... done.
Loading package binary-0.7.1.0 ... linking ... done.
*** Exception: Ix{Int}.index: Index (39) out of range ((0,22))
*WriteMerge> testFile "test/queens-ghc-6.12.1.eventlog"
*** Exception: Ix{Int}.index: Index (39) out of range ((0,19))
*WriteMerge> 

GHC-6.12.3 knew 19 event types, GHC-7.0.2 knew 22, test files produced by newer GHCs contain many more event types, and those files work as expected:

*WriteMerge> testFile "test/mdlLogMPI1.eventlog"
test/mdlLogMPI1.eventlog: success
True
*WriteMerge> testFile "test/782stop.eventlog"
Different output at line 95:
Original:  10678973: cap 0: stopping thread 1 (blocked on black hole owned by thread 2)
New:       10678973: cap 0: stopping thread 1 (blocked reading an MVar)
test/782stop.eventlog: failure re-written file does not match the original
False
*WriteMerge> testFile "test/pre77stop.eventlog"
test/pre77stop.eventlog: success
True
*WriteMerge> testFile "test/783stop.eventlog"
test/783stop.eventlog: success
True
*WriteMerge> testFile "test/mandelbrot-mmc-2011-06-14.eventlog"
*** Exception: eventTypeNum UnknownEvent
*WriteMerge> :q

The failure with GHC-7.8.2 is expected (see code https://github.com/haskell/ghc-events/blob/master/GHC/RTS/Events.hs#L1184 ) . One could probably modify the file header of the written log file (adding the event that identifies GHC-7.8.3) to fix it - the format which is written is the one of 7.8.3.

Unclear what causes the failure with mandelbrot-mmc-2011-06-14.eventlog. This file has many event types, whose numbers are neither consecutive nor ordered in the file, which might be the reason.

kvelicka commented 8 years ago

Hi,

First and foremost, thank you very much for a detailed and extensive description of the issue!

I'm aware of this bug and I initially thought that it was introduced along with the incremental parser that recently got merged into master. However, at the beginning of this year I ran a regression test of sorts - I retrofitted [1] ghc-events-0.4.4.0 with this suite and found the same failures, if I remember correctly. While this says very little about the cause of the issues, it makes me wonder just how long has this functionality been broken for.

I think that your analysis may very well be correct, especially given the currently known issues with varying event types (i.e. headers) being emitted by different GHC versions. However, I haven't gotten around to investigating this more in recent days. I'd like to do so in the future but that is unlikely to happen in the coming couple of months at least. However, if you work is affected by this issue, I'll assign it a higher priority than I would otherwise. In addition, if you're interested in poking around this further, I can either try to provide assistance or would at least make use of your findings when I get around to working on this further.

Last but not least - and completely out of curiosity - you are one of the few people that are showing active interest in ghc-events lately. Are you using it in some significant capacity at your work/projects? If so, I'd be interested in hearing experiences/feature requests/information that may aid the development and the feature roadmap for future releases of the library.

[1] Note that the write-merge test suite was a part of my recent PR, there were no tests other than [a smaller] test-versions previously.

jberthold commented 8 years ago

Hi Karl,

retrofitted [1] ghc-events-0.4.4.0 with this suite and found the same failures, if I remember correctly. While this says very little about the cause of the issues, it makes me wonder just how long has this functionality been broken for.

The put functionality was probably never used extensively, therefore its bug went undetected. Also, events defined once should never change (design principle of ghc-events), but they did in fact change in the past, so there will be failing cases. That said, the root cause is probably an accidental detail in the internals of the putEventLog.

Maybe I can drill in a bit more for this bug, but I have limited time to spend. I am doing this in the evenings - and also need to spend some time updating clients with this latest version. (Also to answer your last question:) I know that you found ghc-events-parallel my earlier drop-in replacement for ghc-events with support for the Parallel Haskell runtime system's events. Now that this is merged in, the two clients for parallel Haskell, Eden-TraceLab and EdenTV (which btw predates ThreadScope by ~4 years) should use the standard ghc-events.

kvelicka commented 8 years ago

I see, consolidating the two libraries would definitely be helpful to the ecosystem.

As for the bug, sounds like one of us will have to look more into it as we get some free time. I hopefully should, in about a moth or two from now.

maoe commented 7 years ago

I just briefly looked at this issue.

At least for the exception thrown for queens-ghc-6.12.1.eventlog the problem happens like this:

  1. In the eventlog, there is the deprecated EVENT_STEAL_SPARK event type whose event ID is 5 in the header.
  2. We parse events whose event type is EVENT_STEAL_SPARK as SparkSteal, which is also used for the newer EVENT_SPARK_STEAL event type.
  3. When serializing the events, we serialize them as EVENT_SPARK_STEAL, whose event ID is 39 which does not exist in the header.
  4. When re-deserializing the events, the deserializer tries to run the parser for event ID 39 and ends up with an index out of range error.

I'm still not sure what the best way to address this. Would we like to replace these deprecated events with the alternatives when deserializing, or restore original events when serializing?

kvelicka commented 7 years ago

I think it's fine to emit the newer alternatives when serialising -- I'm not aware of any other notable consumers of eventlogs besides ghc-events.

In addition, this has been broken literally for years, at this point a solution would be a good solution.

jberthold commented 7 years ago

Thanks @maoe for bumping this. I think the problem is inherent to the property tested by write-merge: Can a round-trip of writing back and re-parsing an eventlog yield the same result? This is problematic in the general case because particular properties of the different generations of GHC writing the event log are encoded in the parser, and this information is not available to the serialiseLog. See https://github.com/haskell/ghc-events/blob/58aefed5f39e0ae1d4bb5f69ee9a334f68bb9aea/src/GHC/RTS/Events/Binary.hs#L881 for the example of the stopThread treatment: I had to drop the special case of GHC-7.8.2 on the floor. The example analysed above is a different case, one where two distinct events in the log generate the same information in the HS data structure. In both cases the log can only be written back one way, so there is no fix for the property...

Just though: If the serialseLog function wrote the header after all events used in the log are known, the resulting log file could probably be parsed without problems, yielding the exact same HS data structure. I am guessing this would work because the array out of bounds originates from the shorter header, and the 7.8.2 special parse for stopThread is activated by a certain header (not tested, just a conjecture for now). Or maybe one could even write a full header of all known events when the log is written?

maoe commented 7 years ago

because particular properties of the different generations of GHC writing the event log are encoded in the parser, and this information is not available to the serialiseLog

I'm not totally sure if this is the case. The set of parsers varies depending on the GHC version as you said but it's constructed from the header content, which is also available in the serializer. I'll take a deeper look into this issue when I have time.

mboes commented 7 years ago

Does anyone even need the write-back-to-disk functionality? It seems like an odd feature to me to be able to write our own event logs when we're not the compiler. If we got rid of it, a lot of code would go away...

maoe commented 7 years ago

@mboes I think the write back functionality can be useful. For example, we can write a tool to slice and/or filter specific events from a huge eventlog file to a smaller chunk. It's like the tcpslice for pcap files.

maoe commented 7 years ago

Also it's used publicly at least in this package: https://hackage.haskell.org/package/linux-perf

kvelicka commented 7 years ago

I've been told that it's also useful for assembling the full eventlog for a run of a distributed program where all the workers would be writing their own eventlogs.

jberthold commented 7 years ago

...assembling the full eventlog for a run of a distributed program where all the workers would be writing their own eventlogs.

You might think... (we did for some time when we built EdenTV), but actually the first thing any log-consuming software will do is take it apart into the separate eventlogs :) That said, I would not remove the functionality, rather think a bit about what the expected behaviour should be and what compromises are acceptable in this.

maoe commented 7 years ago

Sorry, I didn't mean to close this.

osa1 commented 5 years ago

I haven't debugged this myself yet, but my understanding is that this problem occurs because there are multiple ways to parse an event (in other words we map different events to the same Event) but we only have one serializer for any given event. Looking at the original report and @maoe's comment, it seems to me that the problem occurs like this:

So basically when parsing we map more stuff to less stuff and lose information.

Two ways to deal with this:

But before putting more effort into this I'd suggest dropping support for some ancient GHC versions. I don't know what the policy for dropping support and what's the support window, but I can see in the .cabal file that the earliest GHC we support is currently 7.8.4. So perhaps just drop GHC 6 tests, add GHC 7 (or 8) tests, drop GHC 6 parsers etc. Also, this problem will disappear if we drop GHC 6 support and none of the GHC 7 and 8 events have different formats in different GHC versions (I don't know if this is really the case, haven't checked yet).

Any suggestions?