haskell / ghc-events

Library and tool for parsing .eventlog files from GHC
http://www.haskell.org/haskellwiki/ThreadScope
Other
33 stars 34 forks source link

Event parsers and serializers should have roundtrip property #42

Open osa1 opened 5 years ago

osa1 commented 5 years ago

By "roundtrip property" I mean for any event if I serialize it with putEvent, getEvent on the resulting ByteString should give me the original event. #41 shows that we currently don't have this property.

Motivation

I'm trying to improve large eventlog handling of ThreadScope [1]. For this I need some kind of filesystem-backed data structure to be able to serialize unused events [2] to a file and read the ones that are needed. As shown in #41 we currently can't do this.

Problem

The reason why we currently don't have this property is GHC expects that event parsers should be able to parse an event using only a prefix of the serialization of the event, and ghc-events is implemented accordingly. This means that GHC can add more fields to an event as long as (1) it updates the event size in the .eventlog file header (2) it puts the new field at the end of the event. [3] is an example where this happens. Event parsers can then check size of an event using the .eventlog file header, and skip any space between the assumed size of the event and what the header says the event's size is [4].

The problem occurs because the serializer (putEvent and friends) do not take these extra fields into account and serializes events based on their assumed size by ghc-events [5]. So if we try to parse the serialized event using the original parser used to read the .eventlog file the parse fails.

Proposed fix

I propose fixing this by adding a new "extras" field to events:

data Event = Event
  { evTime   :: {-# UNPACK #-}!Timestamp
  , evSpec   :: EventInfo
  , evCap    :: Maybe Int
  , evExtras :: ByteString -- NEW FIELD
  } deriving Show

Parsers can then put the skipped data in this field and serializers can directly serialize that field after serializing the event.

Because this is a breaking change a major version bump will be needed. A major user is ThreadScope, which already has an upper bound <0.9.0 so I don't think this will cause major breakage, and I'll also update ThreadScope so that the next version will be compatible with new ghc-events.


[1]: https://github.com/haskell/ThreadScope/issues/29 [2]: e.g. with the help of an LRU cache [3]: https://phabricator.haskell.org/D3658 [4]: See padParser and its call site for the implementation of this. [5]: In putEventSpec


How does that sound? @maoe I think these days you're maintaining the library, is there anyone else I should ping about this? The implementation should be simple but I just wanted to get some opinions before investing more into it.

osa1 commented 5 years ago

I implemented this and it works nicely. I can parse a 225M eventlog file and for every event check if I can serialize and deserialize it. I'll submit a PR soon.