haskell / ghc-events

Library and tool for parsing .eventlog files from GHC
http://www.haskell.org/haskellwiki/ThreadScope
Other
33 stars 34 forks source link

Bounded memory sorting #32

Open facundominguez opened 6 years ago

facundominguez commented 6 years ago

While writing an analysis tool, the need for sorting the events in an eventlog file arose. The problem with the existing GHC.RTS.Events.sortEvents is that it is not memory bounded, leading to heap exhaustion on large inputs.

How about we provide a memory bounded version?

I ended up writing something similar to this. It splits events per capability to different temporary files, and then merges the events back into a single list.

I could submit a PR if it looks useful to others.

This version is a bit wasteful in that it has to deserialize and serialize all the events, where for the sake of splitting the events, only the capability number and the encoded form of the event would be needed.

Another odd aspect is that I couldn't make it stream when using Data.ByteString.Builder to write the events to the temporary files. Heap would be exhausted the same.

maoe commented 6 years ago

Sorry for the delayed response.

I haven't looked into the patch yet but in general I think external sorting is a useful addition to the API.

osa1 commented 5 years ago

I recently had a similar problem. Loading a 1.3G eventlog uses more than 32G memory. Profiling output shows that most allocations are done by sortEvents and EventParserUtils.getString.

bgamari commented 4 years ago

Indeed, I came here to suggest precisely @facundominguez's patch. Sorting via secondary storage would be great; I routinely run into issues like this when processing eventlogs.

bgamari commented 4 years ago

This is addressed by #61.

TeofilC commented 2 years ago

I feel like it should be possible to read an eventlog in sorted order with constant memory overhead and without an intermediate file. You'd have a cursor into the file for each capability. Each of those cursors would ignore blocks that belong to another capability (by looking at the block marker, which tells us the size of the block). Then you can get a sorted eventlog by just keeping the cursors sorted by timestamps and always taking the minimum, as events are sorted within a block. This would avoid the need for an intermediate file. In essence, this is the same algorithm as merging N sorted linked-lists.

Does that sound reasonable? If so, I can try to implement it at some point when I have time

Mikolaj commented 2 years ago

Hi @TeofilC. That was a long time ago, so perhaps it's changed, but I remember fixing bugs in Threadscope that came from assuming evens for a single capability come sorted. If they still don't, I think the perturbation is probably minimal, one or two positions, but it's worth checking on a few different computers (slow, fast, under load, etc.) and, if needed. be prepared to react. OTOH, perhaps memory deceives me and it was events for threads, not HECs, coming out of order or events that are expected not arriving at all.

TeofilC commented 10 months ago

I never got around to working on this. Since then my thinking has shifted. I'm not really sure if eventlogs need to be sorted at all.

I think most use cases are better solved by having some sort of index of how the different Capabilities' streams are interleaved. This can be produced without having to modify the eventlog at all and allow us to efficiently read the events in sorted order.