rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

Reference request: distributed record and replay in the context of rr #2710

Open deliciouslytyped opened 3 years ago

deliciouslytyped commented 3 years ago

rr brings me joy. rr is also quite hard to search for. There appear to be academic articles for record and replay in distributed systems (for example for HPC). Has anyone looked into this direction for rr?

I see no reason why one couldn't just run rr record for every node, but I imagine that the information for synchronizing multiple debugger sessions doesn't exist in a recording. I don't know if there would be any other issues?

rocallahan commented 3 years ago

I'm not aware of anyone working on this with rr.

I've thought about it a little bit. If all you have is a regular rr trace using Linux syscalls, then I think you need some heuristics to match up socket senders and receivers (e.g. using port numbers), and then with some more assumptions you can match up messages. That lets you build a happens-before graph, from which you can build a plausible total order for rr events across all traces. If syscallbuf is enabled you probably need to do analysis that requires replaying all traces and capturing parameters to buffered socket syscalls. (In Pernosco we already have that analysis.) This seems fairly doable if you're willing to live with occasional failures due to bad luck in port numbering.
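To sketch what that matching and ordering step might look like (purely illustrative: the Event record, its fields, and the per-trace loading are hypothetical stand-ins, not rr's actual trace format):

```python
# Sketch: build a happens-before order across per-node traces by matching
# socket sends to receives on (src, dst) address/port pairs, then topologically
# sorting. The Event record and its fields are hypothetical, not an rr API.
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    trace: str    # which node's trace this event came from
    index: int    # event number within that trace
    kind: str     # "send" or "recv" (other event kinds omitted here)
    src: tuple    # (ip, port) of the sender
    dst: tuple    # (ip, port) of the receiver
    seq: int      # nth message observed on this (src, dst) pair

def happens_before_edges(events):
    """Intra-trace program order plus send->recv edges for matched messages."""
    edges = []
    by_trace = defaultdict(list)
    for e in events:
        by_trace[e.trace].append(e)
    for trace_events in by_trace.values():
        trace_events.sort(key=lambda e: e.index)
        edges += list(zip(trace_events, trace_events[1:]))   # program order
    sends = {(e.src, e.dst, e.seq): e for e in events if e.kind == "send"}
    for e in events:
        if e.kind == "recv" and (e.src, e.dst, e.seq) in sends:
            edges.append((sends[(e.src, e.dst, e.seq)], e))  # message edge
    return edges

def plausible_total_order(events):
    """Topological sort of the happens-before graph (Kahn's algorithm)."""
    succ, indeg = defaultdict(list), defaultdict(int)
    for a, b in happens_before_edges(events):
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(e for e in events if indeg[e] == 0)
    order = []
    while queue:
        e = queue.popleft()
        order.append(e)
        for nxt in succ[e]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return order
```

Any total order consistent with those edges is "plausible" in the sense above; bad luck in port numbering would show up here as two distinct connections getting merged.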

If you are working with some HPC framework like MPI and you guarantee that your traces represent all processes in the application then I guess that makes the problem a lot easier, but I'm not very well informed about any of these frameworks.

Keno commented 3 years ago

We've done a fair amount of this and run rr on up to maybe a few thousand processes in an HPC setting, but I don't really have a writeup or code for you. Is there something in particular you're wondering about?

rocallahan commented 3 years ago

I guess the question is, what kind of debugging experience do you provide other than just "replay individual nodes"?

Ideally I think there could be a unified debugging experience of some kind, e.g. that tracks data flow across the network.

Keno commented 3 years ago

Back when we were still using our own rr frontend (which has since bitrotted, but I'm hoping to bring back eventually), we had a feature to "ride along" RPCs and other network messages (basically, it'd dump you in the remote process at the point where the message you just sent was received). That capability was super fun.

Keno commented 3 years ago

I remembered I did start on a writeup of this kind of thing a few years ago but never finished it; here's one of the figures I had planned on using there (basic client/server example - I think maybe redis with 9 clients - don't quite remember):

[Figure: screenshot from the planned writeup (2020-10-06), showing matched messages in the client/server example]

Not sure that'll help you, but that's basically the workflow: you identify messages between the traces, and then you use that as a basis. I guess an rr-aware system might try to do more robust message tagging to make sure this is possible. In the HPC context things are a bit different, of course.

rocallahan commented 3 years ago

Did you leverage knowledge of your specific RPC system to match up the messages in that case?

Keno commented 3 years ago

The figure for the paper was the generic case: "we know which trace is from which IP", then tracking the fds from the connect syscalls, matching up read/write syscalls, and verifying that message content actually matched. (The next step would have been to try to adjust rr to not actually record any message content for read/writes between recorded processes, to keep trace size low.)

For HPC/MPI, you do basically need some coordination between the MPI system and rr, because these things usually rely on mapping the hardware directly into user space, and if you don't do that, your performance tanks completely: you fall back to some slow ethernet network or whatever that is absolutely not designed for the traffic. It seemed quite feasible to do a syscallbuf-like thing with the command queues, though. The message tagging problem might be easier there, because the hardware will generally tag it for you. I spent a few weeks on that back in 2017 for Cray's interconnect system, but never got it fully working before we concluded that project. It's something I'll probably go back to at some point.
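As a rough illustration of that fd-tracking step (the syscall record fields and helpers here are hypothetical stand-ins, not rr's trace API):

```python
# Sketch: learn fd -> peer from connect(), attribute later read()/write() data
# on that fd to the connection, then check that the two sides' byte streams
# agree. Syscall record fields are hypothetical, not rr's actual trace format.
from collections import defaultdict

def byte_streams(syscalls):
    """Per-connection [bytes_written, bytes_read] for one trace, keyed by peer.
    The connecting side learns fd -> peer from connect(); the accepting side
    would use accept()/getpeername() instead (omitted here)."""
    fd_peer = {}
    streams = defaultdict(lambda: [bytearray(), bytearray()])
    for sc in syscalls:                                   # records in trace order
        if sc.name == "connect":
            fd_peer[sc.fd] = (sc.peer_ip, sc.peer_port)
        elif sc.name in ("write", "sendto") and sc.fd in fd_peer:
            streams[fd_peer[sc.fd]][0] += sc.data
        elif sc.name in ("read", "recvfrom") and sc.fd in fd_peer:
            streams[fd_peer[sc.fd]][1] += sc.data
    return streams

def streams_match(client_conn, server_conn):
    """What one side wrote should be exactly what the other side read."""
    client_written, client_read = client_conn
    server_written, server_read = server_conn
    return client_written == server_read and server_written == client_read
```

(A matching step like this would also be the prerequisite for the trace-size optimization mentioned above.)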

deliciouslytyped commented 3 years ago

You'd leave magic like this bitrotting and hidden from the world somewhere? :P

To be clear, I didn't mean to limit "distributed rr" to the HPC context, that was just an example - not that anyone necessarily assumed so. Ideally a generalized solution to networking things would be possible. Well, really, network traffic is a side effect like any other; the question is "just" navigation convenience? (point 2)

  1. I don't know much about distributed systems or how logical clocks work, so I'm making very hopeful assumptions here and none of this may make sense - but if you'll permit me to bikeshed after thinking about it for about 5 minutes: is there anything stopping rr from adding some instrumentation to wrap packets and bundle a logical clock stamp? Once you have that, you'd index the recordings and get a global ordering? (Rough sketch below, after this list.)

Once you have that synchronization you can at least move around between debug contexts, but that by itself will leave it to you to figure out who is getting messages.

  2. For figuring out who is talking to whom, you could hash every incoming and outgoing message and build an index again, and resolve who gets messages by content (also sketched below). Naive hashing would of course result in collisions. Though I also don't know whether packet splitting and merging could mess up message boundaries? Does going higher in the network stack help there?

If technically possible it could be neat to dump pcaps.

  3. Also, I imagined this being done such that you'd aggregate the recordings on one machine and replay there - how did you do it? Is this perhaps unnecessary or limiting in some way? Well, I suppose if you have a heterogeneous machine base you'd probably need to run the replays on their source nodes...

  4. Is it possible to leave nodes out of a recording and still get a usable recording? I don't see why not. Edit: Well, I'd been assuming one would want to record all network data; re: "the next step would have been to try to adjust rr to not actually record any message content for read/writes between recorded processes to keep trace size low". The best of both seems possible if you are able to decide what is and isn't an internal communication.
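To make points 1 and 2 a bit more concrete, here's a very rough sketch (hypothetical wire format and helpers, not anything rr actually does): a Lamport-style clock stamp prepended to each message, plus a content-hash index as the tag-free fallback:

```python
# Sketch of points 1 and 2 above (hypothetical wire format, not an rr feature):
# a Lamport clock stamp prepended to each outgoing message, and a content-hash
# index as a tag-free fallback for matching messages between traces.
import hashlib
import struct

HEADER = struct.Struct("!Q")   # 8-byte big-endian logical clock stamp

class LamportClock:
    def __init__(self):
        self.time = 0

    def wrap(self, payload: bytes) -> bytes:
        """Stamp an outgoing message (send event)."""
        self.time += 1
        return HEADER.pack(self.time) + payload

    def unwrap(self, message: bytes) -> bytes:
        """Strip the stamp from an incoming message and merge clocks (receive event)."""
        (stamp,) = HEADER.unpack_from(message)
        self.time = max(self.time, stamp) + 1
        return message[HEADER.size:]

def content_index(messages):
    """Fallback matching by content: hash -> list of (trace, event_index).
    Naive hashing collides, and re-segmented byte streams break it, as noted above."""
    index = {}
    for trace, event_index, payload in messages:
        index.setdefault(hashlib.sha256(payload).hexdigest(), []).append((trace, event_index))
    return index
```

The stamps give you a partial order to index on; the content index would still need to be combined with fd/address information to tell senders from receivers.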

And I'm off to bed for today. I should probably read some of those articles I mentioned. Well, really the only thing I have queued up right now is a 2018 review article: https://superfri.org/superfri/article/view/161 (Record-and-Replay Techniques for HPC Systems: A Survey). Hardware issues and logical clocks appear to be mentioned on some level.

rocallahan commented 3 years ago

Transparently adding metadata to messages sounds difficult to me. For one thing it only really works if every participant is being recorded. Seems to me you're better off leveraging whatever data is already available, if at all possible.

deliciouslytyped commented 3 years ago

Ok, I haven't finished the review article, but the topics it touches match almost perfectly all of the points raised so far. Definitely worth a look.

To clarify my personal interest a bit, the goal would be to be able to handle highly heterogeneous systems like web servers or whatever. (It would also be neat if there were a --chaos mode for that too, somehow.) I.e. letting normal developers sanely debug their distributed stuff.

How necessary would a central coordinating server be - would one be necessary under any circumstances?

I don't know how much of the networking infrastructure we would need to be able to see, but using some kind of virtual overlay infrastructure, like a VPN that handles almost everything internally, could be interesting for getting more insight into the network?

@rocallahan good point. You'd have to be able to decide what's going to rr nodes and what isn't. Well... you could instrument the connect() and listen() calls? :P I guess to understand the needs and possibilities I'd have to have a deeper understanding here too. One could "of course" also manage this stuff out of band on a separate connection, but that sounds like it could be complicated too. The reason I'm reluctant about watching TCP connections and whatnot is that it could be unreliable? This needs substantiation, but I don't know - would stuff like NAT mess up the analysis? (Do both sides need to be able to identify a specific pipe?) And UDP is stateless anyway - I don't even know what could be the case with other layer 3 (?) protocols.

rocallahan commented 3 years ago

How necessary would a central coordinating server be - would one be necessary under any circumstances?

Doesn't seem necessary to me.

I don't know how much of the networking infrastructure we would need to be able to see, but using some kind of virtual overlay infrastructure, like a VPN that handles almost everything internally, could be interesting for getting more insight into the network?

I suppose that might be useful but I'd want to avoid that if possible. It would add a significant barrier to usage.

The reason I'm reluctant about watching TCP connections and whatnot is that it could be unreliable? This needs substantiation, but I don't know - would stuff like NAT mess up the analysis? (Do both sides need to be able to identify a specific pipe?) And UDP is stateless anyway - I don't even know what could be the case with other layer 3 (?) protocols.

Yes, NAT would be a problem. UDP is stateless but you get sender and receiver IP address + port so you should be able to match things up ... as long as addresses don't get rewritten, as in NAT.

rocallahan commented 3 years ago

I guess for UDP you get packet reordering, so matching things up would require capturing packets and comparing their contents. Seems like that would work though.
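Something like this, perhaps (hypothetical datagram records, just to illustrate content matching under reordering):

```python
# Sketch: match UDP datagrams between a sender trace and a receiver trace by
# content, tolerating reordering, by treating each side as a multiset of
# payloads. The payload lists are hypothetical extracts of sendto()/recvfrom().
from collections import Counter

def match_udp(sent_payloads, received_payloads):
    """Return (delivered, missing) as multisets of payload bytes."""
    sent = Counter(sent_payloads)
    received = Counter(received_payloads)
    delivered = sent & received    # multiset intersection: matched regardless of order
    missing = sent - received      # sent but never seen by the receiver (dropped?)
    return delivered, missing
```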