lukego opened 7 years ago
Very nice write-up! AFAIK, branch prediction and parallelism are not things TCP itself can do at the protocol level. QUIC (borrowed from HTTP/2?) does have multiple UDP streams underneath a single connection, which may provide what you are looking for wrt issuing multiple work requests. But if you are looking for "single source serving multiple targets", then traditional sockets and the way we do "multiple clients connecting to a single server" may provide that. (from someone who knows very little about CPUs but does know a little bit about protocols. :-))
As a network engineer I appreciate the analogy! Now I know more about computer architecture than I did moments ago.
This is also at a higher layer than TCP, but I can see an analogy between branch prediction and HTTP/2 server push. Without branch prediction/server push, the CPU/client has to wait for the first instruction/request to finish before it knows what to execute/request next. With it, the CPU/client already has the next instruction/response when it needs it.
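The latency saving from this analogy can be sketched in a few lines. This is a toy timing model, not real HTTP/2 or CPU behavior; the round-trip cost and function names are made up for illustration:

```python
# Hypothetical timing sketch: without prediction/push, each fetch waits a
# full round trip, because the next request is only known after the
# previous response arrives; with push/prediction, the dependent
# responses ride along with the first one.
RTT = 100  # assumed round-trip latency, arbitrary units

def fetch_sequential(n):
    """Client/CPU discovers each next request only after the previous finishes."""
    return n * RTT  # n dependent round trips

def fetch_with_push(n):
    """Server push / branch prediction: items 2..n arrive with the first."""
    return RTT  # one round trip covers the whole predicted chain

print(fetch_sequential(3))  # 300
print(fetch_with_push(3))   # 100
```

Of course a misprediction (or a pushed resource the client never needed) wastes the speculative work, which is the cost side of the same trade-off.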
Multiple execution engines are somewhat similar to multipath routing or (at a smaller scale) multiple rx/tx queues in network cards?
I added hyperthreads which I think is a good fit.
Interesting that people are commenting and also finding analogies to more protocols than just TCP :).
I also reckon there is an analogy between packet size and micro-ops. Some packets are bigger than others (e.g. jumbo frames), and require more time to serialize over the link, while some instructions are also bigger than others (e.g. a rep movsb memory copy), and require more cycles to execute all of their micro-ops.
On execution ports I am tempted to draw an analogy to the TCP congestion window (cwnd). Haswell has eight execution ports, so it can actively compute eight instructions (uops) at the same time, and so we can say it has a hardcoded cwnd=8. However, the original formulation above is a bit lacking by not mentioning the cwnd and making a distinction between packets/instructions that are actively progressing vs ones that are just waiting in the transmit-queue/reservation-station.
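The cwnd=8 idea above can be sketched as a dispatch loop. This is a toy model, not how a real scheduler works; the names and the queue-of-integers representation are made up for illustration:

```python
from collections import deque

CWND = 8  # Haswell's eight execution ports, read as a hardcoded congestion window

def dispatch(queue, in_flight):
    """Move waiting uops/packets into flight, respecting the cwnd limit.

    `queue` plays the reservation-station / transmit-queue role;
    `in_flight` is the set of items actively executing / on the wire.
    """
    started = []
    while queue and len(in_flight) < CWND:
        item = queue.popleft()
        in_flight.add(item)
        started.append(item)
    return started

pending = deque(range(12))        # twelve ready uops/packets
active = set()
print(dispatch(pending, active))  # [0, 1, 2, 3, 4, 5, 6, 7] -- ports saturated
print(len(pending))               # 4 -- the rest wait in the queue
```

This makes the waiting/active distinction explicit: only `in_flight` items consume "ports", everything else sits queued until a slot frees up.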
Plenty to think about :).
Nice write-up!
Is there any "thing" like a Re-Order Buffer in TCP? Perhaps TCP sequence numbers?
> Is there any "thing" like a Re-Order Buffer in TCP?
Yep!
The TCP packet transmission queue seems to be like a Re-Order Buffer. TCP always frees packet buffers in FIFO sequential order, even if the packets may be delivered out of order due to loss/retransmission or reordering along the path. Modern TCPs will track a bunch of state for each packet in the queue e.g. do we believe that it is currently in flight / do we believe it's delivered because we got a preliminary Selective Acknowledgement / do we believe it's lost due to packets later in the sequence being acknowledged first.
(One major theme in TCP is needing to maintain heuristic estimates of which packets are in-flight/delivered/lost to accelerate decisions about when to transmit/retransmit a new packet. Sending too much wastes bandwidth by sending data that does not benefit the recipient e.g. because it is duplicate or because it is dropped at a bottleneck en route. Sending too little data wastes bandwidth due to idleness or causes starvation when competing with more aggressive TCP implementations on other hosts. This requires a bunch of state and clever algorithms that may well have direct parallels deeper in the CPU.)
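The in-order freeing described above can be sketched with a minimal model. This is an illustration of the idea, not real TCP code; the state names and functions are made up, with SACK standing in for out-of-order completion:

```python
from collections import OrderedDict

def make_queue(seqs):
    # seq -> per-packet state: 'in_flight' or 'sacked' (selectively acked)
    return OrderedDict((s, 'in_flight') for s in seqs)

def sack(q, seq):
    q[seq] = 'sacked'  # delivered out of order, but its buffer is not freed yet

def retire(q):
    """Free buffers strictly from the head, like a ROB retiring in order."""
    freed = []
    while q and next(iter(q.values())) == 'sacked':
        seq, _ = q.popitem(last=False)
        freed.append(seq)
    return freed

q = make_queue([1, 2, 3, 4])
sack(q, 2); sack(q, 3)   # 2 and 3 delivered early
print(retire(q))         # [] -- head (1) still in flight, nothing can be freed
sack(q, 1)
print(retire(q))         # [1, 2, 3] -- the frees catch up, in FIFO order
```

That head-of-line blocking on packet 1 is exactly the ROB-like behavior: completion can happen out of order, but retirement/freeing cannot.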
Have to think about Reservation Station / Re-Order Buffer distinction a bit more carefully.
@hirenp I wonder if this whole model could be expressed in terms of HTTP instead of TCP too :).
> Have to think about Reservation Station / Re-Order Buffer distinction a bit more carefully.
Yes, it's kind of tricky. And besides its re-ordering role, the ROB's capacity is also the factor that limits how far ahead the processor can seek for instructions to fetch.
@lukego Look at QUIC. It has some really nifty features. That will probably be a better fit but I am not sure as I don't know much about CPUs. :-)
Here is a dramatic conclusion from an excellent recent paper called Array layouts for comparison based searching by @pkhuong and Pat Morin:
> Our conclusion is contrary to our expectations and to previous findings and goes against conventional wisdom, which states that accesses to RAM are slow, and should be minimized. A more accurate statement, that accounts for our findings, is that accesses to RAM have high latency and this latency needs to be mitigated.
This makes intuitive sense if you think like a network engineer: your connection to RAM is a "long fat pipe."
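You can even run the classic long-fat-pipe calculation (bandwidth-delay product / Little's law) against RAM. The numbers below are rough assumptions for illustration, not measurements of any particular CPU:

```python
# Back-of-the-envelope bandwidth-delay product for the CPU<->RAM "pipe":
latency_s  = 100e-9   # ~100 ns memory latency (assumption)
bandwidth  = 25e9     # ~25 GB/s per core (assumption)
line_bytes = 64       # common cache-line size

# Little's law / BDP: bytes that must be "in flight" to keep the pipe full.
bdp_bytes = latency_s * bandwidth
lines_in_flight = bdp_bytes / line_bytes

print(bdp_bytes)        # 2500.0 bytes
print(lines_in_flight)  # 39.0625 -- dozens of outstanding cache misses needed
```

With numbers in that ballpark, a single stream of dependent loads leaves the pipe almost empty, which is why latency mitigation (prefetching, memory-level parallelism) matters more than minimizing access counts.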
If you are going to recommend Hennessy and Patterson then you should recommend Computer Organization and Design rather than Computer Architecture. Otherwise that would be like recommending Radia Perlman before TCP/IP Network Administration.
Nice! This is helpful for those more versed in computer architecture than TCP as well.
One nit: Hyperthreading is just Intel marketing-speak for SMT (simultaneous multithreading), so hardware multithreading would be the actual jargon term you mean.
From a protocol perspective, I guess you would want to compare network protocols with cache-coherency protocols, since they have many similarities. CPUs focus on execution and networks focus on transmission, so beyond that there is not much to compare.
Great read, let's have a beer! It's the weekend!
Awesome work!
Very interesting analogy! Should we expand TCP/CPU to UDP/GPU or the like?
This is a rough idea for a talk/tutorial. Critique welcome :).
Suppose you are a network engineer and you want to understand how modern x86 CPUs work under the hood. Cache-misses, out-of-order execution, pipelined execution, etc. One approach is to read a big heavy book like Hennessy and Patterson. However, there is also a short-cut.
CPUs are basically networks these days (#15) and their mechanisms all have direct analogues in TCP. In fact, if you have spent time troubleshooting TCP performance problems in Wireshark, it's entirely likely that you have a more visceral intuition for CPU performance than most software people do.
Here is why CPUs are basically equivalent to TCP senders:
What do you think?
Have an idea for good analogs of branch prediction and dispatching instructions across multiple execution units?