ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/tcp: Reduce protocol overhead by introducing protocol v4 #6864

Open shefty opened 3 years ago

shefty commented 3 years ago

The tcp protocol overhead, particularly when mixed with rxm, can be reduced to increase the message rate for small messages. The tcp provider would still need to support the existing protocol for compatibility, so some conversion function would be needed. For example, the current bswap_hdr() call could be replaced with a new convert_hdr() call that handles both byte swapping and converting between v4 and v3. Unfortunately, the existing CM protocol fails requests for unknown versions, so there is no easy fallback mechanism: we can't ask for v4, have the peer reply that it only knows v3, and then fall back gracefully. It's unlikely that apps check the protocol field in the ep_attr, so an environment variable may be needed to select whether v3 or v4 is preferred.
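A minimal sketch of the environment-variable idea. The variable name `TCPX_PROTO_VERSION` is hypothetical (not an existing libfabric setting), and plain `getenv()` is used here instead of the provider's own parameter mechanism. Defaulting to v3 when the variable is unset or unrecognized is an assumption of the sketch, chosen so that existing peers keep working.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name, used for illustration only. */
#define TCPX_PROTO_ENV "TCPX_PROTO_VERSION"

enum {
    TCPX_PROTO_V3 = 3,
    TCPX_PROTO_V4 = 4,
};

/* Return the protocol version to request during the CM exchange.
 * Unset or unrecognized values select v3; that default is an
 * assumption of this sketch, not a decision from the proposal. */
static int tcpx_preferred_version(void)
{
    const char *val = getenv(TCPX_PROTO_ENV);

    if (val && !strcmp(val, "4"))
        return TCPX_PROTO_V4;
    return TCPX_PROTO_V3;
}
```

Because the peer cannot negotiate down, both sides of a connection would have to be configured with the same value.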

These are proposed changes for an updated protocol (from a patch that is in progress, but far from ready):

enum {
    TCPX_CM_OP_CONNECT,
    TCPX_CM_OP_ACCEPT,
    TCPX_CM_OP_REJECT,
};

/* version must be first for compatibility - align with ofi_ctrl_hdr */
struct tcpx_cm_msg_v4 {
    uint8_t version; /* version is set during CM exchange only */
    uint8_t op;
    uint8_t flags;
    uint8_t size;
    uint16_t endian;
    uint16_t error;
    uint8_t data[UINT8_MAX];
};

/* op field:
 * bits 7:6 - reserved
 * bits 5:4 - hdr type (short, standard, extended)
 * bits 3:0 - opcode
 */

/* values limited to 2 bits */
enum {
    TCPX_HDR_SHORT,
    TCPX_HDR_STD,
    TCPX_HDR_EXT,
    /* 1 reserved value available */
    TCPX_HDR_MAX,
};

/* implementation comment:
 * RTS/CTS - ready to send, clear to send, used for rendezvous
 */
/* values limited to 4 bits */
enum {
    TCPX_ACK,
    TCPX_OP_MSG,
    TCPX_OP_MSG_RTS,
    TCPX_OP_MSG_CTS,
    TCPX_OP_MSG_WRITE,
    TCPX_OP_TAG,
    TCPX_OP_TAG_RTS,
    TCPX_OP_TAG_CTS,
    TCPX_OP_TAG_WRITE,
    TCPX_OP_WRITE,
    TCPX_OP_READ_REQ,
    TCPX_OP_READ_RESP,
    TCPX_OP_MAX,
};

static inline uint8_t tcpx_get_hdr_type(uint8_t op)
{
    return op >> 4;
}

static inline void tcpx_set_hdr_op(uint8_t *op, uint8_t type, uint8_t opcode)
{
    assert((type < TCPX_HDR_MAX) && (opcode < TCPX_OP_MAX));
    *op = (type << 4) | opcode;
}
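A receiver also needs the opcode back out of the packed byte. The sketch below round-trips the op field using a hypothetical `tcpx_get_hdr_opcode()` helper that is not part of the proposal. It also masks off the reserved bits 7:6 when reading the type, which is a defensive choice assumed here rather than something the proposal requires. The enums are trimmed copies with the same values as above.

```c
#include <assert.h>
#include <stdint.h>

/* Trimmed copies of the proposal's enums (same numeric values). */
enum { TCPX_HDR_SHORT, TCPX_HDR_STD, TCPX_HDR_EXT, TCPX_HDR_MAX };
enum { TCPX_ACK, TCPX_OP_MSG, TCPX_OP_TAG = 5, TCPX_OP_READ_RESP = 11,
       TCPX_OP_MAX };

/* Type lives in bits 5:4; ignore the reserved bits 7:6. */
static inline uint8_t tcpx_get_hdr_type(uint8_t op)
{
    return (op >> 4) & 0x3;
}

/* Hypothetical helper: opcode lives in bits 3:0. */
static inline uint8_t tcpx_get_hdr_opcode(uint8_t op)
{
    return op & 0xf;
}

static inline void tcpx_set_hdr_op(uint8_t *op, uint8_t type, uint8_t opcode)
{
    assert((type < TCPX_HDR_MAX) && (opcode < TCPX_OP_MAX));
    *op = (type << 4) | opcode;
}
```

For example, packing `TCPX_HDR_STD` with `TCPX_OP_TAG` yields the byte `0x15`, and both getters recover the original values from it.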

/* flags - 8 bits available */
enum {
    TCPX_ACK_REQ = BIT(0),
    TCPX_CQ_DATA = BIT(1),
};

/* header layout:
 * short, standard, or extended header
 * u64 cq data: if CQ_DATA flag set
 * u64 tag: if op = OP_TAG
 * ofi_rma_iov: if op = WRITE or op = READ_REQ
 */

struct tcpx_short_hdr {
    uint8_t op;
    uint8_t flags;
    uint16_t size;
};

struct tcpx_std_hdr {
    uint8_t op;
    uint8_t flags;
    uint8_t hdr_size;
    union {
        uint8_t rsvd;
        uint8_t id; /* debug */
    };
    uint32_t size;
};

struct tcpx_ext_hdr {
    uint8_t op;
    uint8_t flags;
    uint8_t hdr_size;
    union {
        uint8_t rsvd;
        uint8_t id; /* debug */
    };
    uint32_t resv; /* alignment */
    uint64_t size;
};

The base header sizes are 4, 8, and 16 bytes.
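As a sanity check on those sizes, and to illustrate how the optional fields from the layout comment would add up, here is a sketch with several assumptions: the `size` field is taken to be the message length, the sender picks the smallest header whose `size` field fits, the tag is present only for `TCPX_OP_TAG`, and the RMA descriptor is present for `TCPX_OP_WRITE` and `TCPX_OP_READ_REQ`. `demo_rma_iov` is a stand-in for `ofi_rma_iov` with an assumed three-u64 layout, and `tcpx_pick_hdr_type()` and `tcpx_hdr_len()` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

enum { TCPX_HDR_SHORT, TCPX_HDR_STD, TCPX_HDR_EXT };
enum { TCPX_OP_TAG = 5, TCPX_OP_WRITE = 9, TCPX_OP_READ_REQ = 10 };
enum { TCPX_ACK_REQ = 1 << 0, TCPX_CQ_DATA = 1 << 1 };

struct tcpx_short_hdr { uint8_t op, flags; uint16_t size; };
struct tcpx_std_hdr { uint8_t op, flags, hdr_size, rsvd; uint32_t size; };
struct tcpx_ext_hdr { uint8_t op, flags, hdr_size, rsvd; uint32_t resv;
                      uint64_t size; };

/* Stand-in for ofi_rma_iov; the field layout is an assumption. */
struct demo_rma_iov { uint64_t addr, len, key; };

/* The compiler confirms the 4/8/16 byte base sizes. */
_Static_assert(sizeof(struct tcpx_short_hdr) == 4, "short hdr");
_Static_assert(sizeof(struct tcpx_std_hdr) == 8, "std hdr");
_Static_assert(sizeof(struct tcpx_ext_hdr) == 16, "ext hdr");

/* Pick the smallest header type whose size field can hold msg_len. */
static uint8_t tcpx_pick_hdr_type(uint64_t msg_len)
{
    if (msg_len <= UINT16_MAX)
        return TCPX_HDR_SHORT;
    if (msg_len <= UINT32_MAX)
        return TCPX_HDR_STD;
    return TCPX_HDR_EXT;
}

/* Base header plus the optional fields from the layout comment. */
static size_t tcpx_hdr_len(uint8_t type, uint8_t opcode, uint8_t flags)
{
    static const size_t base[] = {
        sizeof(struct tcpx_short_hdr),
        sizeof(struct tcpx_std_hdr),
        sizeof(struct tcpx_ext_hdr),
    };
    size_t len = base[type];

    if (flags & TCPX_CQ_DATA)
        len += sizeof(uint64_t);
    if (opcode == TCPX_OP_TAG)
        len += sizeof(uint64_t);
    if (opcode == TCPX_OP_WRITE || opcode == TCPX_OP_READ_REQ)
        len += sizeof(struct demo_rma_iov);
    return len;
}
```

Under these assumptions, a small tagged send with CQ data needs 4 + 8 + 8 = 20 bytes of header.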

tschuett commented 3 years ago

Would it help to create a new provider, prov/tcp2, that is strictly incompatible with prov/tcp? Eventually you could deprecate the old one and free yourself from the version-management problems. Or call it tcpng and keep it unstable until you are happy with the protocol.

shefty commented 3 years ago

I've considered that, but copy-paste-modify would almost certainly result in increased maintenance costs. The code overlap is too high. Plus, the existing protocol will likely need to be supported for years, similar to how we've never been able to completely rid ourselves of the sockets provider.

My intent is that the code be updated such that only the newest protocol is handled, and that the conversion to the v3 protocol is done only when reading or writing to the socket. That would result in a slight overhead when using v3, but also means that optimizations made to the code would be usable with either version. For many apps, switching to only the latest protocol would be trivial. But there are some client-server apps (DAOS, DDN) where it would be more challenging.
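A rough sketch of what converting only at the socket boundary could look like on the send side. Everything here is illustrative: `demo_v3_hdr` is a made-up layout and NOT the real v3 header, `tcpx_hdr_to_wire()` is a hypothetical helper, the opcode values are assumed to map one-to-one between versions, and byte swapping is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Trimmed copy of the proposed v4 standard header. */
struct tcpx_std_hdr {
    uint8_t op;
    uint8_t flags;
    uint8_t hdr_size;
    uint8_t rsvd;
    uint32_t size;
};

/* Made-up v3-style header for illustration; NOT the real v3 layout. */
struct demo_v3_hdr {
    uint8_t version;
    uint8_t op;
    uint16_t flags;
    uint32_t rsvd;
    uint64_t size;
};

/* Serialize a header for the peer and return the bytes written.
 * The rest of the provider only ever builds v4 headers; the v3 form
 * exists only in the wire buffer. */
static size_t tcpx_hdr_to_wire(int peer_version,
                               const struct tcpx_std_hdr *hdr, void *wire)
{
    struct demo_v3_hdr v3;

    if (peer_version >= 4) {
        memcpy(wire, hdr, sizeof(*hdr));
        return sizeof(*hdr);
    }

    memset(&v3, 0, sizeof(v3));
    v3.version = 3;
    v3.op = hdr->op & 0xf;    /* assumes identical opcode values */
    v3.flags = hdr->flags;
    v3.size = hdr->size;
    memcpy(wire, &v3, sizeof(v3));
    return sizeof(v3);
}
```

The receive side would do the reverse before handing the header to the common code, so the extra copy is paid only by v3 connections.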

My biggest concern is that the original CM protocol was not forward-looking, so I can't see a way to keep the transition hidden from the app in all situations.