sustrik / dsock


[Discussion] Design of dsock and future libdill #5

Open raedwulf opened 8 years ago

raedwulf commented 8 years ago

Hello Martin!

I'm back from my holiday, so I have time (lots of time) to work on libmill/dill related stuff.

While thinking about the layering and composability of protocols, I sketched out some tables which I think would be useful to keep in mind for API design.

dsock/dplumbing? API

Pipeline Object

The API will orient itself around a new object called a pipeline. A pipeline object is a series of composed protocols that together describe the end-to-end communication between a server and a client.

Based on the OSI model, we have up to 4 different layers in the pipeline. For example:

| Layer 4 | Layer 5       | Layer 6 | Layer 7    |
|---------|---------------|---------|------------|
| TCP/IP  | TCP/IP        | -       | HTTP/1.1   |
| TCP/IP  | TCP/IP        | TLS     | HTTP/2     |
| UDP     | DCCP over UDP | DTLS    | Custom App |

This is the basic form. All layers are optional; see Multiplexing. Layer 7 is a special case: the `pipeline` structure can have more than one Layer 7 protocol for horizontal composability (protocol switches), for example the familiar upgrade from HTTP/1.1 to HTTP/2.

Micro-protocol API

Layer 4, Layer 5, and Layer 6 protocols, when composed, have a similar API. Layer 6 dictates the exact API that Layer 7 has to implement, which will take one of two forms (a C sketch follows the two tables):

Connection-based

| Protocol     | Functions (server / client) |
|--------------|-----------------------------|
| Establish    | listen / connect            |
| Send         | send                        |
| Receive      | recv                        |
| Disestablish | close                       |

Connection-less

| Protocol | Functions (server / client) |
|----------|-----------------------------|
| Bind     | listen / not applicable     |
| Send     | send                        |
| Receive  | recv                        |
| Unbind   | close / not applicable      |
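
For illustration, the connection-based form could be expressed as a vtable along the following lines. This is a rough C sketch; the names are mine, not proposed dsock identifiers.

/* Hypothetical shape of a connection-based micro-protocol. */
struct conn_protocol {
    int (*listen)(const char *addr, int backlog);       /* establish (server) */
    int (*connect)(const char *addr, int64_t deadline); /* establish (client) */
    int (*send)(int s, const void *buf, size_t len, int64_t deadline);
    int (*recv)(int s, void *buf, size_t len, int64_t deadline);
    int (*close)(int s);                                /* disestablish */
};

The connection-less form has the same shape minus connect, with listen playing the role of bind.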

Capabilities of Micro-Protocols

Each micro-protocol will require a number of capabilities from the previous protocol in the pipeline and exhibit a set of new capabilities to the next protocol in the pipeline. A protocol may change some of the capabilities, e.g. connection-less to connection-based, stream to datagram, unreliable to reliable.

| Type   | Conn-less | Conn-based | Datagram | Stream | Reliable | Domain | Endpoint |
|--------|-----------|------------|----------|--------|----------|--------|----------|
| Unix   |           |            |          |        |          | IPC    | Slp+     |
| Pipes  |           |            |          |        |          | IPC    | Sl       |
| UDP    |           |            |          |        |          | Net IP | Slp+     |
| TCP/IP |           |            |          |        |          | Net IP | Slp+     |
| Mq     |           |            |          |        |          | IPC    | Slp+     |
| Shm    | *         | *          | *        | *      | *        | IPC    | *        |

The endpoint field refers to the type of endpoint, i.e. what it can implement/provide:

IPCs don't really fit into the connection-based or connection-less paradigm. There is an address that multiple writers/producers can write to; this is what I mean when I say they are connection-less. However, it is reasonably trivial to implement a connection-based system over these IPC primitives as well.

Shm - Shared Memory

Custom shared memory implementations for cycle squeezing. In theory, shared memory can implement all of the other protocols to varying degrees. Some implementations are available here. Some results are here.

One benefit of using shared memory is that multicasting can be implemented with very low overhead to other processes.

Multiplexing and IP/IPC conversion

As a special Layer 5/6/7 protocol, multiplexed protocols can be muxed or demuxed and connected to other pipeline objects. This allows composition across Net IP and IPC protocols.

TODO

I'll update this document as discussion follows. This isn't complete... needed lunch.

sustrik commented 8 years ago

Let me take a step back.

One thing that has been troubling me over the years is the lack of progress in the network protocol space. When you look closely at what's going on, it seems that only big companies are capable of implementing new protocols. There may be many reasons why random hobby developers aren't developing new protocols the same way they are developing new JavaScript libraries, but one of them is definitely the high cost of developing a network protocol. libmill/dill is ultimately meant to address that problem.

The end goal is thus to allow random Joe developer to hack for a day and come up with a new protocol implementation that's relatively bug free and has relatively good performance.

There are two subgoals required to achieve the above:

  1. Imperative programming style (i.e. no state machines, no callbacks, etc.). In other words, the user code should look something like: "read 2 bytes, if x then read 8 bytes, send 24 bytes" and so on. Joe should definitely not implement scheduling ("read 10 bytes; if fewer than 10 are read, store the state and switch to a different task; in the meantime listen for more data to arrive"). See the sketch at the end of this comment.
  2. Composability, i.e. don't require everyone to write full-scale fat protocols. Allow existing ones to be reused, with little pieces added as needed. This requires much more fine-grained layering than the standard OSI model. You may want to write a mini-protocol that does just a version-number handshake. Or one that does bandwidth throttling. A heartbeat protocol. You may want to implement your newly invented congestion control algorithm without caring about other aspects of the protocol. And so on and so on.

W.r.t. composability, it should be noted that two different kinds are needed:

  1. Vertical composability: WebSockets live on top of TCP, which lives on top of IP, which lives on top of Ethernet etc.
  2. Horizontal composability: SSL handshake is followed by capability-handshake-protocol, followed by actual message exchange, followed by connection-termination protocol.
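
As a concrete illustration of the first subgoal, here's a minimal sketch of what such imperative protocol code looks like with dsock-style bsend/brecv (the 2/8/24-byte exchange is invented for the example):

uint8_t ver[2];
if(brecv(s, ver, sizeof(ver), -1) != 0) return -1;     /* read 2 bytes */
if(ver[0] == 2) {                                      /* if x... */
    uint8_t ext[8];
    if(brecv(s, ext, sizeof(ext), -1) != 0) return -1; /* ...read 8 more bytes */
}
uint8_t reply[24] = {0};
if(bsend(s, reply, sizeof(reply), -1) != 0) return -1; /* send 24 bytes */
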
sustrik commented 8 years ago

Now let's have a look at the API design. Some thoughts, in no particular order:

  1. Application protocols are of no interest to the API design. They use the API of the underlying protocol but don't expose such a "transport" API of their own. One catch here is that protocols such as SMTP or FTP are often referred to as app-level, but in fact they are transport protocols. The true application protocols are things like NTP, BGP, SIP and such.
  2. Protocol instances should be represented by file descriptors or, given that POSIX provides no way to create full-fledged user-space fds, by libdill handles (that's why I added handles to libdill in the first place). This provides a way to deal with protocols in a virtualised manner.
  3. Setup and teardown of the communication is highly divergent among protocols. I believe we shouldn't even try to provide a unified API for that. Each protocol has its own setup/teardown API, but all share the API for passing the data. This is how it's implemented in dsock.
  4. For passing the data we want a single API, so that protocols are like lego pieces that can easily be stacked one on top of another. However, having made multiple attempts at implementing that, I now believe we need two different APIs: one for bytestream-oriented transports, another for message-oriented transports. The two APIs are deliberately similar, but the semantic distinction between preserving message boundaries and not preserving them is so big that trying to hide it makes no sense. There are other distinctions (reliable vs. non-reliable, ordered vs. non-ordered), but those typically only make sense in the context of message-based transports. Some thinking is needed on whether and how to handle those distinctions. See bsock/msock in the dsock library.
  5. That being said, some protocols have additional metadata attached to individual messages ("ancillary data" in BSD socket parlance), which, of course, breaks the uniformity. I believe this can be solved by providing protocol-specific send/recv functions with all the additional stuff, while also implementing the unified send/recv functions; the latter would use defaults to fill in the ancillary data. See how the UDP API looks in dsock.
  6. Send and recv functions should be atomic. If you ask for 10 bytes you should either get 10 bytes or nothing. Same for sending. Note that this is NOT how the current version of dsock works. There are 3 cases where this becomes a problem: a) broken connection, b) coroutine cancellation, c) timeout. In the first case there's nothing we can do: we should return an error and close the socket. Cancellation and timeout are trickier, but after multiple attempts to implement them I believe we should not strive to keep the socket open (and in a consistent state) in these cases. We should handle them as connection breakage that just happened to occur on the local side. In other words, the socket should be automatically closed. (A sketch follows this list.)
  7. Horizontal composability requires the ability to close a protocol without closing the underlying protocol, then re-attach the underlying protocol to a new higher-level protocol. For example, start with SSL-over-TCP, then detach SSL, keep TCP, then re-attach TCP to SMTP. This really means that protocols can't do read-ahead, as there's no way to push the read-ahead data back to the socket when detachment is done. And that in turn conflicts with many existing protocols that do require read-ahead. The easiest way to deal with that is to simply document that such protocols are not horizontally composable and be done with it.
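
To make point 6 concrete, here is what the atomic semantics look like from the user's side (a sketch with dsock-style brecv; error handling simplified):

uint8_t hdr[10];
if(brecv(s, hdr, sizeof(hdr), deadline) != 0) {
    /* Broken connection, cancellation or timeout: all three are treated as
       connection breakage, so the socket is no longer usable; just release
       the handle. */
    hclose(s);
    return -1;
}
/* Success: all 10 bytes arrived; there is no partial-read state to track. */
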
raedwulf commented 8 years ago

Thanks for the clarifications - I did overlook a number of things in my idealised protocol world. I was on roughly the same wavelength, but it seems I assumed that composition would be helped by some degree of uniformity.

  1. Yes, this was bothering me at the start as I was trying to make a one-size-fits-all API.
  2. I agree.
  3. Yes, it is highly divergent, but I think there needs to be at least some consistency in error handling, and maybe guidance on how setup should be performed. I can't inline code without breaking the numbering, so I've put it into another section.
  4. I completely agree here. Although when you start stacking protocols, there is always the "conversion" from a byte-oriented to a message-oriented API via framing, like sframes.
  5. I think my thought patterns were stuck on the lower-down protocols and I didn't think of FTP and SMTP as higher-level transport protocols. Those would be the ones with substantial metadata/ancillary data that is crucial to the protocol's operation. I think this gives us two classes of metadata: protocol-specific error conditions, e.g. SSL/TLS error conditions (which do not map cleanly to errno codes), and protocol-specific data, e.g. protocol control switching (FTP commands, data?).
  6. That makes sense, consistent behaviour is good.
  7. This is something I hadn't fully thought through, and it would definitely play hell with a lot of the API uniformity.

Thanks for the feedback! Hopefully that synchronises our wavelengths a bit better.

Number 3: Protocol Setup

Although I posted that I think an SSL/TLS implementation in libmill is probably not the best idea, I think my code could potentially be useful for dsock in the future.

The issue I encountered with my TLS implementation was that the API initially looked like this:

MILL_EXPORT struct mill_tlssock *mill_tlslisten_(
    struct mill_ipaddr addr,
    const char *cafile, const char *capath,
    void *camem, size_t calen,
    const char *certfile,
    void *certmem, size_t certlen,
    const char *keyfile, void *keymem, size_t keylen,
    const char *password,
    int backlog);

This was quite burdensome, as many fields were optional.

I ended up having a new setup-context structure:

MILL_EXPORT struct mill_tlsctx *mill_tlsserver_(uint32_t flags);
MILL_EXPORT struct mill_tlsctx *mill_tlsclient_(uint32_t flags);
MILL_EXPORT int mill_tlscafile_(struct mill_tlsctx *c, const char *file, const char *path);
MILL_EXPORT int mill_tlscamem_(struct mill_tlsctx *c, void *mem, size_t len);
MILL_EXPORT int mill_tlscertfile_(struct mill_tlsctx *c, const char *file);
MILL_EXPORT int mill_tlscertmem_(struct mill_tlsctx *c, void *mem, size_t len);
MILL_EXPORT int mill_tlskeyfile_(struct mill_tlsctx *c, const char *file, const char *password);
MILL_EXPORT int mill_tlskeymem_(struct mill_tlsctx *c, void *mem, size_t len, const char *password);
MILL_EXPORT const char *mill_tlserror_(struct mill_tlsctx *c);
MILL_EXPORT void mill_tlsfreectx_(struct mill_tlsctx *c);

I was wondering whether there is a more uniform way to do this. For instance, maybe providing a structure-based setup, with each relevant transport having a custom setup struct rather than a custom set of setup functions:

struct mill_tls_config {
    const char *ca_file;
    const char *ca_path;
    void *ca_mem;
    size_t ca_len;
    const char *cert_file;
    void *cert_mem;
    size_t cert_len;
    const char *key_file;
    void *key_mem;
    size_t key_len;
    const char *key_password;
    uint32_t flags;
};
MILL_EXPORT struct mill_tlssock *mill_tlslisten_(struct mill_tls_config *c, struct mill_ipaddr addr, int backlog);

The structure mill_tls_config is large, so passing it by value would be very non-standard. Here, mill_tlslisten_ would copy the mill_tls_config fields or hand the values off directly to the internal implementation to be processed. EDIT: I just noticed that libdill, unlike libmill, does something similar with ipaddr, but it uses ipremote as a constructor? Would that be the recommended way in libdill/dsock-based protocols? Could there be multiple constructor functions for the structure?
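
With C99 designated initializers, the config-struct approach stays readable even though most fields are optional. A sketch of how the proposed mill_tlslisten_ might be called (the file names and addr are placeholders):

struct mill_tls_config cfg = {
    .cert_file = "server.crt",
    .key_file = "server.key",
    /* all fields not mentioned are implicitly zero/NULL, i.e. "not supplied" */
};
/* addr assumed to have been obtained earlier, e.g. from an ipaddr constructor */
struct mill_tlssock *ls = mill_tlslisten_(&cfg, addr, 10);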

Ideal Protocol World Redux

This makes me think dsock should just deal with the micro-protocols and not the actual composition of the protocols (apart from some utility functions).

I am still wondering whether it's possible to make a very tidy API that composes these protocols in a separate, optional library on top, e.g. dplumbing. This is for the Joe who isn't interested in implementing his own protocol, but who wants to stick two or more protocols together with minimal API contact with the lower-level transport protocols, leaving only the high-level FTP/SMTP/HTTP protocols to worry about.

In terms of composability, would it be correct to think that vertical composability is possible and uniform until it reaches a protocol with protocol-specific ancillary data (not including error conditions, and under the constraint that the protocol types are compatible)?

In other words, composability is viewed from the on-the-wire standpoint. From the protocol standpoint, vertical composability has all the protocols active at any one point. Horizontal composability means that the horizontally-composed protocols can only be switched between one another. The only exception is multiplexing, which is a special case of horizontal composability that behaves like vertical composability. A diagram would probably be really useful right now...

I just remembered what was bugging me. The current SSL and wsock implementations in libmill wrap the underlying protocols, so you can't choose what protocol is used underneath. I was wondering whether it's possible to parameterise the underlying protocol, so you could have SSL over unix sockets, pipes, etc. without having to substantially change the SSL implementation or the other protocol implementations. Is this what the attach and detach functions are for?

sustrik commented 8 years ago

Ok, an example is better than words. Here's my idea of how to do vertically-composed connection setup:

int h1 = tcp(...); // open tcp connection
int h2 = throttler(h1, ...); // limit the throughput to say 100kB/s
int h3 = crlf(h2, ...); // split lines (turns bytes into messages)
int h4 = multiplexer(h3, ...); // allows for multiple channels within the connection
int h5 = encryptor(h4, ...); // encrypts the communication
// use h5 to send/recv now

Connection tear-down would be done in a similar way but in the opposite direction:

int h4 = encryptor_detach(h5);
int h3 = multiplexer_detach(h4);
...

Horizontal composability:

int h1 = tcp(...);
int h2 = wsock(h1, ...);
// exchange of wsock messages
h1 = wsock_detach(h2); // reuse h1; re-declaring it would not compile
int h3 = ssl(h1, ...);
// send/recv data
h1 = ssl_detach(h3);
tcp_close(h1);
sustrik commented 8 years ago

Some minor comments:

  1. Consistent handling of socket setup is hard. Try to come up with a common API for TCP (connected), UDP (not connected), SSL (a large number of setup options), SCTP (multihoming) and PGM (multicast). But as I already mentioned, I don't believe we really need a uniform API; see the example of vertical composability above.
  2. Messages vs. bytes: it's not like it's always a switch from bytestream to messages. Consider IP (messages) -> TCP (bytestream) -> wsock (messages).
  3. Error codes: for composability you want a small number of error codes that the user can easily check and handle. For debugging you want the most detailed error info possible. The two goals contradict each other. Maybe solve the problems of error handling and debugging by separate means?
  4. Uniformity within vertically composable protocols: that's why I've suggested having 2 APIs per protocol: generic (msock/bsock) and specific (tcp, udp, wsock, etc.). If you want to build your new protocol on top of a message-oriented transport, use msock; it will work with wsock, udp, sctp, pgm, etc. If you need specific underlying protocol features, use that protocol directly.
  5. The important thing here is how to turn a specific API into the generic API. Consider a multiplexed protocol. When sending a message you have to specify the channel number. To turn that into msock you need to infer the channel number. Two options occurred to me: 1) use a default channel; 2) provide a function to get a channel handle from the protocol handle, and use that handle to send/recv messages (see the sketch below).
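
A sketch of option 2 from point 5 (multiplexer() and mux_channel() are hypothetical names, not existing dsock functions): the multiplexed protocol hands out a per-channel handle that itself implements msock, so generic code never needs to know about channel numbers.

int m = multiplexer(tcph);          /* hypothetical mux on top of a TCP handle */
int ch = mux_channel(m, 42);        /* hypothetical: get a handle for channel 42 */
int rc = msend(ch, "hello", 5, -1); /* the channel handle behaves as a plain msock */
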
sustrik commented 8 years ago

Ok, got a few hours free, done some work.

  1. I've changed the API in such a way as not to allow partial sends/recvs. Either you transfer all the bytes or none of them.
  2. As an example of vertical layering I've added the CRLF protocol (splits a bytestream into CRLF-delimited messages), which should work on top of any bytestream protocol (for now: TCP, UNIX). See tests/crlf.c for how it's used.
sustrik commented 8 years ago

And for good measure, a PFX protocol (messages prefixed by a 64-bit size in network byte order).
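
The framing is simple enough that the send side can be sketched in a few lines (illustrative only; the real implementation is pfx.c in the repository):

/* PFX send side: an 8-byte big-endian (network byte order) length, then the payload. */
uint8_t hdr[8];
size_t i;
for(i = 0; i != 8; ++i)
    hdr[i] = (uint8_t)((uint64_t)len >> (56 - i * 8));
if(bsend(u, hdr, sizeof(hdr), deadline) != 0) return -1; /* u: underlying bytestream */
if(bsend(u, buf, len, deadline) != 0) return -1;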

sustrik commented 8 years ago

And an attempt at an API RFC: https://raw.githubusercontent.com/sustrik/dsock/master/rfc/sock-api-revamp-01.txt

raedwulf commented 8 years ago

I've realised that there are two issues with not allowing partial sends/recvs.

Firstly, there's the minor issue that protocols like CRLF become inefficient because they do not know lengths beforehand, so a parsing algorithm cannot scan the buffer without invoking several layers of abstraction per byte. This might not be too bad, as HTTP header sizes are quite small, but it will add up if someone wants to implement an HTTP/1.1 server for thousands of connections.

Secondly, if the protocol is handled by an external library, e.g. OpenSSL, that library assumes the wrapped functions behave in the same way as UNIX read/write, which do allow partial sends/reads. Of course, this could be emulated, but the number of layers of abstraction traversed per byte would be overwhelming. OpenSSL has around 2 layers of abstraction for its I/O, and then you have the extra layer of abstraction in dsock.

I've almost completed a preliminary, full-featured implementation that wraps libtls in libdill; it still needs testing and a way around this problem. It uses libdill networking, as opposed to the linked Libre/OpenSSL library. I'll push a branch to my fork later today, as I'm rebasing the patch on the latest changes.

raedwulf commented 8 years ago

I have two approaches in mind that avoid breaking the semantics of bsend and brecv:

My first approach is a new 'oracle' function:

DSOCK_EXPORT int bwait(int s, size_t *len, int64_t deadline);

It waits on the socket until there is something to read, then returns the amount of data waiting in the buffer. A subsequent brecv can then use the value of len returned by bwait.

There is still buffer juggling when layering byte protocols: if this function is used, some byte protocols will need to keep their own buffer whenever the number of bytes they need differs from what the underlying protocol holds.
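
Usage would look something like this (bwait being the proposed function above, not an existing dsock call):

size_t avail;
if(bwait(s, &avail, deadline) != 0) return -1; /* block until some data is buffered */
/* avail bytes are already buffered, so this read cannot block or fail
   partially (assuming buf is at least avail bytes long) */
if(brecv(s, buf, avail, deadline) != 0) return -1;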


Another approach would be to allow byte protocols to be chained, with the processing done as new data is being read. The chain terminates when it gets converted into a message-based protocol. This would introduce mandatory (EDIT: optional) fields in bsock_vfs:

struct bsock_vfs {
    struct bsock_vfs *next;  /* next protocol in the chain */
    /* called as new data arrives, before any buffering takes place */
    int (*bprocessv)(struct bsock_vfs *vfs, const struct iovec *iov, size_t iovlen,
        int64_t deadline);
    int (*bsendv)(struct bsock_vfs *vfs, const struct iovec *iov, size_t iovlen,
        int64_t deadline);
    int (*brecvv)(struct bsock_vfs *vfs, const struct iovec *iov, size_t iovlen,
        int64_t deadline);
};

bprocessv for the first link in the chain is NULL. For subsequent links, bprocessv gets called whenever new data appears. In this case, fd_read would just pass the data it reads on to the next protocol layer without buffering it. Thus only end links buffer data, which can then be exposed using bsendv/brecvv.

The latter approach would likely have better performance...


EDIT: After thinking about it a bit, bprocessv need not be mandatory: if a protocol does not support it, it can set bprocessv to NULL and revert to the previous internal brecvv behaviour without any external semantic difference.


EDIT: A better way would be to make bprocessv part of an interface and have hnext implemented in libdill to get the next handle in the list.

sustrik commented 8 years ago

I've been down both of those roads.

The bprocess way leads to callback hell - that's what happened in ZeroMQ - and bwait leads to a state machine hell - see nanomsg.

All in all, it seems that the only manageable way to write network code is to be purely imperative and never do the "save state and return to this unfinished operation later" stuff by hand. That's what the scheduler is for, after all.

Compare for example the implementation of PFX protocol in dsock:

https://github.com/sustrik/dsock/blob/master/pfx.c

with implementation of the same protocol in nanomsg:

https://github.com/nanomsg/nanomsg/blob/master/src/transports/tcp/stcp.c

As for the two problems you've described:

  1. Inefficient CRLF protocol: I am not sure the performance degradation would even be measurable. Note that the TCP protocol buffers received data, so it's not like there is going to be a user/kernel-space transition per byte read. That being the case, the performance impact is limited to one function call per byte. And one doesn't hear many complaints about the cost of function calls since the 1970s, so I guess it can be considered negligible.
  2. I cannot say much about SSL as I am not really familiar with the OpenSSL API. However, I can imagine it can (in the worst possible case) be solved by creating a single fat TCP+SSL protocol instead of separate TCP and SSL protocols.

Actually, the technique of conflating protocols can be used whenever there's a need for a super-efficient implementation; e.g. TCP+CRLF can avoid the cost of the extra function call.

raedwulf commented 8 years ago

Thanks! Ouch, yes, that does introduce a lot of complexity.

  1. I can do some quick benchmarks to see if that's indeed an issue (an easy test would be a fat TCP+CRLF implementation).
  2. Okay, that makes sense - I'll try opting for the fat TCP+SSL protocol route.
sustrik commented 8 years ago

Ok, I've added a slight optimisation to the CRLF protocol. It used to do two virtual function calls per byte -- hquery and brecvv -- and now it does only brecvv. Probably an unmeasurable improvement, but still.

If you want to push it even further you may consider adding special optimised code path for reading 1 byte in fd_recv().

raedwulf commented 8 years ago

Would there be any downside to introducing a function like libmill's

size_t tcprecvuntil(tcpsock s, void *buf, size_t len, const char *delims, size_t delimcount, int64_t deadline);

All the non-framed protocols I can think of use text, which is always delimited. So an optional, optimised implementation of recvuntil would solve most of the potential performance issues. It's optional because it is trivially implementable in terms of recv, which can serve as the default when a protocol doesn't provide its own recvuntil (see the sketch below).
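
The recv-based default could be a byte-at-a-time loop along these lines (a sketch against dsock-style brecv; brecvuntil and its out-parameter are made-up names):

/* Generic fallback: read one byte at a time until a delimiter is found or
   the buffer is full. Returns 0 on success, -1 on error. */
int brecvuntil(int s, void *buf, size_t len, const char *delims,
        size_t delimcount, size_t *rcvd, int64_t deadline) {
    uint8_t *p = buf;
    size_t i, j;
    for(i = 0; i != len; ++i) {
        if(brecv(s, &p[i], 1, deadline) != 0) return -1;
        for(j = 0; j != delimcount; ++j)
            if(p[i] == (uint8_t)delims[j]) {*rcvd = i + 1; return 0;}
    }
    *rcvd = len; /* buffer exhausted without finding a delimiter */
    return 0;
}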

sustrik commented 8 years ago

I am still not sure about it. The only difference would be whether the function call is done inside the loop or outside of it. It would mean fewer function calls, but given how tight the loop is, and that probably all the code and data are in the L1 cache, it's hard to tell whether the performance impact would even be measurable.

On the other hand, if you introduce tcprecvuntil(), you should make it generic, so it should deal with multibyte terminators (CRLF) and multiple terminators (like 0x01 and '|' in the FIX protocol). As a thought experiment, imagine that we wanted to be fully generic and specified the terminator as a regexp. Surely the cost of the regexp would outweigh the cost of the extra function call... My point is that allowing a specialized delimiter-checking algorithm in the protocol on top is not only more flexible, but can also be more efficient.

Finally, consider how cheap receiving one byte can be:

int tcprecvv(...) {
    /* Fast path: a single 1-byte read served straight from the rx buffer. */
    if(iovlen == 1 && iov[0].iov_len == 1 && rxbuf->remaining > 0) {
        ((uint8_t*)iov[0].iov_base)[0] = rxbuf->data[rxbuf->pos];
        rxbuf->pos++;
        rxbuf->remaining--;
        return 0;
    }
    ...
}
raedwulf commented 8 years ago

Your point does make sense. I'll use the existing interface and see if I encounter any difficulties. Thanks!