ooni / minivpn

A minimalistic OpenVPN implementation in Go
GNU General Public License v3.0

Feat/reliability layer #37

Closed ainghazal closed 6 months ago

ainghazal commented 1 year ago

Implementation of the OpenVPN reliability layer.

This is still a WIP, since test coverage has dropped due to the changes in this branch.

ainghazal commented 1 year ago

Some notes on the "simple reliability layer" patch

I'm writing a few lines because some time has passed since I started this PR.

The initial goal was to address one of the weaknesses reported by 7asecurity in their audit: the fact that injection of a bogus packet could stall the (openvpn) handshake.

Here I documented some initial thoughts about approximating the failure rate of the reference openvpn implementation, which seems to depend on the packet injection rate, but only fails at rates that I consider unrealistic for field observations.

However, I kept building on top of that implementation because using the "reliability layer" seemed to provide better stability while testing on some networks.

Now I realize that I was using the need to follow the openvpn implementation more closely as an excuse for a long-needed refactor. From the commit message in 9946f2a3:

the simple thing to do was to make session an object
owned by an implementation of reliableTransport. I've reused the
reliableUDP implementation in govpn, and I like the simplicity of that
implementation a lot. A lot of our current logic (ackqueue/retries)
needed to move from the tlsTransport minivpn implementation into
reliableTransport.

Part of this refactor is therefore orthogonal to the problem of the "reliability" layer for the TLS handshake and the control channel, and those improvements should perhaps be considered separately.

code smells in the previous implementation

1. several of the methods in the controlHandler were being passed a session object (SendACK, ControlMessage, etc.).

-func (c *control) SendACK(conn net.Conn, s *session, pid packetID) error {
-   return sendACKFn(conn, s, pid)
+func (c *control) SendACK(conn net.Conn, r *reliableTransport, pid packetID) error {
+   return sendACKFn(conn, r, pid)

Session should only keep state, but when sending packets it was producing side effects (incrementing counters, etc.). After this changeset, session is owned by reliableTransport, which is now consistently passed around wherever a session object was passed before.
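
To illustrate the ownership change, here is a minimal sketch. session and reliableTransport are the names used in this branch, but the fields and the NextPacketID method below are made up for the example:

package vpn

// packetID identifies a control-channel packet; the type is assumed here
// just to keep the sketch self-contained.
type packetID uint32

// session only keeps state and no longer produces side effects on send.
// The fields shown are illustrative, not the actual struct.
type session struct {
    localPacketID packetID // next outgoing packet id
    lastACK       packetID // highest packet id acked so far
}

// reliableTransport owns the session: everything that previously took a
// *session now takes the *reliableTransport instead.
type reliableTransport struct {
    session *session
}

// NextPacketID mutates the counter on behalf of the session, so that
// sending code never touches session state directly.
func (r *reliableTransport) NextPacketID() packetID {
    id := r.session.localPacketID
    r.session.localPacketID++
    return id
}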

2. excessive knowledge in transport.TLSTransport and session.

I think the implementation of tlsModeTransporter under vpn/transport was mixing responsibilities.

Basically, tlsTransport had access to session and implemented two different read methods (doReadFromConn, doReadFromQueue). Session had an ackQueue (a chan *packet) where packets were temporarily buffered until we had acked all the earlier packets.

The design of a layer above session that is in charge of sending packets and tracking which ones have been acked seems cleaner now.

Related to this: session.ackQueue was previously implemented as a buffered channel of arbitrary length (100). The capacity is now fixed at reliableRecvCacheSize = 8, and any incoming packet with a packet-ID delta beyond that window is dropped (which more closely follows the openvpn implementation).
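
As a rough sketch of this bounded buffering (reliableRecvCacheSize is the constant from this branch; the recvWindow helper and its accept method are hypothetical):

package vpn

// reliableRecvCacheSize is the constant introduced in this branch: we only
// buffer incoming packets whose id falls within this window.
const reliableRecvCacheSize = 8

// recvWindow is a hypothetical helper that decides whether an incoming
// packet id should be buffered or dropped.
type recvWindow struct {
    nextExpected uint32 // lowest packet id we have not acked yet
}

// accept returns true if the packet fits inside the receive window; packets
// with a delta beyond the cache size are dropped, which follows the
// reference openvpn implementation more closely.
func (w *recvWindow) accept(id uint32) bool {
    if id < w.nextExpected {
        return false // duplicate or already-acked packet
    }
    return id-w.nextExpected < reliableRecvCacheSize
}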

3. Inability to use different contexts for different parts of the handshake

mux.Handshake would fail all at once, with a single deadline for the whole handshake. While working on this PR, I tried to separate the timeouts for the different stages of the VPN handshake (in vpn/client.go). I think this can be further improved by making the different periods configurable instead of hardcoded.

c.emit(EventHandshake)
handshakeCtx, handshakeCancel := context.WithDeadline(
    ctx,
    time.Now().Add(30*time.Second))
defer handshakeCancel()
err = mux.Handshake(handshakeCtx)
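
One way to make the periods configurable would be to lift them into a small struct. This is only a sketch of the idea: none of these names exist in the codebase, and the default values (other than the 30s above) are invented.

package vpn

import (
    "context"
    "time"
)

// HandshakeTimeouts groups the deadlines for the different stages of the
// VPN handshake. Names and defaults are illustrative; only the 30s
// deadline matches the excerpt above.
type HandshakeTimeouts struct {
    Reset     time.Duration // hard reset / session negotiation
    TLS       time.Duration // TLS handshake over the control channel
    PushReply time.Duration // key exchange and push-reply
}

// DefaultHandshakeTimeouts holds placeholder values.
var DefaultHandshakeTimeouts = HandshakeTimeouts{
    Reset:     10 * time.Second,
    TLS:       30 * time.Second,
    PushReply: 10 * time.Second,
}

// withStageDeadline derives a per-stage context from the parent context so
// that each stage of the handshake can fail independently.
func withStageDeadline(ctx context.Context, d time.Duration) (context.Context, context.CancelFunc) {
    return context.WithDeadline(ctx, time.Now().Add(d))
}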

smells still unresolved

1. Use of retry loops with no external observability (and probably lack of correctness)

At some point (driven by the "keep retrying even in the face of bad data" mentality), I added a few for loops that, in retrospect, might not be justified. They probably need to be removed so that any error bubbles up. Retrying in a request-response pattern might make sense, but at the very least we should keep track of the number of unexpected answers, since I think a well-behaved server should just return what we expect (a sketch of a bounded alternative follows the examples below).

Failing in these conditions is probably also an option: tls-auth should give tamper-protection guarantees for the control channel.

In muxer.handshake():

// XXX I don't know how I feel now about this. Sounds messy, especially taking
// into account that Reset() has its own retry loop in which it keeps waiting for 
// a response with the expected len.

for {
    if err := m.Reset(m.conn, m.reliable); err == nil {
        break
    }
}

// XXX I now think this one is plainly wrong, nothing here should go through the loop
for {
    tlsConf, err = initTLSFn(m.reliable.session, certCfg)
    if err != nil {
        logger.Errorf("%w: %s", ErrBadTLSHandshake, err)
        break
    }
    tlsConn, err = newControlChannelTLSConn(m.conn, m.reliable)
    break
}

[...]

for {
    select {
    case <-timeoutTLS.C:
        // we give up on waiting for the TLS handshake
        return fmt.Errorf("%w: %s", ErrBadTLSHandshake, "timeout")
    default:
        tls, err = tlsHandshakeFn(tlsConn, tlsConf)

        // this retry DOES seem justified
        if err != nil {
            logger.Error(fmt.Errorf("%w: %s", ErrBadTLSHandshake, err).Error())
            continue
        }
    }
    break
}

Another example in muxer.InitDataWithRemoteKey():

    for {
        if err := m.readAndLoadRemoteKey(); err == nil {
            break
        }
    }
[...]
    for {
        if err := m.readPushReply(); err != nil {
            // XXX this seems hacky - the random sleep should be
            // removed - it's probably coming from some debug session.
            i := rand.Intn(500)
            fmt.Printf("error reading push reply: %v. sleeping: %vms\n", err, i)
            time.Sleep(time.Millisecond * time.Duration(i))
            continue
        }
        break
    }
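
For comparison, a bounded retry helper that counts failures and bubbles up the last error could look roughly like this. It is a sketch, not code from this branch; for instance, the unbounded readAndLoadRemoteKey loop above would become something like retryBounded(3, m.readAndLoadRemoteKey):

package vpn

import "fmt"

// retryBounded runs fn up to maxAttempts times. It returns the number of
// failed attempts (useful to expose as an observability counter) and the
// last error if fn never succeeded, so that failures bubble up instead of
// spinning forever.
func retryBounded(maxAttempts int, fn func() error) (failures int, err error) {
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = fn(); err == nil {
            return failures, nil
        }
        failures++
    }
    return failures, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}
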
ainghazal commented 6 months ago

Closing: this PR is superseded by the refactor mentioned in #47; the reliability layer work will be done once that lands.