mpiraux / pigotls

A very minimal Go binding for picotls

Context occasionally has a null ALPN vector in Handshake properties #2

Open tiferrei opened 4 years ago

tiferrei commented 4 years ago

Hi there,

I'd like to start by thanking you for the work done on both pigotls and quic-tracker. Both are the most flexible and modular QUIC and TLS implementations I've ever come across.

Some background: I've been trying to adapt quic-tracker to let me send QUIC packets on demand, meaning I can request that it send, say, 3 Initial packets with a crypto frame, and it does just that, waits for the response, and reports back the results. This relies mostly on the FrameQueue and the modular agents.

After using this "on-demand adapter" a few times, I started seeing that sometimes, when C.ptls_handle_message was called, properties.client.negotiated_protocols.list in the handshake properties passed to it would have a null base pointer (0x0) but a length of 4, which naturally crashes the program.

I found this a bit strange, since (I believe) the negotiated protocols in the properties are set when a new connection is created and should stay the same for the duration of that connection. Nonetheless, just to be sure, I tried hardcoding the ALPN string passed to a new Context:

func NewContext(ALPN string, resumptionTicket []byte) Context {
    var ctx C.ptls_context_t
    var handshakeProperties C.ptls_handshake_properties_t
    var savedTicket C.ptls_iovec_t
    var maxEarlyData C.size_t

    var zeroRTTSecret C.ptls_iovec_t
    var hsReadSecret  C.ptls_iovec_t
    var hsWriteSecret C.ptls_iovec_t
    var apReadSecret  C.ptls_iovec_t
    var apWriteSecret C.ptls_iovec_t

    c := Context{
        ctx: &ctx,
        handshakeProperties: &handshakeProperties,
        savedTicket: &savedTicket,
        maxEarlyData: &maxEarlyData,

        zeroRTTSecret: &zeroRTTSecret,
        hsReadSecret: &hsReadSecret,
        hsWriteSecret: &hsWriteSecret,
        apReadSecret: &apReadSecret,
        apWriteSecret: &apWriteSecret,
    }

    C.init_ctx(&ctx)

-       c.alpnVec = toIOVec([]byte(ALPN))
+       c.alpnVec = toIOVec([]byte("hq-29"))
    resumptionTicketVec := toIOVec(resumptionTicket)
    C.set_handshake_properties(c.handshakeProperties, &c.alpnVec, &resumptionTicketVec, c.maxEarlyData)
    C.set_ticket_cb(c.ctx, c.savedTicket)
    C.set_traffic_secret_cb(c.ctx, c.zeroRTTSecret, c.hsReadSecret, c.hsWriteSecret, c.apReadSecret, c.apWriteSecret)

    return c
}

Oddly, this didn't seem to make a difference. C.set_handshake_properties always appears to be called with valid parameters, yet C.ptls_handle_message would still occasionally be passed properties with a null ALPN vector, and the value would change without C.set_handshake_properties being called in between.

I "detected" this mostly by printing the value of the vector on C.ptls_handle_message and on C.set_handshake_properties, so we can see the value of the vector change as the functions are called. Oddly I got the following trace:

(...)
[1313e50c956602ed/TLSAgent] tls_agent.go:154: INFO: TLS Queue empty, sending new CHELLO at Initial enc level
[DEBUG C] Base: 0xc0007b2060 Len: 0 // <-- C.ptls_handle_message called, and the vector has a valid pointer base and 0 len.
[1313e50c956602ed/FrameQueueAgent] frame_queue_agent.go:122: Received a CRYPTO frame for encryption level Initial
[CID 1313e50c956602ed] connection.go:152: Sending packet {type=Initial, number=1}
[CID 1313e50c956602ed] connection.go:159: Sent 1252 bytes to UDP socket
[1313e50c956602ed/SocketAgent] socket_agent.go:66: Received 57 bytes from UDP socket
[1313e50c956602ed/ParsingAgent] parse_agent.go:58: Decrypting packet number of Handshake packet of length 57 bytes
[1313e50c956602ed/ParsingAgent] parse_agent.go:83: Successfully decrypted header {type=Handshake, number=3}
[1313e50c956602ed/ParsingAgent] parse_agent.go:140: Successfully parsed packet {type=Handshake, number=3, length=41}
[1313e50c956602ed/TLSAgent] tls_agent.go:76: Received Handshake packet in PN space Handshake and 0 bytes from the corresponding crypto stream
[1313e50c956602ed/HandshakeAgent] handshake_agent.go:86: The connection was closed before the handshake completed
[1313e50c956602ed/HandshakeAgent] handshake_agent.go:88: Agent terminated
[ADAPTER] adapter.go:129: Got response: HANDSHAKE(0xff00001d)[CONNECTION_CLOSE]
[1313e50c956602ed/HandshakeAgent] handshake_agent.go:188: New status HandshakeStatus{Completed=false, Error=the connection was closed before the handshake completed}
[ADAPTER] adapter.go:175: Server input: [INITIAL(0xff00001d)[CRYPTO]]
[ADAPTER] adapter.go:96: Submitting request: INITIAL(0xff00001d)[CRYPTO]
[1313e50c956602ed/TLSAgent] tls_agent.go:154: INFO: TLS Queue empty, sending new CHELLO at Initial enc level
[DEBUG C] Base: 0x0 Len: 4 // <-- C.ptls_handle_message called, but this time the base is a null pointer, and we have a Len of 4.
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7fff71af49c2]

C.set_handshake_properties was not called between the two values changing, as the print statement I added to it never executed. This leads me to believe that something in between is changing the properties, and oddly, when it happens, it's always a null base pointer with a length of 4.

I'm still trying to figure out what exactly causes this change in value. As far as I can see nothing else should be able to change the negotiated protocols portion of the properties. Would you be able to provide some insight as to why this might be happening?

For further context:

Thank you! Tiago

mpiraux commented 4 years ago

Hi!

I'd like to start by thanking you for the work done on both pigotls and quic-tracker. Both are the most flexible and modular QUIC and TLS implementations I've ever come across.

Thanks a lot! Flexibility has always been a goal; I am happy that you find it suitable for your work.

Some background: I've been trying to adapt quic-tracker to let me send QUIC packets on demand, meaning I can request that it send, say, 3 Initial packets with a crypto frame, and it does just that, waits for the response, and reports back the results. This relies mostly on the FrameQueue and the modular agents.

Great! I saw that the other day. Do you envision building a tool like packetdrill?

First, thank you for the deep troubleshooting you have already performed. CGO is a very ~painful~ complicated piece. I would have expected the error to be a simple memory-management issue, such that hardcoding the ALPN would move the string out of the heap into static memory and solve it. Apparently it didn't.

I changed this part of the code a few commits ago because a similar memory error was happening when receiving a TLS HelloRetryRequest message, see 32d23c495f3ddfe8cbb808b028123a20c1476dc2. I see in the trace that the connection gets closed, and then your adapter wants to send some more Initial packets. I don't think I've ever tried to generate and send an Initial packet after the connection is closed. However, reading the Handshake and TLS agents' code, nothing seems to interact with pigotls when a QUIC CONNECTION_CLOSE frame is received.

I am surprised that initially the iovec_t has a length of zero but a valid pointer. The toIOVec() function should set it to a meaningful value. You are printing c.alpnVec inside each HandleMessage() call, right?
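For reference, toIOVec() itself is not shown in this thread; the sketch below is only a guess at what such a helper typically does (not the actual pigotls implementation, and it assumes picotls.h in the cgo preamble). The key property is that the resulting iovec points at C-allocated memory rather than directly at Go-managed memory, which the cgo pointer rules forbid C from retaining:

// Hypothetical sketch, not the real pigotls toIOVec(): copy the Go slice into
// C-allocated memory so the resulting ptls_iovec_t never points at memory
// managed by the Go runtime.
func toIOVecSketch(data []byte) C.ptls_iovec_t {
    var vec C.ptls_iovec_t
    if len(data) > 0 {
        vec.base = (*C.uint8_t)(C.CBytes(data)) // C-side copy; must be freed explicitly later
        vec.len = C.size_t(len(data))
    }
    return vec
}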

Mmh, at this point I suspect an odd compiler bug, as ptls_handle_message() does not change the handshake_properties. Maybe one way to side-step it would be to store the Go string and reset it at the start of each HandleMessage() call. It's not the best solution we could hope for, though :/

mpiraux commented 4 years ago

If it's a compiler bug, then it's time for odd suggestions. Could you try preventing SetQUICTransportParameters() from being called more than once? I guess you don't change the transport parameters, so it won't impact you. It's the only other function I can see that touches the handshake properties, even if only slightly.
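For illustration only, such a guard on the adapter side could look like the sketch below; the wrapper type, its tpSet field, and the exact SetQUICTransportParameters signature are assumptions rather than code from quic-tracker or pigotls:

// Hypothetical sketch: make sure the underlying pigotls call happens at most
// once per connection, since the transport parameters do not change afterwards.
type connectionWrapper struct {
    tls   *pigotls.Connection // the wrapped pigotls connection
    tpSet bool                // set once the transport parameters have been applied
}

func (w *connectionWrapper) setTransportParametersOnce(params []byte) {
    if w.tpSet {
        return // transport parameters don't change, so apply them only once
    }
    w.tpSet = true
    w.tls.SetQUICTransportParameters(params) // underlying call, signature assumed
}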

tiferrei commented 4 years ago

Thank you for such a fast reply, I really do appreciate it!

Great! I saw that the other day. Do you envision building a tool like packetdrill?

Something like packetdrill for QUIC would be really cool, although I think for my specific use case we might just need a simpler version. Our goal is to do some active model learning and see if we can build a state machine for different implementations, and see how they differ. As part of that, we need to be able to send any kind of packet at any point in the state machine (within what's technically possible), whether it is "valid" or not in terms of the protocol rules.

You are printing c.alpnVec inside each HandleMessage() call, right?

Well, for this specific case I actually just edited the bundled picotls code, added a print statement like so, and recompiled:

int ptls_handle_message(ptls_t *tls, ptls_buffer_t *sendbuf, size_t epoch_offsets[5], size_t in_epoch, const void *input,
                        size_t inlen, ptls_handshake_properties_t *properties)
{
+   fprintf(stderr, "[DEBUG C] Base: %p Len: %zu\n", properties->client.negotiated_protocols.list->base, properties->client.negotiated_protocols.list->len);
    return tls->is_server ? ptls_server_handle_message(tls, sendbuf, epoch_offsets, in_epoch, input, inlen, properties)
                          : ptls_client_handle_message(tls, sendbuf, epoch_offsets, in_epoch, input, inlen, properties);
}

Just something really simple to help with debugging, since I couldn't use breakpoints, so as not to introduce time-dependent behaviour.

Mmh, at this point I suspect an odd compiler bug, as ptls_handle_message() does not change the handshake_properties.

I agree, it might be something more low-level like that, especially since it only happens from time to time. I forgot to mention it in this issue, but to reproduce it I have a script that executes "send 5 Initial packets with crypto frames" in a loop for 100 iterations, with the connection resetting (closing the connection and starting a new one) at the end of each iteration. Sometimes it goes through all 100 iterations without issues; other times the issue happens on the 1st or 5th iteration. It's a bit unpredictable.
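Roughly, the reproduction script does the equivalent of the sketch below; the adapter function names are placeholders for illustration, not the actual adapter API:

// Sketch of the reproduction loop: the issue sometimes appears on an early
// iteration and sometimes not at all within 100 iterations.
for i := 0; i < 100; i++ {
    conn := adapter.NewConnection()    // start a fresh connection
    conn.SendInitialsWithCrypto(5)     // "send 5 Initial packets with crypto frames"
    conn.AwaitResponse()               // wait for and record the server's response
    conn.Close()                       // reset: close the connection before the next iteration
}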


I adapted the code to only run SetQUICTransportParameters() once, in fact once for the lifetime of that Context object, since I need to be able to reset the connection, after which the state should be cleared. This seems to have shifted the issue to a later point in the function (progress, I guess!), and now I have the following backtrace:

#0  0x00007feffafffcfc in memcpy () from /lib/ld-musl-x86_64.so.1
#1  0x00000000006fe16b in ptls_buffer__do_pushv (len=3, src=0xa1, buf=0xc0003188c0)
    at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:539
#2  ptls_buffer__do_pushv (buf=0xc0003188c0, src=0xa1, len=3) at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:531
#3  0x00000000006fe876 in push_additional_extensions (properties=properties@entry=0xc000222480, sendbuf=sendbuf@entry=0xc0003188c0)
    at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:1504
#4  0x00000000007091e5 in send_client_hello (tls=tls@entry=0xdb6ce0, emitter=emitter@entry=0x7fefd3d6d960,
    properties=properties@entry=0xc000222480, cookie=cookie@entry=0x0) at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:1966
#5  0x000000000070b9e2 in ptls_client_handle_message (tls=0xdb6ce0, sendbuf=<optimized out>, epoch_offsets=<optimized out>, in_epoch=0,
    input=<optimized out>, inlen=<optimized out>, properties=0xc000222480) at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:5321
#6  0x00000000006eac72 in _cgo_9311b982cc12_Cfunc_ptls_handle_message (v=0xc0003c7370) at cgo-gcc-prolog:342
#7  0x0000000000467ce0 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:655
#8  0x0000000000462f86 in runtime.mallocgc.func1 () at /usr/local/go/src/runtime/malloc.go:1047
#9  0x000000c000316300 in ?? ()
#10 0x000000000043e290 in ?? () at <autogenerated>:1
#11 0x0000000000000000 in ?? ()

If we zoom into frame 3, we see this:

#3  0x00000000006fe876 in push_additional_extensions (properties=properties@entry=0xc000222480, sendbuf=sendbuf@entry=0xc0003188c0)
    at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:1504
1504                buffer_push_extension(sendbuf, ext->type, { ptls_buffer_pushv(sendbuf, ext->data.base, ext->data.len); });
(gdb) l
1499        int ret;
1500
1501        if (properties != NULL && properties->additional_extensions != NULL) {
1502            ptls_raw_extension_t *ext;
1503            for (ext = properties->additional_extensions; ext->type != UINT16_MAX; ++ext) {
1504                buffer_push_extension(sendbuf, ext->type, { ptls_buffer_pushv(sendbuf, ext->data.base, ext->data.len); });
1505            }
1506        }
1507        ret = 0;
1508    Exit:

Looking at line 1504, if we display *ext, we see the following:

*ext = {type = 65, data = {base = 0xa1 <error: Cannot access memory at address 0xa1>, len = 3}}

This base address is not a null pointer, but it doesn't look like a valid memory address either. Oddly, it is also always base 0xa1 and Len 3. Let's see where this came from. Looking at line 1503, ext points to the additional_extensions array and is incremented until it reaches an entry whose type is UINT16_MAX, which presumably acts as a sentinel marking the end of the array; I'm just not that familiar with the implementation. But if we look at our additional_extensions, we see this:

*properties->additional_extensions = {type = 65445, data = {base = 0xdb8080 "\005\004\200", len = 47}}

Where that leaves us.... I'm not too sure.
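For reference, the loop in push_additional_extensions shown above keeps advancing ext until it finds an entry whose type is UINT16_MAX, so the array is expected to end with such a sentinel entry. Below is a hypothetical cgo-level sketch of that layout, for illustration only (it assumes picotls.h and stdlib.h in the cgo preamble; the actual setup in pigotls may differ and live on the C side):

// Hypothetical illustration of the expected additional_extensions layout:
// one real entry for the QUIC transport parameters extension (type 65445, as
// seen above), followed by a UINT16_MAX sentinel terminating the array.
func buildAdditionalExtensionsSketch(transportParameters []byte) *C.ptls_raw_extension_t {
    exts := (*[2]C.ptls_raw_extension_t)(C.malloc(2 * C.sizeof_ptls_raw_extension_t))
    exts[0].type_ = 0xffa5 // 65445; cgo exposes the C field `type` as `type_`
    exts[0].data.base = (*C.uint8_t)(C.CBytes(transportParameters))
    exts[0].data.len = C.size_t(len(transportParameters))
    exts[1].type_ = 0xffff // UINT16_MAX sentinel: push_additional_extensions stops here
    return &exts[0]
}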

mpiraux commented 4 years ago

Could you execute your program with valgrind or an equivalent memory checker? This would give us a hint about the source of the memory corruption. I wrote most of the pigotls code early in my understanding of Go and C, so we might also catch a subtle bug that went unnoticed at the time.

tiferrei commented 4 years ago

Hmm yes, let's see if it finds something odd. I fear it might interfere with the experiment rather than just observe it, though, since Valgrind is known to slow execution down considerably, which might affect the server's responses. Nonetheless, it should point us in the right direction. I'll report back once I have some findings.

tiferrei commented 4 years ago

So, running valgrind turned out not to be super helpful, since it slowed down execution so considerably that I think it may have prevented the issue from happening at all. In the end it reported a couple hundred bytes as definitely lost, but I don't think that is particularly meaningful.

A surprising twist, though: I set the properties again before running handle_message, as you suggested:

func (c *Connection) HandleMessage(data []byte, epoch Epoch) ([]Message, bool, error) {
    var sendbuf C.ptls_buffer_t
    C.ptls_buffer_init(&sendbuf, unsafe.Pointer(C.CString("")), 0)
    defer C.ptls_buffer_dispose(&sendbuf)

    var recbuf unsafe.Pointer = nil
    var inputLen C.size_t

    if data != nil {
        recbuf = C.CBytes(data)
        defer C.free(recbuf)
        inputLen = sizeofBytes(data)
    }

    var epoch_offsets [5]C.size_t
+   C.set_handshake_properties(c.Context.handshakeProperties, &c.Context.alpnVec, &c.Context.resumptionTicketVec, c.Context.maxEarlyData)
    ret := C.ptls_handle_message(c.tls, &sendbuf, (*C.size_t)(unsafe.Pointer(&epoch_offsets)), C.size_t(epoch), recbuf, inputLen, c.Context.handshakeProperties)

    retBuf := bufToSlice(sendbuf)
    var messages []Message

    for i, e := range Epochs {
        if epoch_offsets[i+1] > epoch_offsets[i] {
            messages = append(messages, Message{retBuf[epoch_offsets[i]:epoch_offsets[i+1]], e})
        }
    }

    if ret != 0 && ret != C.PTLS_ERROR_IN_PROGRESS {
        return nil, false, Error{int(ret)}
    }

    return messages, ret != 0, nil
}

This has, as far as I can see, fixed the issue. I've run the program for 1000 iterations with no crashes whatsoever, and consistent results too.

This tells me there's something wrong going on at the lower levels of CGO.

mpiraux commented 4 years ago

I would be OK with merging a fix based on this solution. It does sound like the problem lies within CGO. I would have preferred to find the root cause of the bug, but we've spent enough time investigating to start considering easier alternatives :) Maybe you could restrict this call based on the epoch argument? I guess you only need to call it if it's the Initial epoch.
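A sketch of that restriction inside HandleMessage, assuming an Initial-epoch constant exists (the name EpochInitial is a placeholder):

    // Only re-apply the handshake properties when handling the Initial epoch,
    // i.e. when the ClientHello is about to be built (constant name assumed).
    if epoch == EpochInitial {
        C.set_handshake_properties(c.Context.handshakeProperties, &c.Context.alpnVec, &c.Context.resumptionTicketVec, c.Context.maxEarlyData)
    }
    ret := C.ptls_handle_message(c.tls, &sendbuf, (*C.size_t)(unsafe.Pointer(&epoch_offsets)), C.size_t(epoch), recbuf, inputLen, c.Context.handshakeProperties)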

tiferrei commented 4 years ago

Hey, an update on this. I tried restricting the call to only the Initial epoch, and although that greatly reduced the occurrence of the issue, it did still happen occasionally after some tries. I'm now confirming whether calling it on every HandleMessage (a pity, I know) solves the issue permanently.