Open tiferrei opened 4 years ago
Hi!
I'd like to start by thanking you for the work done on both pigotls and quic-tracker. Both are the most flexible and modular QUIC and TLS implementations I've ever come across.
Thanks a lot! Flexibility has always been a goal, I am happy that you find it suitable for your works.
Some background, I've been trying to adapt quic-tracker to allow me to send QUIC packets on demand, that meaning that I request that it send, say, 3 Initial packets with a crypto frame, and it does just that, waits for the response, and reports back on the results. This relies mostly on the usage of the FrameQueue and the modular agents.
Great! I saw that the other day. Do you envision building a tool like packetdrill ?
First thank you for the deep troubleshooting you already performed. CGO is a very ~painful~ complicated piece. I would have expected the error to be related to simple memory management, so that hardcoding the ALPN would move out the string from the heap to the static memory and solve that. Apparently it didn't.
I changed this part of the code a few commits ago because a similar memory error was happening when receiving a TLS HelloRetryRequest message, see 32d23c495f3ddfe8cbb808b028123a20c1476dc2. I see in the trace that the connection get closed, then your adapter wants to send some more Initial packets. I don't think I've ever tried to generate and send an Initial packet after the connection gets closed. However, when reading the Handshake and TLS agents code, nothing seems to interact with pigotls when a QUIC CONNECTION_CLOSE frame is received.
I am surprised that initial the iovec_t has a length of zero but a valid pointer. The toIOVec()
function should set that to a meaningful value. You are printing c.alpnVec
inside each HandleMessage()
call, right ?
Mmh, at this point I suspect an odd compiler bug, as ptls_handle_message()
does not change the handshake_properties
. Maybe one way to side-step it would be to store the Go string and reset it first inside each HandleMessage()
call. It's not the best solution we could hope for though :/
If it's a compiler bug, then it's time for odd suggestions. Could you try preventing SetQUICTransportParameters()
from being called more than once ? I guess you don't change the transport parameters so it won't impact you. It's the only function that barely touches the handshake properties that I see.
Thank you for such a fast reply, I really do appreciate it!
Great! I saw that the other day. Do you envision building a tool like packetdrill ?
Something like packetdrill for QUIC would be really cool, although I think for my specific use case we might just need a more simple version, our goal is to do some active model learning, and see if we can build a state machine for different implementations, and see how they differ. As part of that, we need to be able to send any kind of packet at any point in the state machine (within what's technically possible), that being "valid" or not, in terms of the protocol rules.
You are printing c.alpnVec inside each HandleMessage() call, right ?
Well for this specific case I actually just edited the built-in Picotls code, added a print statement like so, and recompiled:
int ptls_handle_message(ptls_t *tls, ptls_buffer_t *sendbuf, size_t epoch_offsets[5], size_t in_epoch, const void *input,
size_t inlen, ptls_handshake_properties_t *properties)
{
+ fprintf(stderr, "[DEBUG C] Base: %p Len: %i\n", properties->client.negotiated_protocols.list->base, properties->client.negotiated_protocols.list->len);
return tls->is_server ? ptls_server_handle_message(tls, sendbuf, epoch_offsets, in_epoch, input, inlen, properties)
: ptls_client_handle_message(tls, sendbuf, epoch_offsets, in_epoch, input, inlen, properties);
}
Just something really simple to help with debugging, since I couldn't use breakpoints as to not introduce time-dependent behaviour.
Mmh, at this point I suspect an odd compiler bug, as ptls_handle_message() does not change the handshake_properties.
I agree, this might be something more low level like that. Especially since it only happens from time to time. I forgot to mention on this issue, but to reproduce this issue I actually have a script that executes "send 5 Initial packets with crypto frames", on a loop for 100 iterations. With the connection resetting (closing the connection and starting a new one) at the end of each iteration. Sometimes it goes through the 100 iterations without issues, other times the issue happens on the 1st or 5th, it's a bit unpredictable.
I adapted the code to only run SetQUICTransportParameters()
once, actually once for the lifetime of that Context
object, since I need to be able to reset the connection, and from then the state should be cleared. This seems to have shifted the issue to a further point of the function (progress, I guess!), and now I have the following backtrace:
#0 0x00007feffafffcfc in memcpy () from /lib/ld-musl-x86_64.so.1
#1 0x00000000006fe16b in ptls_buffer__do_pushv (len=3, src=0xa1, buf=0xc0003188c0)
at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:539
#2 ptls_buffer__do_pushv (buf=0xc0003188c0, src=0xa1, len=3) at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:531
#3 0x00000000006fe876 in push_additional_extensions (properties=properties@entry=0xc000222480, sendbuf=sendbuf@entry=0xc0003188c0)
at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:1504
#4 0x00000000007091e5 in send_client_hello (tls=tls@entry=0xdb6ce0, emitter=emitter@entry=0x7fefd3d6d960,
properties=properties@entry=0xc000222480, cookie=cookie@entry=0x0) at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:1966
#5 0x000000000070b9e2 in ptls_client_handle_message (tls=0xdb6ce0, sendbuf=<optimized out>, epoch_offsets=<optimized out>, in_epoch=0,
input=<optimized out>, inlen=<optimized out>, properties=0xc000222480) at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:5321
#6 0x00000000006eac72 in _cgo_9311b982cc12_Cfunc_ptls_handle_message (v=0xc0003c7370) at cgo-gcc-prolog:342
#7 0x0000000000467ce0 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:655
#8 0x0000000000462f86 in runtime.mallocgc.func1 () at /usr/local/go/src/runtime/malloc.go:1047
#9 0x000000c000316300 in ?? ()
#10 0x000000000043e290 in ?? () at <autogenerated>:1
#11 0x0000000000000000 in ?? ()
If we zoom into frame 3, we see this:
#3 0x00000000006fe876 in push_additional_extensions (properties=properties@entry=0xc000222480, sendbuf=sendbuf@entry=0xc0003188c0)
at /go/src/github.com/tiferrei/pigotls/picotls/lib/picotls.c:1504
1504 buffer_push_extension(sendbuf, ext->type, { ptls_buffer_pushv(sendbuf, ext->data.base, ext->data.len); });
(gdb) l
1499 int ret;
1500
1501 if (properties != NULL && properties->additional_extensions != NULL) {
1502 ptls_raw_extension_t *ext;
1503 for (ext = properties->additional_extensions; ext->type != UINT16_MAX; ++ext) {
1504 buffer_push_extension(sendbuf, ext->type, { ptls_buffer_pushv(sendbuf, ext->data.base, ext->data.len); });
1505 }
1506 }
1507 ret = 0;
1508 Exit:
Looking at line 1504
, if we display *ext
, we see the following:
*ext = {type = 65, data = {base = 0xa1 <error: Cannot access memory at address 0xa1>, len = 3}}
This base address is not a null pointer, but it doesn't look like a correct memory address either. Oddly, it is also always base 0xa1
and Len 3
. Let's see where this came from. Looking at line 1503
, ext
is a pointer and points to the additional_extensions
, it then gets its address incremented until UINT16_MAX
, why it does that, I don't know, it's probably correct, I'm just not that familiar with the implementation. But if we look at our additional_extensions
, we see this:
*properties->additional_extensions = {type = 65445, data = {base = 0xdb8080 "\005\004\200", len = 47}}
Where that leaves us.... I'm not too sure.
Could you execute your program with valgrind or an equivalent memory checker ? This will give us a hint on the memory corruption source. I wrote most of the pigotls code early in my understanding of Go and C, so we might also catch a subtle bug that went unoticed at that time.
Hmm yes, let's see if it finds something odd. I fear it might interfere with the experiment and not just observe it though, since Valgrind is known for delaying execution by quite a while, which might affect the response of the server. Nonetheless it should point us in the right direction. I'll report back once I have some findings.
So, running valgrind turned out to not be super helpful since it slowed down execution quite considerably, to the point where I think it might have prevented the issue from happening. In the end it reported a couple hundred bytes as definitely lost, however I don't think that is particularly meaningful.
A surprising twist though – I set the properties again before running handle_message, as you've suggested:
func (c *Connection) HandleMessage(data []byte, epoch Epoch) ([]Message, bool, error) {
var sendbuf C.ptls_buffer_t
C.ptls_buffer_init(&sendbuf, unsafe.Pointer(C.CString("")), 0)
defer C.ptls_buffer_dispose(&sendbuf)
var recbuf unsafe.Pointer = nil
var inputLen C.size_t
if data != nil {
recbuf = C.CBytes(data)
defer C.free(recbuf)
inputLen = sizeofBytes(data)
}
var epoch_offsets [5]C.size_t
+ C.set_handshake_properties(c.Context.handshakeProperties, &c.Context.alpnVec, &c.Context.resumptionTicketVec, c.Context.maxEarlyData)
ret := C.ptls_handle_message(c.tls, &sendbuf, (*C.size_t)(unsafe.Pointer(&epoch_offsets)), C.size_t(epoch), recbuf, inputLen, c.Context.handshakeProperties)
retBuf := bufToSlice(sendbuf)
var messages []Message
for i, e := range Epochs {
if epoch_offsets[i+1] > epoch_offsets[i] {
messages = append(messages, Message{retBuf[epoch_offsets[i]:epoch_offsets[i+1]], e})
}
}
if ret != 0 && ret != C.PTLS_ERROR_IN_PROGRESS {
return nil, false, Error{int(ret)}
}
return messages, ret != 0, nil
}
This has, as far as I can see, fixed the issue. I've ran the program for 1000 iterations and not crashes whatsoever, and consistent results too.
To me this tells me there's something wrong going on with the low levels of CGO.
I would be ok to merge a fix with this solution. It effectively sounds like the problem lies within CGO. I would have preferred to find the root cause of the bug, but we've spent just enough time investigating to start considering easier alternatives :) Maybe you could restrain this call based on the epoch argument? I guess you only need to call it if its the initial epoch.
Hey, an update on this. I tried restraining the call to only the initial epoch, and although it still reduced the occurrence of the issue greatly, it did still happen occasionally after some tries. I'm now confirming if calling it on every handle message (pity I know) would solve the issue permanently.
Hi there,
I'd like to start by thanking you for the work done on both pigotls and quic-tracker. Both are the most flexible and modular QUIC and TLS implementations I've ever come across.
Some background, I've been trying to adapt quic-tracker to allow me to send QUIC packets on demand, that meaning that I request that it send, say, 3 Initial packets with a crypto frame, and it does just that, waits for the response, and reports back on the results. This relies mostly on the usage of the
FrameQueue
and the modular agents.After using this "on demand adapter" for a couple of times, I started seeing that sometimes,
C.ptls_handle_message
would get called, and in the passed handshake properties,properties.client.negotiated_protocols.list
would have abase
of a null pointer0x0
, but a length of4
, which would naturally crash the program.I found this a bit strange, since (I believe) the negotiated protocols in the properties are set on a new connection, and stayed the same over the duration of that connection. Nonetheless, just to be sure I tried hardcoding the ALPN string input on a new
Context
:Oddly this didn't seem to make a difference, as it seems like
C.set_handshake_properties
would always be called with valid parameters, however, for some reason,C.ptls_handle_message
would still occasionally be passed properties with a null ALPN vector, and it would change without havingC.set_handshake_properties
being called.I "detected" this mostly by printing the value of the vector on
C.ptls_handle_message
and onC.set_handshake_properties
, so we can see the value of the vector change as the functions are called. Oddly I got the following trace:C.set_handshake_properties
was not called in between the value changing, as it did not execute its print I coded in. This leads me to believe that something in the middle is changing the properties, and oddly, when this happens its always a null pointer base with length 4.I'm still trying to figure out what exactly causes this change in value. As far as I can see nothing else should be able to change the negotiated protocols portion of the properties. Would you be able to provide some insight as to why this might be happening?
For further context:
Thank you! Tiago