status-im / nim-chronos

Chronos - An efficient library for asynchronous programming
https://status-im.github.io/nim-chronos/
Apache License 2.0
362 stars 51 forks source link

Fix SIGBUS error caused by erroneously calling sslEngineSendappAck #356

Closed iffy closed 1 year ago

iffy commented 1 year ago

I have some code that sometimes ends up in a state where sslEngineSendappAck is being called with length = 0 here https://github.com/status-im/nim-chronos/blob/8fcbe716b2f069480277aa6781782d3e311ee2c0/chronos/streams/tlsstream.nim#L174

It immediately fails with:

/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(177) tlsWriteApp
SIGBUS: Illegal storage access. (Attempt to read from nil?)

The BearSSL docs for br_ssl_engine_sendapp_ack state:

The len parameter MUST NOT be 0, and MUST NOT exceed the value obtained in the br_ssl_engine_sendapp_buf() call.

The BearSSL docs for br_ssl_engine_sendapp_buf state:

If the engine is ready to accept application data to send to the peer, then this call returns a pointer to the buffer where such data shall be written, and its length is written in len. Otherwise, len is set to 0 and NULL is returned.

I've verified that when length == 0 the return value of sslEngineSendappBuf is indeed nil.

I believe the TLSResult.Success value is the right return value in this case, because BearSSL might be ready later to receive the data (and it makes my code work), but I could be wrong on that point.

cheatfate commented 1 year ago

I'm not sure i understand why you think that something was called by an error?

Your stacktrace points into Line 177. https://github.com/status-im/nim-chronos/blob/8fcbe716b2f069480277aa6781782d3e311ee2c0/chronos/streams/tlsstream.nim#L177

But for some reason you blame line 174, and why you think that sslEngineSendappBuf() sends nil if sslEngineCurrentState == SSL_SENDAPP?

iffy commented 1 year ago

I mean that it is making a call to sslEngineSendappAck when sslEngineSendappBuf has returned NULL and it shouldn't do that. The BearSSL docs say not to call sslEngineSendappAck with length 0. I put debugging statements in my copy of chronos and found that sslEngineSendappAck was being called with length 0. So I put more debugging statements in and found that sslEngineSendappBuf was returning nil, which the docs say it will sometimes do. When I put in the fix in this PR, it no longer crashed.

Edit: I think the line numbers got mixed up because of the debugging statements I added.

Do you agree with the BearSSL docs that code should not call sslEngineSendappAck with length 0?

iffy commented 1 year ago

I've removed my patch and reproduced the error with the correct line numbers:

Traceback (most recent call last)
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(174) tlsWriteApp
iffy commented 1 year ago

When I run it with this patch in place:

--- /tmp/tlssstream.nim 2023-02-06 14:40:26.000000000 -0500
+++ /Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim   2023-02-06 14:40:28.000000000 -0500
@@ -157,6 +157,7 @@
     if item.size > 0:
       var length = 0'u
       var buf = sslEngineSendappBuf(engine[], length)
+      echo "DEBUG buf is nil? ", $buf.isNil
       let toWrite = min(int(length), item.size)
       copyOut(buf, item, toWrite)
       if int(length) >= item.size:
@@ -171,6 +172,7 @@
         item.offset = item.offset + int(length)
         item.size = item.size - int(length)
         writer.queue.addFirstNoWait(item)
+        echo "DEBUG length = ", $length
         sslEngineSendappAck(engine[], length)
         return TLSResult.Success
     else:

I get this output:

DEBUG buf is nil? false
DEBUG buf is nil? true
DEBUG length = 0
Traceback (most recent call last)
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(176) tlsWriteApp
cheatfate commented 1 year ago

I'm not sure about this fix, i need time to investigate why sslEngineSendappBuf() returns length == 0.

cheatfate commented 1 year ago

@iffy could you please verify one assumption? I need to dump value of writer.handshaked variable when buf == nil.

iffy commented 1 year ago
DEBUG buf is nil? false
DEBUG writer.handshaked: true
DEBUG buf is nil? true
DEBUG writer.handshaked: true
DEBUG length = 0
Traceback (most recent call last)
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(177) tlsWriteApp
SIGBUS: Illegal storage access. (Attempt to read from nil?)

For a little more context, this is happening when the connection is closing. It's at the end of some tests, after I've used the connection and I'm closing everything. It only happens sometimes which makes me wonder if some of the connections are still trying to send data while the connection is closing?

Edit: I would love to include a test case that demonstrates the error, but I've had a difficult time paring down my code and still causing the error (since it only happens sometimes).

cheatfate commented 1 year ago

That's interesting that it raises SIGBUS, not SIGSEGV :)

cheatfate commented 1 year ago

@iffy could you please check my version of the fix https://github.com/status-im/nim-chronos/pull/358 ?

iffy commented 1 year ago

@cheatfate With the fix-sigbus branch c843db28084841d3f20c2cc2f9b6f676a54659a7 instead of SIGBUS I get this traceback (sometimes -- not every time):

Traceback (most recent call last)
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(362) tlsLoop
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(108) newTLSStreamWriteError
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
cheatfate commented 1 year ago

I just want to explain why your fix is not good, but fixing SIGBUS.

With your fix you are not doing any changes to bearssl's state machine, and you got into this situation when tlsLoop() will start looping and raise cpu usage to 100% because SSL_SENDAPP flag is still present and code will continue creating new futures using tlsWriteApp(). To avoid this we need to stop calling tlsWriteApp(), because bearssl is not going to accept outgoing data anymore.

cheatfate commented 1 year ago

@iffy i have updated #358 could you please check it one more time?

iffy commented 1 year ago

I just want to explain why your fix is not good, but fixing SIGBUS.

Thank you for the explanation -- that makes sense.

i have updated https://github.com/status-im/nim-chronos/pull/358 could you please check it one more time?

It appears to be fixed with fix-sigbus a547800a9a60d30966e7af11c68b385bf77ca8c3! I ran my code 20 times in a row and it didn't fail at all. Usually it fails about 1/8 times, so I think it's good. Thank you for such a thorough fix!

cheatfate commented 1 year ago

@iffy i'm going to close this PR because i have merged #358, i'm very appreciated for this bug information and help, thank you very much!

iffy commented 1 year ago

Sounds good; thanks for the fix!