Closed iffy closed 1 year ago
I'm not sure i understand why you think that something was called by an error?
Your stacktrace
points into Line 177.
https://github.com/status-im/nim-chronos/blob/8fcbe716b2f069480277aa6781782d3e311ee2c0/chronos/streams/tlsstream.nim#L177
But for some reason you blame line 174, and why you think that sslEngineSendappBuf()
sends nil
if sslEngineCurrentState == SSL_SENDAPP
?
I mean that it is making a call to sslEngineSendappAck
when sslEngineSendappBuf
has returned NULL and it shouldn't do that. The BearSSL docs say not to call sslEngineSendappAck
with length 0. I put debugging statements in my copy of chronos and found that sslEngineSendappAck
was being called with length 0. So I put more debugging statements in and found that sslEngineSendappBuf
was returning nil
, which the docs say it will sometimes do. When I put in the fix in this PR, it no longer crashed.
Edit: I think the line numbers got mixed up because of the debugging statements I added.
Do you agree with the BearSSL docs that code should not call sslEngineSendappAck
with length 0?
I've removed my patch and reproduced the error with the correct line numbers:
Traceback (most recent call last)
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(174) tlsWriteApp
When I run it with this patch in place:
--- /tmp/tlssstream.nim 2023-02-06 14:40:26.000000000 -0500
+++ /Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim 2023-02-06 14:40:28.000000000 -0500
@@ -157,6 +157,7 @@
if item.size > 0:
var length = 0'u
var buf = sslEngineSendappBuf(engine[], length)
+ echo "DEBUG buf is nil? ", $buf.isNil
let toWrite = min(int(length), item.size)
copyOut(buf, item, toWrite)
if int(length) >= item.size:
@@ -171,6 +172,7 @@
item.offset = item.offset + int(length)
item.size = item.size - int(length)
writer.queue.addFirstNoWait(item)
+ echo "DEBUG length = ", $length
sslEngineSendappAck(engine[], length)
return TLSResult.Success
else:
I get this output:
DEBUG buf is nil? false
DEBUG buf is nil? true
DEBUG length = 0
Traceback (most recent call last)
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(176) tlsWriteApp
I'm not sure about this fix, i need time to investigate why sslEngineSendappBuf()
returns length == 0
.
@iffy could you please verify one assumption?
I need to dump value of writer.handshaked
variable when buf == nil
.
DEBUG buf is nil? false
DEBUG writer.handshaked: true
DEBUG buf is nil? true
DEBUG writer.handshaked: true
DEBUG length = 0
Traceback (most recent call last)
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(177) tlsWriteApp
SIGBUS: Illegal storage access. (Attempt to read from nil?)
For a little more context, this is happening when the connection is closing. It's at the end of some tests, after I've used the connection and I'm closing everything. It only happens sometimes which makes me wonder if some of the connections are still trying to send data while the connection is closing?
Edit: I would love to include a test case that demonstrates the error, but I've had a difficult time paring down my code and still causing the error (since it only happens sometimes).
That's interesting that it raises SIGBUS, not SIGSEGV :)
@iffy could you please check my version of the fix https://github.com/status-im/nim-chronos/pull/358 ?
@cheatfate With the fix-sigbus
branch c843db28084841d3f20c2cc2f9b6f676a54659a7 instead of SIGBUS I get this traceback (sometimes -- not every time):
Traceback (most recent call last)
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(1202) waitFor
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/asyncloop.nim(295) poll
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(362) tlsLoop
/Users/matt/.nimble/pkgs/chronos-3.0.11/chronos/streams/tlsstream.nim(108) newTLSStreamWriteError
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
I just want to explain why your fix is not good, but fixing SIGBUS.
With your fix you are not doing any changes to bearssl's state machine, and you got into this situation when tlsLoop()
will start looping and raise cpu usage to 100% because SSL_SENDAPP
flag is still present and code will continue creating new futures using tlsWriteApp()
. To avoid this we need to stop calling tlsWriteApp()
, because bearssl
is not going to accept outgoing data anymore.
@iffy i have updated #358 could you please check it one more time?
I just want to explain why your fix is not good, but fixing SIGBUS.
Thank you for the explanation -- that makes sense.
i have updated https://github.com/status-im/nim-chronos/pull/358 could you please check it one more time?
It appears to be fixed with fix-sigbus
a547800a9a60d30966e7af11c68b385bf77ca8c3! I ran my code 20 times in a row and it didn't fail at all. Usually it fails about 1/8 times, so I think it's good. Thank you for such a thorough fix!
@iffy i'm going to close this PR because i have merged #358, i'm very appreciated for this bug information and help, thank you very much!
Sounds good; thanks for the fix!
I have some code that sometimes ends up in a state where
sslEngineSendappAck
is being called withlength = 0
here https://github.com/status-im/nim-chronos/blob/8fcbe716b2f069480277aa6781782d3e311ee2c0/chronos/streams/tlsstream.nim#L174It immediately fails with:
The BearSSL docs for
br_ssl_engine_sendapp_ack
state:The BearSSL docs for
br_ssl_engine_sendapp_buf
state:I've verified that when
length == 0
the return value ofsslEngineSendappBuf
is indeednil
.I believe the
TLSResult.Success
value is the right return value in this case, because BearSSL might be ready later to receive the data (and it makes my code work), but I could be wrong on that point.