Open jaller94 opened 1 year ago
my hunch is this would be the Intent object expiring, but not cleaning up crypto properly, so when it gets recreated it has a conflicting lock deep within the rust-sdk.
Having looked at this, I'm unsure. The only expiry I can see for intents is the LRU cache, which defaults to 10000 entries. I'd certainly never expect a inactive hookshot to get close to that number. LRU
doesn't expire objects until the maximum object threshold is reached, so my gut feeling is it's not expiry that's causing it.
That said: @jaller94 mentioned that the timeout was exactly an hour, and the intents have a TTL of a hour.
The sled database is the thing that's crying out here (https://github.com/spacejam/sled/issues/1234). Sounds like Rust would impliclty cause the DB to be dropped when it falls out of scope, but I don't really know how that works in a JS world. I presume the act of the Intent dropping would cause the OlmMachine to drop, which would cause the Sled DB to drop but maybe there is something missing here.
Right, so. Lots of digging led me to realise that the LRU store will evict things when their ttl
is up, regardless of how many items are in the store. This leads to Intents being deleted and immediately recreated, which leads to the Sled DB conflicting with itself as it hasn't been GC'd.
Ultimately, Sled is a bit poor for this use case and we should be using something like Redis or Postgres (both of which appear to be in progress for the rust-sdk). In the meantime, my proposals would be to either disable the LRU, or check the lock status on the Sled DB prior to loading crypto.
Checking the lock status is horrible in Node. I'm going to push for https://github.com/matrix-org/matrix-rust-sdk/pull/1256 support to ultimately fix our DB woes, but in the meantime look at disabling the LRU for encrypted bridges.
Disabling the LRU has memory exhaustion concerns, fwiw. It currently patches over a memory leak that'd be ideal to fix, but so does just throwing away objects.
The rust-sdk's destructor should be able to handle the lock status internally?
from oob: hookshot's usecase is covered by disabling the LRU, with consequences being known due to urgency.
It's not clear why the rust-sdk isn't destructing itself properly, but the hope is we can just abandon Sled instead of thinking about it.
This is fixed by using the SQLite-based crypto store that was recently added to the rust-sdk.
vector-im
fork for now): https://github.com/vector-im/matrix-bot-sdk/pull/10I think we can close this now @turt2live
I'm still seeing some crashes come up, though agreed that the majority of cases appear to be fixed.
@turt2live are the crashes you saw from t2bot bridges? Do you have any stack traces / error logs available?
They were t2bot.io bridges and local development. I don't have stacktraces on hand, sorry (there's other problems with the bridges which are taking my attention).
Describe the bug An appservice that enables E2BE-capabilities, may crash after a few hours. It seems to be based on a timer.
https://github.com/matrix-org/matrix-hookshot/issues/609
To Reproduce
Expected behavior Don't crash.
Log snippet This Hookshot crash happened exactly one hour after starting the bot, making me think that it is a timeout or other internal event, instead of an incoming network event.
Additional context I've seen the same issue in a project different to Hookshot. I don't have access to their code, but as Element employees we can request the code of this customer appservice.