turt2live / matrix-bot-sdk

TypeScript/JavaScript SDK for Matrix bots

E2BE-enabled AppServices crash after a while (IO error: could not acquire lock) #293

Open jaller94 opened 1 year ago

jaller94 commented 1 year ago

Describe the bug An appservice that enables E2BE capabilities may crash after a few hours. The crash seems to be triggered by a timer.

https://github.com/matrix-org/matrix-hookshot/issues/609

To Reproduce

  1. Write an encrypted appservice, e.g. Hookshot (a minimal setup sketch follows these steps)
  2. Run it for a while (1 to 4 hours)
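For reference, a minimal E2BE-enabled appservice built directly on matrix-bot-sdk looks roughly like the sketch below. The tokens, namespace, ports, and storage paths are placeholders, and the exact option names (`cryptoStorage`, `intentOptions.encryption`, `RustSdkAppserviceCryptoStorageProvider`) are my assumption of the SDK surface at the time and may differ between versions.

```ts
import {
    Appservice,
    SimpleFsStorageProvider,
    RustSdkAppserviceCryptoStorageProvider,
} from "matrix-bot-sdk";

// Placeholder registration following the standard appservice registration format.
const registration = {
    id: "example-bridge",
    url: "http://localhost:9002",
    as_token: "CHANGE_ME",
    hs_token: "CHANGE_ME",
    sender_localpart: "examplebot",
    namespaces: {
        users: [{ regex: "@example_.*:localhost", exclusive: true }],
        rooms: [],
        aliases: [],
    },
};

const appservice = new Appservice({
    port: 9002,
    bindAddress: "0.0.0.0",
    homeserverName: "localhost",
    homeserverUrl: "http://localhost:8008",
    registration,
    storage: new SimpleFsStorageProvider("./data/appservice.json"),
    // The rust-sdk crypto store lives under this directory; the lock in the
    // error below is taken on files inside it.
    cryptoStorage: new RustSdkAppserviceCryptoStorageProvider("./data/encryption"),
    intentOptions: { encryption: true },
});

appservice.begin().then(() => console.log("Appservice running"));
```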

Expected behavior Don't crash.

Log snippet This Hookshot crash happened exactly one hour after starting the bot, making me think it was triggered by a timeout or some other internal event rather than by an incoming network event.

WARN 13:07:44:863 [Config] The `widgets.openIdOverrides` config value SHOULD NOT be used in a production environment.
INFO 13:07:44:873 [Appservice] Initialising Redis storage (on localhost:6379)
INFO 13:07:44:901 [Bridge] Starting up
INFO 13:07:44:901 [UserTokenStore] Loading token key file passkey.pem
INFO 13:07:44:920 [RedisASProvider] Successfully connected
INFO 13:07:44:942 [Bridge] Connecting to homeserver and fetching joined rooms..
INFO 13:07:44:966 [Config] Prefilling room membership for permissions for 0 rooms
INFO 13:07:45:174 [FeedConnection] Connection 0e343504299411063c7324cc5166baa8 created for !MBKReTGbuldfQteUbW:localhost, {"url":"https://chrpaul.de/feed.xml","label":"Christian Paul"}
INFO 13:07:45:176 [FeedConnection] Connection e125d0ada12e1f980c8dfdbaa4020344 created for !MBKReTGbuldfQteUbW:localhost, {"url":"https://www.tagesschau.de/xml/rss2/","label":"Tagesschau"}
INFO 13:07:45:232 [Bridge] Bridge is now ready. Found 2 connections
INFO 13:07:45:235 [ListenerService] Listening on http://0.0.0.0:9000 for webhooks
INFO 13:07:45:235 [ListenerService] Listening on http://127.0.0.1:9001 for metrics, provisioning
INFO 13:07:45:236 [ListenerService] Listening on http://0.0.0.0:9002 for widgets
INFO 13:09:15:146 [Appservice] Processing transaction 7935
INFO 13:09:15:147 [Appservice] Processing event of type m.room.message
INFO 13:09:15:160 [Appservice] 127.0.0.1 - - [10/Jan/2023:12:09:15 +0000] "PUT /transactions/7935?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.74.0"
…
INFO 13:35:15:521 [Appservice] Processing transaction 8038
INFO 13:35:15:521 [Appservice] Processing ephemeral event of type m.presence
INFO 13:35:15:522 [Appservice] 127.0.0.1 - - [10/Jan/2023:12:35:15 +0000] "PUT /transactions/8038?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.74.0"

ERROR 14:07:45:815 [Appservice] Failed to set up crypto on intent @hookshot:localhost [Error: IO error: could not acquire lock on "/home/jaller94/Git/matrix-hookshot/data/encryption/2d03b2ce09b8e551150e45d1418338aa0f43c531f79e8a34a1758bcbbbc850d0/matrix-sdk-crypto/db": Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }] {
  code: 'GenericFailure'
}
Error: IO error: could not acquire lock on "/home/jaller94/Git/matrix-hookshot/data/encryption/2d03b2ce09b8e551150e45d1418338aa0f43c531f79e8a34a1758bcbbbc850d0/matrix-sdk-crypto/db": Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
error Command failed with exit code 1.

Additional context I've seen the same issue in a project other than Hookshot. I don't have access to their code, but as Element employees we can request the code of this customer's appservice.

turt2live commented 1 year ago

My hunch is that this is the Intent object expiring but not cleaning up crypto properly, so when it gets recreated it hits a conflicting lock deep within the rust-sdk.

Half-Shot commented 1 year ago

Having looked at this, I'm unsure. The only expiry I can see for intents is the LRU cache, which defaults to 10000 entries. I'd certainly never expect an inactive hookshot to get close to that number. LRU doesn't expire objects until the maximum object threshold is reached, so my gut feeling is it's not expiry that's causing it.

That said: @jaller94 mentioned that the timeout was exactly an hour, and the intents have a TTL of an hour.

The sled database is the thing that's crying out here (https://github.com/spacejam/sled/issues/1234). It sounds like Rust would implicitly drop the DB when it falls out of scope, but I don't really know how that works in a JS world. I presume the Intent being dropped would cause the OlmMachine to drop, which would cause the Sled DB to drop, but maybe there is something missing here.
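To make the expiry question concrete, here is a small sketch of a cache configured with the defaults discussed above (10000 entries, a one-hour TTL). It assumes the `lru-cache` package's v10-style API, which may not be exactly what the SDK uses; the relevant behaviour is that a `get()` after the TTL treats the entry as stale and drops it, forcing the caller to recreate the Intent.

```ts
import { LRUCache } from "lru-cache";

// Hypothetical stand-in for an Intent that owns a native crypto store handle.
interface CachedIntent { userId: string; }

const intents = new LRUCache<string, CachedIntent>({
    max: 10_000,         // maximum entry count discussed above
    ttl: 60 * 60 * 1000, // one-hour TTL discussed above
});

intents.set("@bridge:localhost", { userId: "@bridge:localhost" });

// Within the TTL the cached Intent is returned as-is.
console.log(intents.get("@bridge:localhost"));

// Once the TTL elapses, get() treats the entry as stale and drops it even
// though the cache is nowhere near 10000 entries, so the caller recreates
// the Intent while the evicted one may still hold the store's file lock.
setTimeout(() => {
    console.log(intents.get("@bridge:localhost")); // undefined
}, 60 * 60 * 1000 + 1000);
```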

Half-Shot commented 1 year ago

Right, so. Lots of digging led me to realise that the LRU store will evict entries when their TTL is up, regardless of how many items are in the store. This leads to Intents being deleted and immediately recreated, which leads to the Sled DB conflicting with itself as the old instance hasn't been GC'd.
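One possible mitigation along these lines, sketched below on the assumption that the Intent cache is something like `lru-cache`: register a dispose hook so the evicted Intent's crypto store is closed before a replacement tries to reacquire the lock. The `closeCryptoStore()` method is hypothetical, and this is not what matrix-bot-sdk currently does.

```ts
import { LRUCache } from "lru-cache";

// Hypothetical shape of an Intent that owns a native crypto store.
interface EncryptedIntent {
    userId: string;
    closeCryptoStore(): Promise<void>; // hypothetical cleanup method
}

const intents = new LRUCache<string, EncryptedIntent>({
    max: 10_000,
    ttl: 60 * 60 * 1000,
    // Called whenever an entry leaves the cache (TTL, size pressure, or
    // explicit delete), giving us a chance to release the Sled lock before
    // the Intent is recreated.
    dispose: (intent, userId) => {
        intent.closeCryptoStore().catch((err) =>
            console.warn(`Failed to close crypto store for ${userId}`, err));
    },
});
```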

Ultimately, Sled is a bit poor for this use case and we should be using something like Redis or Postgres (both of which appear to be in progress for the rust-sdk). In the meantime, my proposal would be to either disable the LRU or check the lock status on the Sled DB prior to loading crypto.
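For the "disable the LRU" option, a sketch of what that could look like, assuming the appservice's `intentOptions` expose the cache size and age limits implied by the defaults above (`maxCached` / `maxAgeMs`) and that `IAppserviceOptions` is exported; the names may differ between SDK versions.

```ts
import type { IAppserviceOptions } from "matrix-bot-sdk";

// Effectively disable TTL-based eviction so an Intent's crypto store is never
// torn down and recreated while the process is running. Size-based eviction
// is kept as a backstop against unbounded growth.
const intentOptions: IAppserviceOptions["intentOptions"] = {
    maxCached: 10_000,
    maxAgeMs: Number.MAX_SAFE_INTEGER,
};
```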

Half-Shot commented 1 year ago

Checking the lock status is horrible in Node. I'm going to push for https://github.com/matrix-org/matrix-rust-sdk/pull/1256 support to ultimately fix our DB woes, but in the meantime look at disabling the LRU for encrypted bridges.

turt2live commented 1 year ago

Disabling the LRU has memory exhaustion concerns, fwiw. It currently patches over a memory leak that'd be ideal to fix properly, though admittedly just throwing objects away is also only a patch.

The rust-sdk's destructor should be able to handle the lock status internally?
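On the JS side there is no deterministic destructor to lean on: a `FinalizationRegistry` can run cleanup after an object is garbage-collected, but the timing is unspecified, so a native store's file lock may still be held when the replacement Intent opens the same database. A rough sketch (the `NativeCryptoStore` shape is hypothetical):

```ts
// Hypothetical handle onto the rust-sdk's store, with an explicit close().
interface NativeCryptoStore { close(): void; }

// Cleanup runs at some point after the registered Intent becomes unreachable,
// which may be long after the LRU evicted it - or never, if the process exits first.
const registry = new FinalizationRegistry<NativeCryptoStore>((store) => store.close());

function trackIntentStore(intent: object, store: NativeCryptoStore): void {
    registry.register(intent, store);
}
```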

turt2live commented 1 year ago

From an out-of-band discussion: Hookshot's use case is covered by disabling the LRU, with the consequences being known and accepted due to the urgency.

It's not clear why the rust-sdk isn't destructing itself properly, but the hope is we can just abandon Sled instead of thinking about it.

AndrewFerr commented 1 year ago

This is fixed by using the SQLite-based crypto store that was recently added to the rust-sdk.
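For anyone landing here later, switching the crypto store could look roughly like the sketch below; the store-type argument on `RustSdkAppserviceCryptoStorageProvider` and the `RustSdkCryptoStoreType.Sqlite` value are my assumptions about how the SDK exposes the new store, so check the exports of your SDK version.

```ts
import {
    RustSdkAppserviceCryptoStorageProvider,
    RustSdkCryptoStoreType,
} from "matrix-bot-sdk";

// SQLite-backed crypto store instead of Sled; the per-intent databases live
// under this base path.
const cryptoStorage = new RustSdkAppserviceCryptoStorageProvider(
    "./data/encryption",
    RustSdkCryptoStoreType.Sqlite,
);
```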

Half-Shot commented 1 year ago

I think we can close this now @turt2live

turt2live commented 1 year ago

I'm still seeing some crashes come up, though I agree that the majority of cases appear to be fixed.

AndrewFerr commented 1 year ago

@turt2live are the crashes you saw from t2bot bridges? Do you have any stack traces / error logs available?

turt2live commented 1 year ago

They were t2bot.io bridges and local development. I don't have stack traces on hand, sorry (there are other problems with the bridges which are taking my attention).