Reconnections fail (perhaps only via tor)

tsjk commented 1 year ago

I've recently started to use this. I set up a teosd service with tor support (using communication with tor's control port). The cln client (v22.11) always uses a proxy. The problem is that it works for a while and then just stops. I get repeated log lines like

UNUSUAL plugin-watchtower-client: <tower_id> is unreachable. Adding <id> to pending

This goes on for hours. Abandoning the tower on the client side and re-registering it makes it work for a while again - without touching the server. The client and server are on different systems, different Internet links, and use different tor daemons (for clarity). Both sides use tor v0.4.7.13.

sr-gi commented 1 year ago

It may be helpful if you could provide some Tor logs too, I'm pretty clueless otherwise.

Also, does this happen only with your tower? Have you tried others running over Tor? (e.g. https://github.com/talaia-labs/rust-teos/discussions/158#discussion-4599480)

tsjk commented 1 year ago

I currently only use my own tower. Regarding tor logs, I assume you mean on the client. But, I have logging set to notice, and neither the server nor the client says anything. I could try to increase the debug verbosity?

sr-gi commented 1 year ago

> I currently only use my own tower. Regarding tor logs, I assume you mean on the client. But, I have logging set to notice, and neither the server nor the client says anything. I could try to increase the debug verbosity?

I actually meant logs from the Tor daemon (on the client side indeed). You may also increase deps verbosity to see if something is being logged in the plugin logs.

tsjk commented 1 year ago

deps verbosity?

sr-gi commented 1 year ago

> deps verbosity?

Oh sorry, nvm, we only have that distinction on the tower side, not on the client side (the tower side only logs lines regarding dependencies if they are above warning).

Increasing the debug verbosity might help indeed.

tsjk commented 1 year ago

Eh. Well. We'll see. I was actually a bit sloppy and just used the previously running tor relay on the system for CLN. But, when wanting to send debug info I kind of didn't like the idea of sending debug data from a live relay, and so I migrated my CLN away from the tor relay to a tor client. Since then I haven't observed the problem - of course! (>_<) I still get disconnects, but now it says

INFO    plugin-watchtower-client: Retrying tower <tower_id>
...
...
INFO    plugin-watchtower-client: Retry strategy succeeded for <tower_id>

after a few seconds... Previously I didn't see it retrying at all. I'll hold off a bit and update when (if?) the problem re-appears.

sr-gi commented 1 year ago

> INFO    plugin-watchtower-client: Retrying tower <tower_id>
> ...
> ...
> INFO    plugin-watchtower-client: Retry strategy succeeded for <tower_id>
>
> after a few seconds... Previously I didn't see it retrying at all. I'll hold off a bit and update when (if?) the problem re-appears.

This is closer to the expected behavior. If a POST request times out or cannot reach its destination, a retrier is created and the data is passed to it. The retrier implements an exponential backoff strategy and keeps trying until the data is finally delivered or it gives up.
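
For readers unfamiliar with the pattern, here is a minimal sketch of a retry loop with exponential backoff. It is not the rust-teos retrier: the `Backoff` struct, its fields, and the `retry` function are hypothetical names chosen for illustration, and the real plugin's delays and give-up conditions may differ.

```rust
use std::time::Duration;

/// Hypothetical retry policy. Names and fields are illustrative only
/// and are not taken from the rust-teos code base.
#[derive(Debug, Clone, Copy)]
pub struct Backoff {
    pub initial: Duration,
    pub multiplier: u32,
    pub max_delay: Duration,
    pub max_attempts: u32,
}

impl Backoff {
    /// Delay to wait before retry number `attempt` (0-based),
    /// or `None` once the retry budget is exhausted.
    pub fn delay(&self, attempt: u32) -> Option<Duration> {
        if attempt >= self.max_attempts {
            return None;
        }
        // Grow the delay geometrically, saturating instead of overflowing.
        let factor = self.multiplier.checked_pow(attempt).unwrap_or(u32::MAX);
        let delay = self.initial.checked_mul(factor).unwrap_or(self.max_delay);
        Some(delay.min(self.max_delay))
    }
}

/// Calls `send` until it succeeds or the policy gives up, returning the
/// last error in the latter case. `sleep` is injected so the loop can be
/// exercised without actually waiting.
pub fn retry<T, E>(
    policy: &Backoff,
    mut send: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match send() {
            Ok(value) => return Ok(value),
            Err(err) => match policy.delay(attempt) {
                Some(delay) => {
                    sleep(delay);
                    attempt += 1;
                }
                None => return Err(err),
            },
        }
    }
}
```

In real use `sleep` would be `std::thread::sleep` (or an async timer) and `send` would perform the proxied POST request. The point relevant to this thread is that every retry simply calls `send` again; nothing in such a loop keeps per-tower connection state between attempts.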

tsjk commented 1 year ago

I still think something is amiss here. I've only quickly glanced through some logs (set to info - debug logs are insanely hard to follow), and my intuition got me wondering whether it'd be useful to request a new circuit at reconnection attempts. I was thinking that perhaps an existing tower has some tor connection info associated with it. This would explain why abandonment followed by re-registration works. I noticed intro_point_is_usable(): Intro point with auth key [scrubbed] had an error. Not usable during disconnect - which made me ask if the retry mechanism retries the wrong thing here. Will try to provide useful logs as time allows.

sr-gi commented 1 year ago

> I was thinking that perhaps an existing tower has some tor connection info associated with it. This would explain why abandonment followed by re-registration works.

I don't think I follow

> I noticed intro_point_is_usable(): Intro point with auth key [scrubbed] had an error. Not usable during disconnect - which made me ask if the retry mechanism retries the wrong thing here. Will try to provide useful logs as time allows.

Code-wise we don't do anything out of the ordinary here. If a proxy is provided, we just proxy the request through it. I'm not an expert using Tor though so I may be missing something. Let me know if there is anything I can help with, I've been running a tower on Tor for months so if something is iffy I may be able to find some useful logs.

tsjk commented 1 year ago

Yeah, ok. What I was thinking is that when the tower is abandoned and re-registered, the reconnection attempt is different from a retry. I have no detailed insights into tor either (and I haven't checked what you do in the code), but afaik re-creating the connection to the tor socks proxy will result in requesting a new circuit, while re-use won't. So, if a broken circuit is what makes the reconnect fail, abandonment will most likely discard the connection, and re-registration will create a new connection to the socks proxy, thereby resulting in a new circuit being built.

mariocynicys commented 1 year ago

> Yeah, ok. What I was thinking is that when the tower is abandoned and re-registered, the reconnection attempt is different from a retry.

Nope, the only piece of network related info we store for a tower is its onion address. No circuit/connection info. Thus, abandoning a tower shouldn't do anything special.

One question though: can an application using Tor as a proxy only (no access to control port) request a new circuit?

tsjk commented 1 year ago

> > Yeah, ok. What I was thinking is that when the tower is abandoned and re-registered, the reconnection attempt is different from a retry.
>
> Nope, the only piece of network related info we store for a tower is its onion address. No circuit/connection info. Thus, abandoning a tower shouldn't do anything special.
>
> One question though: can an application using Tor as a proxy only (no access to control port) request a new circuit?

I think I already answered your question above. :)

mariocynicys commented 1 year ago

> I think I already answered your question above. :)

My bad xD.

> but afaik re-creating the connection to the tor socks proxy will result in requesting a new circuit

If so, this means we actually use a new circuit for every POST request: https://github.com/talaia-labs/rust-teos/blob/886e0fffe5cb0e6f0e87edfefd60f4a322a8c6ee/watchtower-plugin/src/net/http.rs#L166-L172

tsjk commented 1 year ago

I don't know how it works behind the scenes. Some APIs cache and reuse sockets. I think the magic on the tor side relies on the client address (this does not work for unix domain sockets, but that can be disregarded here). So, to get a new circuit one needs to change the source port of the call.
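
On the earlier question of whether a proxy-only application (no control port access) can ask for a separate circuit: besides the client address, Tor's SocksPort by default also isolates streams that present different SOCKS5 usernames/passwords (the `IsolateSOCKSAuth` flag). The sketch below builds a proxy URL with throwaway credentials for each attempt. It is only an illustration under stated assumptions: `isolated_proxy_url` and `NEXT_ID` are hypothetical names, `127.0.0.1:9050` is just the usual default SocksPort, the tower id is assumed to be URL-safe (e.g. hex), and nothing like this exists in the plugin as far as this thread shows. Whether a fresh circuit would actually cure the intro point error mentioned above is also untested.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter used to make each set of credentials unique.
/// (Hypothetical helper, not part of rust-teos.)
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Builds a SOCKS5 proxy URL carrying throwaway credentials.
///
/// `proxy_addr` is the address of Tor's SocksPort, for example
/// "127.0.0.1:9050". `tower_id` is assumed to contain only URL-safe
/// characters. Tor does not validate the credentials; with the default
/// `IsolateSOCKSAuth` behaviour it only uses them to decide which
/// streams may share a circuit.
pub fn isolated_proxy_url(proxy_addr: &str, tower_id: &str) -> String {
    let n = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    // `socks5h` asks the proxy to resolve the hostname itself,
    // which .onion addresses require.
    format!("socks5h://{}-{}:x@{}", tower_id, n, proxy_addr)
}
```

An HTTP client that accepts `socks5h://user:pass@host:port` proxy URLs could be handed a new URL from this function on every retry, so that each attempt is isolated from the previous one without touching the control port.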

talaia-labs / rust-teos

Reconnections fail (perhaps only via tor) #205