Improve recently seen address cache

hg commented 2 years ago

See #373.

Cache directory is recreated as necessary. While it may be annoying for some users, I don't think it has any real downsides (besides maybe adding a small amount of flash wear on old routers), and there are lots of existing configurations out there that could benefit from this cache. I don't expect many users to read release notes for all software they use.

hg commented 2 years ago

This also flushed out an old bug which I've often run into in the past:

setup node pub with a publicly available IP and port.
put node nat behind NAT or firewall.
join them somehow so they know each other's addresses.
pub starts connecting to nat and fails because of firewall/NAT.
meanwhile, nat successfully connects back to pub.
pub still thinks it has a pending outgoing connection to nat.
pub adds nat's public address to address cache (in pong handler, graph.c, or any other place), not realizing it's using a "client" port which won't be available anymore after the current connection is closed.

One might wonder why node nat is being given a public address when it cannot accept incoming connections, but it makes sense if there are multiple nodes behind the firewall. External nodes can be configured to ignore them, at the cost of much additional complexity in Ansible playbooks.

Here's how it looks in logs:

2022-05-20 18:12:13 NOTICE  Connection from x.y.z.q port 60310
2022-05-20 18:12:13 NOTICE  Trying to re-establish outgoing connection in 5 seconds
2022-05-20 18:12:13 NOTICE  Connection with foo (x.y.z.q port 60310) activated
2022-05-20 18:12:18 INFO    Already connected to foo
2022-05-20 18:12:18 DEBUG   Caching recent address for foo

After restarting tincd it uses wrongly cached ports (both nodes should use 655):

2022-05-20 18:14:56 DEBUG   Error while connecting to foo (x.q.z.y port 35870): Connection refused
2022-05-20 18:14:57 DEBUG   Error while connecting to bar (x.y.z.q port 60310): Connection refused

The last commit attempts to fix that. It may be completely wrong because I'm not sure why we're retrying outgoing connections if there already is a working connection.

gsliepen commented 2 years ago

I'm getting a failure in the fs test:

not ok 10 - test_fopenmask_new
# 0x1e8 != 0x1c0
# ../test/unit/test_fs.c:210: error: Failure!

The reason is that I have my umask set to 077. The fs unit tests should probably call umask(0) before doing anything else.

gsliepen commented 2 years ago

Thanks, nice work again! One thing I want to note though is that you often add unrelated commits to a pull request. For example, in this PR you fix an outgoing connection issue and add Markdown reformatting to lint.py. Those have nothing to do with the address cache. The refactoring of filesystem-related functions is somewhat related, as it cleans up some of the code dealing with making directories, but it could also have been in a separate PR. Splitting them makes reviewing easier and increases the likelihood that I can merge individual PRs earlier.

gsliepen / tinc

Improve recently seen address cache #388