geigerzaehler opened this issue 2 years ago
Is it the same `remote_addr` always?
> Is it the same `remote_addr` always?
Yes it is. Once I’m connected to more peers this happens with every peer. Also the Oid for the waiting room warning is always the same.
I don’t know of anything which would trigger packets at a 200ms interval, except perhaps an amplification effect of something malfunctioning with the “waiting room”.
@geigerzaehler If you could conjure up debug logs from seed that correspond to the upstream peer that would help with narrowing down the interaction that is causing this.
I’ve noticed that once my peer is connected to multiple peers I’m seeing this log line for every peer. This puts significant pressure on the CPU, resulting in 50% utilization. This is a high-severity issue for Upstream users, since the app and their machines become unresponsive.
Does this then rule out a potential issue in the seed node/library?
> @geigerzaehler If you could conjure up debug logs from seed that correspond to the upstream peer that would help with narrowing down the interaction that is causing this.
Here’s the seed log just before the issue starts. Afterwards the log basically repeats itself.
A couple of things I noticed when I investigated this and that may contribute to the issue:

* Are the calls to `kv` for persistence blocking on IO? If so, this might be a problem because we don’t spawn them on a separate thread (`tokio::task::spawn_blocking` instead of `unblocking`) throughout the daemon codebase (see the sketch after this list).
* `daemon::peer::announcement::run` does read and write to the store, but the calls are not synchronized. It is possible that two calls of `run()` execute concurrently and mess with each other.
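A minimal sketch of both points, using stand-in types rather than the daemon's actual ones: the blocking store write is pushed onto tokio's blocking pool via `tokio::task::spawn_blocking`, and a `Mutex` serialises overlapping `run()` calls so two runs cannot interleave their read-modify-write against the store.

```rust
use std::sync::{Arc, Mutex};
use tokio::task;

// Hypothetical stand-in for the daemon's persistent store.
type Store = Arc<Mutex<Vec<String>>>;

// Sketch only: not the real signature of daemon::peer::announcement::run.
async fn run(store: Store, announcement: String) {
    task::spawn_blocking(move || {
        // Holding the lock for the whole read/write cycle keeps concurrent
        // `run()` invocations from interleaving their store access.
        let mut guard = store.lock().expect("store lock poisoned");
        // Synchronous disk IO would happen here, off the async worker threads.
        guard.push(announcement);
    })
    .await
    .expect("blocking persistence task panicked");
}
```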
Seems like https://github.com/radicle-dev/radicle-client-services/issues/5 is related, which doesn't have a waiting room or complex behavior. It seems like peers are establishing large amounts of connections to other peers, something we've been seeing on the seed node for a while.
It could either be malicious behavior, or some kind of feedback loop.
In any case it might help to limit the amount of ingress streams per `PeerId`, as a mitigation.
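Such a cap could be as simple as a per-peer counter consulted before spawning a stream handler. The types below are illustrative stand-ins, not librad's actual connection handling:

```rust
use std::collections::HashMap;

// Illustrative only: refuse new ingress streams once a peer holds its quota.
struct IngressGuard {
    open: HashMap<String, usize>, // PeerId -> number of open streams
    max_per_peer: usize,
}

impl IngressGuard {
    fn try_accept(&mut self, peer_id: &str) -> bool {
        let count = self.open.entry(peer_id.to_owned()).or_insert(0);
        if *count >= self.max_per_peer {
            return false; // over quota: drop the stream instead of spawning
        }
        *count += 1;
        true
    }

    fn release(&mut self, peer_id: &str) {
        if let Some(count) = self.open.get_mut(peer_id) {
            *count = count.saturating_sub(1);
        }
    }
}
```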
Something I find weird about this log is that a peer keeps advertising the same `Have` repeatedly:
```
Jul 28 10:00:52.008 DEBUG accept{peer_id=hybz9gfgtd9d4pd14a6r66j5hz6f77fed4jdu7pana4fxaxbt369kg}:incoming:incoming{remote_id=hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw remote_addr=188.120.235.3:12345}:tock{tock=AttemptSend { to: GenericPeerInfo { peer_id: hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [188.120.235.3:12345], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [188.120.235.3:12345], min: 0, max: 16 } }, message: Membership(Neighbour { info: PeerAdvertisement { listen_addrs: Within { inner: [10.156.15.198:12345, 127.0.0.1:12345], min: 0, max: 16 }, capabilities: {} }, need_friends: None }) }}:incoming{remote_id=hyd7cb39wws6jz5wmifzwtjn8u57siw6knipqttk6nhc657r1fozaw remote_addr=188.120.235.3:12345}:tock{tock=SendConnected { to: hybdj8d47ew35471nu4gs7pco8p9xib1iksdmtygy43ghyu5ypaumg, message: Gossip(Have { origin: GenericPeerInfo { peer_id: hydd5idud5sbtkdmb8cmuprpf3rsj4ajper3j4738f9efw7nunpgxe, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [127.0.0.1:37109, 192.168.4.41:37109], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(1183cf230ec31058fa955e69b4172086161641de), path: Some(RefLike("rad/id")) }, rev: Some(Git(b804589da7f790d471c8c1870d880f7ac6b471d9)), origin: None } }) }}: librad::net::protocol::tick: tock
```
Doing `rg b804589da7f790d471c8c1870d880f7ac6b471d9 seed.log | wc` will bring up 64 occurrences. They also happen very close to each other.
> In any case it might help to limit the amount of ingress streams per `PeerId`, as a mitigation.
But radicle-link recently added rate limiting, so this is even more curious :thinking:
Rate limiting is limited due to the stream-per-message model. That was a bad idea, I'm changing it. At the end of the day, a peer flooding the network cannot really be defended against, unless we auto-ban them permanently.
Any update on this? This issue often makes Upstream unusable :(
> Any update on this? This issue often makes Upstream unusable :(
I thought the last thing we talked about was Upstream updating `librad` to this commit https://github.com/radicle-dev/radicle-link/commit/cb91ccdd07d08276112c876a670941f9be8f18ef and seeing how that might improve usage?
Also heard from @cloudhead on matrix that it was sprout misbehaving. Did any further action follow from this?
> I thought the last thing we talked about was Upstream updating `librad` to this commit cb91ccd and seeing how that might improve usage?
That didn’t help unfortunately. We’re still seeing this issue.
> Also heard from @cloudhead on matrix that it was sprout misbehaving.
I was also seeing this issue with setzling. And this was after I restarted it and was the only peer connected to it.
> I was also seeing this issue with setzling. And this was after I restarted it and was the only peer connected to it.
And there was no sign of some peer repeatedly making requests?
> I thought the last thing we talked about was Upstream updating librad to this commit cb91ccd and seeing how that might improve usage?
>
> That didn’t help unfortunately. We’re still seeing this issue.
Yeah no, the effect of this can only kick in when most peers are running on that version. Only then can we tighten the flow control limits to drop packets from a logorrheic peer on the floor.
> Also heard from @cloudhead on matrix that it was sprout misbehaving.
>
> I was also seeing this issue with setzling. And this was after I restarted it and was the only peer connected to it.
So, what are we seeing exactly? That `seed` nodes are flooding, or that `upstream` nodes are flooding as soon as they are connected to a `seed`? Is it perhaps that the `seed` should check if the project is already tracked here [0], because the call to `track` is idempotent?
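If that guard is what's missing, the fix could look roughly like the sketch below. The types are stand-ins (the real librad tracking API differs); the idea is to consult local state first, so the idempotent `track` call is not re-issued, and does not re-trigger any announcements, for projects that are already tracked.

```rust
use std::collections::HashSet;

// Illustrative seed-side guard with stand-in types.
struct Seed {
    tracked: HashSet<(String, String)>, // (urn, peer_id) pairs already tracked
}

impl Seed {
    fn ensure_tracked(&mut self, urn: &str, peer_id: &str) {
        // `insert` returns false if the pair was already present, so the
        // actual `track` call runs at most once per (urn, peer) pair.
        if self.tracked.insert((urn.to_owned(), peer_id.to_owned())) {
            // ... perform the real `track` call here ...
        }
    }
}
```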
I’m not able to reproduce this issue anymore if I’m the only one connected to a seed node. But I still see this issue if other peers are connected. After adding some logging statements I see the following when I connect to sprout.
```
Aug 18 16:54:25.173 INFO librad::net::peer::storage: gossip received, has: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None }, provider: hyde43uaaqz5bnh8kpqfwiwx7inxzdmmtiy1ndtf6zazfrcny3ossr
    at /home/thomas/code/radicle-link/librad/src/net/peer/storage.rs:189
    in librad::net::peer::storage::put with has: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None }
    in librad::net::protocol::broadcast::apply with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc, Have { origin: GenericPeerInfo { peer_id: hyde43uaaqz5bnh8kpqfwiwx7inxzdmmtiy1ndtf6zazfrcny3ossr, advertised_info: PeerAdvertisement { listen_addrs: Within { inner: [127.0.0.1:57843, 192.168.1.108:57843], min: 0, max: 16 }, capabilities: {} }, seen_addrs: Within { inner: [], min: 0, max: 16 } }, val: Payload { urn: Urn { id: Oid(8cc143b5b061d9882726564702017b5f94bb0f1a), path: Some(RefLike("rad/id")) }, rev: Some(Git(f2949476d30de0ee650598ba16aee4a4601157da)), origin: None } }
    in librad::net::protocol::io::streams::incoming with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc, remote_addr: 35.187.60.160:12345
    in librad::net::protocol::io::discovered with remote_id: hynkyndc6w3p8urucakobzna7sxwgcqny7xxtw88dtx3pkf7m3nrzc
    in librad::net::protocol::accept::disco
    in librad::net::protocol::accept with peer_id: hyb9fda6y8tuwqn73wfnh1qf4m9cau714o5xf14t7wgwbkdhs138ay
```
This log line is repeated every couple of milliseconds. As you can see, the message arrives from the seed but originates from another node.
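One way to bound the cost of such repeats, sketched here with hypothetical types (librad's broadcast layer may already do a variant of this): remember recently seen `(origin, urn, rev)` triples and drop exact duplicates before they reach storage or the waiting room.

```rust
use std::collections::HashSet;

// Illustrative dedup cache; a real implementation would also evict old
// entries to keep memory bounded.
struct SeenCache {
    seen: HashSet<(String, String, String)>, // (origin PeerId, urn, rev)
}

impl SeenCache {
    /// Returns true only the first time a given announcement is seen.
    fn first_sighting(&mut self, origin: &str, urn: &str, rev: &str) -> bool {
        self.seen
            .insert((origin.to_owned(), urn.to_owned(), rev.to_owned()))
    }
}
```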
Could one way to limit the impact of this be to have the waiting room not react to uninteresting messages?
> Could one way to limit the impact of this be to have the waiting room not react to uninteresting messages?
The thing is that if you've requested an `Urn` you're not guaranteed to know which `PeerId` it'll come from, so that counts as "uninteresting" in the gossip world. However, it's interesting to the waiting room because it found a (possibly) valid peer to replicate from.
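That distinction could be expressed roughly like this (hypothetical types, not the actual waiting room code): interest is keyed on the `Urn` alone, because the provider was unknown at request time, so any peer announcing a requested `Urn` is a replication candidate.

```rust
use std::collections::HashSet;

// Illustrative waiting room interest check.
struct WaitingRoom {
    requested: HashSet<String>, // Urns we have asked the network for
}

impl WaitingRoom {
    fn is_interesting(&self, urn: &str) -> bool {
        // Deliberately no PeerId check: every provider of a requested Urn
        // matters here, even if gossip considers the message uninteresting.
        self.requested.contains(urn)
    }
}
```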
Why do you think this could be the waiting room anyway? :thinking: Maybe I'm missing something you see in the logs there? :eyes:
> Why do you think this could be the waiting room anyway?
I’m not saying the waiting room causes the issue. But it seems odd to me that the messages put so much pressure on the CPU. I was just thinking about limiting the impact of messages that don’t have an effect.
> So, what are we seeing exactly? That `seed` nodes are flooding, or that `upstream` nodes are flooding as soon as they are connected to a `seed`? Is it perhaps that the `seed` should check if the project is already tracked here [0], because the call to `track` is idempotent?
If anything this is probably the source of some unwanted traffic and we should definitely fix it.
I must say that this waiting room logic escapes me mostly. It is probably the case that the bandaid of screaming "have" into the forest should just be removed -- unfortunately, #653 is still stalled, because replication v3 is more important right now.
In the spirit of graceful centralisation, I could imagine the following workaround: since most clients (`upstream` clients for sure) bootstrap through a "central" seed, we could have some kind of no-forward flag on wants/haves, which would simply cause the seed to respond if it has the wanted data, but not forward the message.
That's kind of interesting - is this a soft-fork?
> is this a soft-fork?
I'm not sure what you mean by that.
Hehe I mean is it a backwards compatible change?
Yeah sure. It would be an optional field, which should just be ignored by older clients.
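As a sketch of why such an optional field stays backwards compatible (this uses serde for illustration; radicle-link's actual wire codec may differ): the field is omitted when unset, absent fields decode to a default, and older decoders that don't know the field simply ignore it.

```rust
use serde::{Deserialize, Serialize};

// Illustrative message shape, not librad's real gossip types.
#[derive(Serialize, Deserialize)]
struct Have {
    urn: String,
    rev: Option<String>,
    /// When set, a seed that has the data responds directly instead of
    /// forwarding the message into the gossip network. Omitted on the wire
    /// when `None`, so old clients see exactly the message they expect.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    no_forward: Option<bool>,
}
```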
Don’t we have some rate limiting in place that should prevent a peer being flooded by messages from another peer?
> Don’t we have some rate limiting in place that should prevent a peer being flooded by messages from another peer?
Yes we do, and I saw in one of the earlier log captures you posted that it kicks in. Might be worth playing with the thresholds.
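To make "playing with the thresholds" concrete, here is an illustrative token bucket of the kind such limiters are commonly built on (not radicle-link's actual limiter): `rate` and `burst` are the tunable knobs, and messages from a peer whose bucket is empty get dropped on the floor.

```rust
use std::time::Instant;

// Illustrative per-peer token bucket.
struct TokenBucket {
    capacity: f64, // burst: how many messages may arrive back-to-back
    tokens: f64,
    rate: f64,     // tokens replenished per second (sustained rate)
    last: Instant,
}

impl TokenBucket {
    fn new(rate: f64, burst: f64) -> Self {
        Self { capacity: burst, tokens: burst, rate, last: Instant::now() }
    }

    /// Consume one token if available; returns false when the message
    /// should be dropped.
    fn allow(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```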
When connecting to either `setzling` or `sprout` we see the following log line every 200ms.

In addition, we also see the following log line for a while until it goes away (maybe after replicating a first project).
This happens even if I start Upstream with a clean slate and I don’t replicate any projects.
I’ve noticed that once my peer is connected to multiple peers I’m seeing this log line for every peer. This puts significant pressure on the CPU, resulting in 50% utilization. This is a high-severity issue for Upstream users, since the app and their machines become unresponsive.
Steps to reproduce
RAD_HOME