libp2p / specs

Technical specifications for the libp2p networking stack
https://libp2p.io

gossipsub: feature-request: Optional reason code for why a peer was pruned #555

Open. MarcoPolo opened this issue 1 year ago

MarcoPolo commented 1 year ago

It would be very helpful for debugging and health monitoring of the network to know why a peer pruned us. Even if they only tell us that we were pruned because our score became negative, that would be helpful.

I don't think there's a security issue here since a node can essentially infer it is misbehaving if many peers prune at once. This just makes that explicit.
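For illustration only, here is a minimal Go sketch of what an optional reason code on a PRUNE could look like. The gossipsub v1.1 PRUNE control message carries a topic ID, an optional peer-exchange list and an optional backoff; the `Reason` field, the `PruneReason` codes and their names below are hypothetical and are not part of any spec. Making the field a pointer keeps it optional: peers that never set it are read as "unspecified", so older implementations stay compatible.

```go
package main

import "fmt"

// PruneReason is a HYPOTHETICAL reason code a peer could attach to a PRUNE.
// It is not part of the gossipsub spec; it only illustrates the proposal.
type PruneReason uint8

const (
	// ReasonUnspecified is what the receiver assumes when the field is absent
	// (for example, the sender is an older implementation).
	ReasonUnspecified PruneReason = iota
	// ReasonNegativeScore: the sender's score for us dropped below zero.
	ReasonNegativeScore
	// ReasonOversubscribed: the sender's mesh grew above its upper degree bound.
	ReasonOversubscribed
)

func (r PruneReason) String() string {
	switch r {
	case ReasonNegativeScore:
		return "negative_score"
	case ReasonOversubscribed:
		return "oversubscribed"
	default:
		return "unspecified"
	}
}

// ControlPrune mirrors part of the gossipsub v1.1 PRUNE message (topic ID and
// backoff; peer exchange omitted) plus the hypothetical optional Reason.
type ControlPrune struct {
	TopicID string
	Backoff uint64       // backoff in seconds
	Reason  *PruneReason // nil when the sender did not set a reason
}

// ReasonOrDefault treats a missing reason as ReasonUnspecified.
func (p ControlPrune) ReasonOrDefault() PruneReason {
	if p.Reason == nil {
		return ReasonUnspecified
	}
	return *p.Reason
}

func main() {
	r := ReasonNegativeScore
	withReason := ControlPrune{TopicID: "blocks", Backoff: 60, Reason: &r} // "blocks" is an example topic
	legacy := ControlPrune{TopicID: "blocks", Backoff: 60}
	fmt.Println(withReason.ReasonOrDefault(), legacy.ReasonOrDefault()) // negative_score unspecified
}
```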

This came up while debugging https://github.com/filecoin-project/lotus/issues/10906, and @shrenujbansal suggested this. It would be useful to know that a peer gave us a negative score, because that would hint that we did something wrong.

vyzo commented 1 year ago

I don't think this is particularly useful; most likely you got pruned because of a negative score.

shrenujbansal commented 1 year ago

Correct me if I'm wrong, but I believe you can also get pruned if an existing peer has too many peers (above the high watermark) and prunes a bunch of peers as a result.

Being able to tell whether several peers were pruning you because of a negative score, caused by some activity on your side, would be very useful in debugging issues like lotus #10906, where your node is not receiving blocks and loses sync as a result. If we could see this number tick up in a Grafana dashboard, it would immediately give us more clues about what is going on, rather than having to figure it out through lots of speculation, additional logging and experiments.
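To make the Grafana idea concrete, here is a small Go sketch of a counter for PRUNEs received from other peers, grouped by topic and by reason. It assumes the hypothetical reason code proposed in this issue exists and arrives as a string; `PruneCounter`, `Observe` and `Count` are illustrative names, and exporting the counts to a metrics backend is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// PruneCounter counts PRUNEs received from remote peers, keyed by topic and by
// the HYPOTHETICAL reason string the sender attached. A metrics exporter could
// read these counts so a dashboard can plot, e.g., the "negative_score" series.
type PruneCounter struct {
	mu     sync.Mutex
	counts map[string]map[string]uint64 // topic -> reason -> count
}

func NewPruneCounter() *PruneCounter {
	return &PruneCounter{counts: make(map[string]map[string]uint64)}
}

// Observe records one received PRUNE. An empty reason means the sender did not set one.
func (c *PruneCounter) Observe(topic, reason string) {
	if reason == "" {
		reason = "unspecified"
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	byReason, ok := c.counts[topic]
	if !ok {
		byReason = make(map[string]uint64)
		c.counts[topic] = byReason
	}
	byReason[reason]++
}

// Count returns how many PRUNEs were seen for a topic and reason (0 if none;
// reading from a nil inner map is safe in Go).
func (c *PruneCounter) Count(topic, reason string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[topic][reason]
}

func main() {
	c := NewPruneCounter()
	c.Observe("blocks", "negative_score") // "blocks" is an example topic
	c.Observe("blocks", "negative_score")
	c.Observe("blocks", "")
	fmt.Println(c.Count("blocks", "negative_score"), c.Count("blocks", "unspecified")) // 2 1
}
```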

vyzo commented 1 year ago

The metric you care about is number of prunes and you already have that. Your thinking that all peers are pruning you because of oversubscription is probably going into the realm of the extremely unlikely.

If you see a massive prune spike in your metrics, you should be thinking score. Having the extra bit won't add much.

In short, I don't think having this field will help you with your problem, and it is an information leak of sorts that I am very reluctant to add.

vyzo commented 1 year ago

I should add that even if you are pruned because of oversubscription, the score is factored in: you are getting pruned because you have a lower score than your peers. So there is no clear distinction between the two, and you can't even define the difference without going deep into internal state, at which point you start to leak information.
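The overlap between the two causes can be seen in a simplified Go sketch of one heartbeat's pruning decision for a single topic. This is an assumption-laden simplification, not go-libp2p-pubsub code: negative-score peers are always dropped, and when the mesh is still above `dHigh` the lowest-scoring peers are dropped until `d` remain. As we read the v1.1 spec, the real router only keeps the best `D_score` peers by score, picks the rest at random and preserves outbound peers; those details are omitted here. `meshPeer` and `pruneMesh` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// meshPeer is a HYPOTHETICAL view of one mesh member as the local router sees it.
type meshPeer struct {
	ID    string
	Score float64
}

// pruneMesh returns the peers to PRUNE, split by cause.
// Simplified: the real v1.1 router also randomizes part of the selection and
// keeps outbound peers; both are omitted here.
func pruneMesh(mesh []meshPeer, d, dHigh int) (negative, oversubscribed []string) {
	var kept []meshPeer
	for _, p := range mesh {
		if p.Score < 0 {
			negative = append(negative, p.ID)
		} else {
			kept = append(kept, p)
		}
	}
	if len(kept) > dHigh {
		// Highest score first, so the tail holds the lowest-scoring peers.
		sort.SliceStable(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
		for _, p := range kept[d:] {
			oversubscribed = append(oversubscribed, p.ID)
		}
	}
	return negative, oversubscribed
}

func main() {
	mesh := []meshPeer{{"a", 5}, {"b", -1}, {"c", 3}, {"d", 0.5}, {"e", 4}}
	neg, over := pruneMesh(mesh, 2, 3)
	fmt.Println(neg, over) // [b] [c d]
}
```

In the example, `c` and `d` are pruned "for oversubscription", yet they are exactly the lowest-scoring survivors, so a reason code of "oversubscribed" would still partly be a statement about score.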

shrenujbansal commented 1 year ago

Quoting @vyzo: "The metric you care about is number of prunes and you already have that. Your thinking that all peers are pruning you because of oversubscription is probably going into the realm of the extremely unlikely. If you see a massive prune spike in your metrics, you should be thinking score."

One thing I want to confirm: when you say "number of prunes", do you mean the number of prunes sent by the current node, or the number of prunes of the current node by others? We want to see the latter if possible.

@MarcoPolo, @vyzo mentions that we already have a number-of-prunes metric. Is it already visible in Grafana, or could it easily be made visible?
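The distinction being asked about can be kept explicit by counting the two directions separately. A minimal Go sketch, with hypothetical names (`Direction`, `PruneStats`): "sent" would be incremented when the local router emits a PRUNE and "received" when it handles one from a peer; only the latter answers how often other peers are dropping us from their mesh. It uses `sync/atomic.Uint64`, which needs Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Direction records who initiated a PRUNE (HYPOTHETICAL type for illustration).
type Direction int

const (
	PrunedByUs     Direction = iota // we sent a PRUNE to a peer
	PrunedByRemote                  // a peer sent a PRUNE to us
)

// PruneStats keeps sent and received PRUNE counts separately.
type PruneStats struct {
	sent     atomic.Uint64
	received atomic.Uint64
}

// Record increments the counter for the given direction; safe for concurrent use.
func (s *PruneStats) Record(dir Direction) {
	if dir == PrunedByRemote {
		s.received.Add(1)
	} else {
		s.sent.Add(1)
	}
}

func (s *PruneStats) Sent() uint64     { return s.sent.Load() }
func (s *PruneStats) Received() uint64 { return s.received.Load() }

func main() {
	var s PruneStats
	s.Record(PrunedByUs)
	s.Record(PrunedByRemote)
	s.Record(PrunedByRemote)
	fmt.Println("sent:", s.Sent(), "received:", s.Received()) // sent: 1 received: 2
}
```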

vyzo commented 1 year ago

It is definitely possible, although lotus might not have the right metric at the moment.

shrenujbansal commented 1 year ago

Quoting my earlier question: "vyzo mentions we already have the number of prunes metric. Is this something also immediately visible on Grafana, or can it be made visible easily?"

@MarcoPolo do you have any idea?

MarcoPolo commented 1 year ago

It would go here: https://github.com/filecoin-project/lotus/blob/master/node/modules/lp2p/pubsub.go#L562, alongside the existing stats.Record calls. I don't think this is implemented.
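A rough Go sketch of counting the PRUNEs other peers send us, assuming some hook that sees each incoming RPC. The `rpc`, `controlMessage` and `controlPrune` types are minimal stand-ins for the gossipsub protobuf (an RPC may carry a control message, which may carry several PRUNEs, one per topic), not the go-libp2p-pubsub types, and `countIncomingPrunes` is a hypothetical helper. The `record` callback marks where a metrics call, such as the stats.Record calls mentioned above, could go; no metrics library is used here.

```go
package main

import "fmt"

// Minimal stand-ins for the gossipsub RPC protobuf (HYPOTHETICAL types).
type controlPrune struct{ TopicID string }
type controlMessage struct{ Prune []controlPrune }
type rpc struct {
	From    string
	Control *controlMessage // nil when the RPC carries no control message
}

// countIncomingPrunes calls record once per PRUNE in an RPC received from a
// remote peer and returns how many it saw. In a real node, record is where a
// metrics call would go.
func countIncomingPrunes(msg rpc, record func(from, topic string)) int {
	if msg.Control == nil {
		return 0
	}
	for _, p := range msg.Control.Prune {
		record(msg.From, p.TopicID)
	}
	return len(msg.Control.Prune)
}

func main() {
	perTopic := map[string]int{}
	record := func(from, topic string) { perTopic[topic]++ }

	// Peer IDs and topic names are examples only.
	msgs := []rpc{
		{From: "peerA", Control: &controlMessage{Prune: []controlPrune{{"blocks"}, {"msgs"}}}},
		{From: "peerB"}, // no control message
		{From: "peerC", Control: &controlMessage{Prune: []controlPrune{{"blocks"}}}},
	}
	total := 0
	for _, m := range msgs {
		total += countIncomingPrunes(m, record)
	}
	fmt.Println(total, perTopic["blocks"], perTopic["msgs"]) // 3 2 1
}
```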