stbrody commented 1 year ago

Checklist

[X] This is a bug report, not a question. Ask questions on discuss.ipfs.io.
[X] I have searched on the issue tracker for my bug.
[X] I am running the latest kubo version or have an issue updating.

Installation method

ipfs-update or dist.ipfs.tech

Version

It's a spread. Different nodes on the network are running different versions. We cannot control when all node operators upgrade their nodes.

Config

n/a

Description

Incident report from 3Box Labs (Ceramic) Team

Incident summary

The Ceramic pubsub topic has been experiencing a flood of pubsub messages beyond our usual load for the last several days. We log every pubsub message we receive on the nodes that we run, and analysis of those logs using LogInsights shows that we are receiving messages with the exact same seqno multiple times: one message can show up upwards of 15 times in an hour. During normal operation we do not see this issue with seqnos showing up multiple times. This dramatic increase in the number of messages that need processing puts excess load on our nodes and is causing major performance problems, even with as much caching and de-duplication as we can do at our layer.

Evidence of the issue

Graph of our incoming pubsub message activity showing how the number of messages spiked way up a few days ago. The rate before 2/20 was our normal, expected amount of traffic:

AWS LogInsights Query demonstrating how the majority of this increased traffic is due to seeing the same message (with the same seqno) re-delivered multiple times. Before the spike we never saw a msg_count greater than 2.

Steps to reproduce

Connect to the gossipsub topic /ceramic/mainnet. Observe the messages that come in, and keep track of the number of times you see a message with each seqno. Over the span of an hour, you'll see the same message with the same seqno delivered multiple times.
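The bookkeeping described in these steps can be sketched roughly as follows. This is a minimal illustration, not the tooling we actually run: the `Message` type, `DupCounter`, and the choice to key on (origin peer, seqno) rather than seqno alone are assumptions made for the example, and wiring it to a real gossipsub subscription is left out.

```go
package main

import "fmt"

// Message is a hypothetical, simplified stand-in for a received pubsub
// message; only the fields needed for duplicate tracking are modeled.
type Message struct {
	From  string // origin peer ID
	Seqno uint64 // sequence number from the message envelope
}

// msgKey identifies a message by its origin peer and seqno.
type msgKey struct {
	from  string
	seqno uint64
}

// DupCounter counts how many times each (peer, seqno) pair was delivered.
type DupCounter struct {
	counts map[msgKey]int
}

// NewDupCounter returns an empty counter.
func NewDupCounter() *DupCounter {
	return &DupCounter{counts: make(map[msgKey]int)}
}

// Observe records one delivery and returns the updated delivery count
// for that message.
func (d *DupCounter) Observe(m Message) int {
	k := msgKey{from: m.From, seqno: m.Seqno}
	d.counts[k]++
	return d.counts[k]
}

// Redelivered returns how many distinct messages were delivered more
// than threshold times.
func (d *DupCounter) Redelivered(threshold int) int {
	n := 0
	for _, c := range d.counts {
		if c > threshold {
			n++
		}
	}
	return n
}

func main() {
	d := NewDupCounter()
	deliveries := []Message{
		{From: "peerA", Seqno: 1},
		{From: "peerA", Seqno: 1},
		{From: "peerA", Seqno: 1},
		{From: "peerB", Seqno: 1},
		{From: "peerA", Seqno: 2},
	}
	for _, m := range deliveries {
		d.Observe(m)
	}
	fmt.Println("messages delivered more than twice:", d.Redelivered(2))
}
```

A threshold of 2 mirrors the observation above that, before the spike, no message had a msg_count greater than 2.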

Historical context

We have seen this happen before. In fact, it has happened to us several times over the last year, and we've reported it to PL multiple times. You can see our original report here (at the time we were still using js-ipfs): https://github.com/libp2p/js-libp2p/issues/1043. When this happened again after we had migrated to go-ipfs, we reported it again, this time on Slack: https://filecoinproject.slack.com/archives/C025ZN5LNV8/p1661459082059149?thread_ts=1661459082.059149&cid=C025ZN5LNV8

We have since discovered a bug in how go-libp2p-pubsub maintained the seenMessage cache and worked to get a fix into kubo 0.18.1: https://github.com/libp2p/go-libp2p-pubsub/issues/502

We have updated our nodes to 0.18.1, but of course we have no direct control over what versions of ipfs/kubo the rest of the nodes on the Ceramic network are running, so even if the above bugfix would resolve the issue if every single node on the network were to upgrade to it, we have no real way to enforce that and no idea how long it will be (if ever) before there are no older ipfs nodes participating in our pubsub topic. Not to mention the possibility of a malicious node connecting to our pubsub topic and publishing a large volume of bogus messages (or re-broadcasting valid messages). So no matter what, we need a way to respond to incidents like this that goes beyond "get your users to upgrade to the newest kubo and pray that that makes the problem go away", which has been what we've been told every time we've reported this issue so far.

Our request from Protocol Labs

This is an extremely severe incident that has affected us multiple times over the last year. It strikes without warning and leaves our network crippled. Every previous time this happened it cleared up on its own within a day or so, but this one has been going on for 5 days now without letting up. We need some way to respond to incidents like this, and to potential malicious attacks in the future where someone intentionally floods our network with pubsub traffic.

So our questions for PL are:

What short term options can we take on the nodes that we operate, or that we can tell our community of node operators to take, to resolve this immediate issue that is currently affecting our production network?
Do you have any tools for inspecting the p2p network that would let us identify which node(s) are the source of the issue? If we knew that the issue was because of one problematic node that was, for instance, running a very old version of ipfs or running on very underpowered hardware, we could potentially reach out to them directly and get them to upgrade or take down their node. Or perhaps we could tell existing node operators to block connections from that problematic peer.
What is the recommended way in general to respond to nodes that (either intentionally through malice or accidentally through a bug) spam a pubsub topic with bogus or re-broadcast messages?
What additional steps can we take going forward to prepare our network to be more resilient to issues like this in the future?

Thank you for your time and attention to this important issue!

-Spencer, Ceramic Protocol Engineer

stbrody commented 1 year ago

This is a companion report to a report that was already filed against go-libp2p-pubsub here: https://github.com/libp2p/go-libp2p-pubsub/issues/524

The main advice we received from @vyzo working on go-libp2p-pubsub was to utilize libp2p validators for our incoming pubsub messages. Is that something that kubo exposes currently?

Also, even if the libp2p validators are exposed via kubo APIs, it will take time to set up the right ones and get our community of node operators to upgrade to utilize them. In the meantime our network is still in a bad state right now so I'm very interested in any ideas for what we can do in the short term to reach a stable state again.

Thank you!

vyzo commented 1 year ago

A possible solution that can solve your current predicament is to provide a default validator in ipfs that uses the message envelope seqno as the nonce.

This works for all peers/topics, and can reject/ignore old messages without needing to look into the message payload itself.

This can be implemented quite easily and provide the bandaid needed to stop the bleeding and buy time to design the proper abstractions for a validator api.

vyzo commented 1 year ago

A bit more on the mechanics of this.

The proposed validator relies on message signing and use of the seqno, which are default features used by ipfs. This means that every message has an origin peer, a signature (validated by pubsub lib itself) and a monotonically incrementing seqno in the envelope (constructed by the pubsub lib, initialized with the current unix timestamp in nanos and incremented thereafter).

Thus we can use this seqno as a per-peer nonce and extinguish the kind of floods that are being observed here. Note that the per-peer nonce shouldn't in general be kept in memory (modulo caching), to avoid creating a DoS vector.

The validator can be registered for every topic that is joined, or we can add an API in pubsub for default validators for all topics.
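The core check of such a validator can be sketched as below. This is an illustration under stated assumptions, not the go-libp2p-pubsub implementation: `NonceStore`, `MemStore`, `EncodeSeqno`, and `Accept` are hypothetical names, the seqno is assumed to be an 8-byte big-endian counter, signature verification is assumed to have happened already, and the in-memory store is used only to keep the demo short (per the note above, a real store would be persistent). The sketch is also not safe for concurrent use.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// NonceStore is a hypothetical per-peer store holding the highest seqno
// accepted so far for each origin peer.
type NonceStore interface {
	Get(peer string) (nonce uint64, ok bool, err error)
	Put(peer string, nonce uint64) error
}

// MemStore is an in-memory NonceStore for demonstration only.
type MemStore map[string]uint64

func (m MemStore) Get(peer string) (uint64, bool, error) {
	n, ok := m[peer]
	return n, ok, nil
}

func (m MemStore) Put(peer string, n uint64) error {
	m[peer] = n
	return nil
}

// ErrBadSeqno is returned when the seqno is not 8 bytes long.
var ErrBadSeqno = errors.New("seqno must be 8 bytes")

// EncodeSeqno encodes n as an 8-byte big-endian seqno (assumed wire form).
func EncodeSeqno(n uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, n)
	return b
}

// Accept reports whether a message from peer with the given seqno is new.
// A message is accepted only if its seqno is strictly greater than the
// last seqno accepted from that peer; replays and older messages are
// rejected without looking at the payload.
func Accept(store NonceStore, peer string, seqno []byte) (bool, error) {
	if len(seqno) != 8 {
		return false, ErrBadSeqno
	}
	n := binary.BigEndian.Uint64(seqno)
	last, ok, err := store.Get(peer)
	if err != nil {
		return false, err
	}
	if ok && n <= last {
		return false, nil // replayed or stale message
	}
	if err := store.Put(peer, n); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	store := MemStore{}
	for _, n := range []uint64{100, 101, 101, 99, 102} {
		ok, err := Accept(store, "peerA", EncodeSeqno(n))
		fmt.Println("seqno", n, "accepted:", ok, "err:", err)
	}
}
```

Because the seqno starts at a nanosecond timestamp and only increases, a strict "greater than the last accepted value" comparison is enough to drop re-broadcasts of old messages from the same origin peer.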

smrz2001 commented 1 year ago

Thanks for the ideas, @vyzo! This is very helpful.

So we could expose RegisterTopicValidator and UnregisterTopicValidator like we did with WithSeenMessagesTTL (but as a function, of course)?

vyzo commented 1 year ago

I plan on working on this tmrw, as it is holiday here today. Unless you want to pick it up today of course.

I will add a new api to pubsub for default validators, and write a default validator that behaves as described in kubo. No need to expose anything new.

smrz2001 commented 1 year ago

> I plan on working on this tmrw, as it is holiday here today. Unless you want to pick it up today of course.
>
> I will add a new api to pubsub for default validators, and write a default validator that behaves as described in kubo. No need to expose anything new.

Would really appreciate the help, @vyzo 🙏🏼 Our whole team is all-hands-on-deck debugging and testing in preparation for a major product launch at Eth Denver over the next couple of days (hence our concern about Pubsub flooding at this time).

p-shahi commented 1 year ago

The current plan: kubo maintainers will try to get out a patch release that includes Vyzo's proposed solution this week.

stbrody commented 1 year ago

Really appreciate the attention here @vyzo and @p-shahi!!!

BigLep commented 1 year ago

So there are a couple of parties involved here:

vyzo@ for the go-libp2p-pubsub work
ipfs/kubo-maintainers for doing a Kubo release.

This is on the Kubo maintainers' 2023-02-28 standup agenda so we can figure out what we can do here. We understand this is time sensitive for you all. We'll update the issue after standup (by 19UTC on 2023-02-28).

Jorropo commented 1 year ago

Datastore API we want the pubsub validator to consume: https://pkg.go.dev/github.com/ipfs/go-datastore#Datastore (with a wrapper on the Kubo side).
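One possible shape for such a wrapper is sketched here. Everything in it is an assumption for illustration: `KV` is a reduced, hypothetical stand-in for a datastore-style Get/Put interface (the contexts and key types of the real API are omitted), and `PeerMetaStore`, `MapKV`, `ErrNotFound`, and the key prefix are invented names, not Kubo's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel returned for missing keys.
var ErrNotFound = errors.New("key not found")

// KV is a hypothetical, reduced key-value interface standing in for a
// datastore; only Get and Put are modeled.
type KV interface {
	Get(key string) ([]byte, error)
	Put(key string, value []byte) error
}

// MapKV is an in-memory KV used only for this demonstration.
type MapKV map[string][]byte

func (m MapKV) Get(key string) ([]byte, error) {
	v, ok := m[key]
	if !ok {
		return nil, ErrNotFound
	}
	return v, nil
}

func (m MapKV) Put(key string, value []byte) error {
	m[key] = append([]byte(nil), value...) // store a copy of the value
	return nil
}

// PeerMetaStore adapts a KV to per-peer metadata access by namespacing
// every peer ID under a fixed key prefix.
type PeerMetaStore struct {
	kv     KV
	prefix string
}

// NewPeerMetaStore wraps kv; prefix separates pubsub metadata from
// other data kept in the same store.
func NewPeerMetaStore(kv KV, prefix string) *PeerMetaStore {
	return &PeerMetaStore{kv: kv, prefix: prefix}
}

// Get returns the metadata stored for peer, or nil if there is none.
func (s *PeerMetaStore) Get(peer string) ([]byte, error) {
	v, err := s.kv.Get(s.prefix + peer)
	if errors.Is(err, ErrNotFound) {
		return nil, nil // a missing entry is not an error for callers
	}
	return v, err
}

// Put stores the metadata for peer.
func (s *PeerMetaStore) Put(peer string, value []byte) error {
	return s.kv.Put(s.prefix+peer, value)
}

func main() {
	store := NewPeerMetaStore(MapKV{}, "/pubsub/seqno/") // prefix is hypothetical
	if err := store.Put("peerA", []byte{0, 0, 0, 0, 0, 0, 0, 42}); err != nil {
		fmt.Println("put failed:", err)
		return
	}
	v, err := store.Get("peerA")
	fmt.Println("stored bytes:", v, "err:", err)
}
```

Mapping "not found" to a nil value lets the validator treat a first-time peer the same way as any other lookup, without special-casing the storage error.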

BigLep commented 1 year ago

Notes from 2023-02-28 conversation with @Jorropo and @vyzo : @vyzo is going to make the go-libp2p-pubsub changes today (2023-02-28). I'll ask @MarcoPolo to review. @vyzo will make the change in Kubo tomorrow (2023-03-01). He'll do this change against 0.18.1. @Jorropo will review. @galargh will cut a patch release. This will happen on Thursday, 2023-03-02, European morning at the latest.

BigLep commented 1 year ago

Ceramic: a few things:

General: Kubo maintainers / PL EngRes is working to help support your launch/announcements this week.
Is there a Slack channel where you usually engage with PL folks? (Apologies for not finding it as I know we have had recent conversations here. We can also talk in #ipfs-operators.)
Per above we're aiming to get a patch release for you on Thursday European time. You can track this here: https://github.com/ipfs/kubo/issues/9679
After you get through your event, I think it's worth having an architecture conversation as on the surface I'm a bit worried about Kubo the binary being a critical part of your setup. I believe since the past conversation when you all moved from js-ipfs the landscape has changed including:
- Additional investment / changes in js-libp2p including full DHT support and up-and-coming js-ipfs replacement in Helia (not saying though you need to go back to JS world - just wanting to make sure you're aware)
- Ability to more easily customize Kubo with FX plugins
- go-libipfs effort for empowering groups to build the ipfs implementation they want

I'd like to educate myself and other team members on your use cases/needs and ensure we're steering you in a direction that best supports you.

vyzo commented 1 year ago

The pubsub side of things is here: https://github.com/libp2p/go-libp2p-pubsub/pull/525 This provides the necessary API support and an implementation, modulo an interface to the datastore.

stbrody commented 1 year ago

Thanks @vyzo @BigLep for all your work driving this forward! Wanted to let you all know that at the end of the day yesterday the pubsub flood resolved, so we are no longer in an active crisis situation at the moment. That said, this has happened before and will likely happen again, so we're still very interested in getting the fixes you all are working on to help protect us going forward. And we'd absolutely love to have an architecture review and discuss how we can improve the way we build on kubo/libp2p to set us up for success in the future.

Also @BigLep, we do have a shared channel in the filecoin slack: https://filecoinproject.slack.com/archives/C01V5AWPF97

vyzo commented 1 year ago

Pubsub release has been cut in https://github.com/libp2p/go-libp2p-pubsub/releases/tag/v0.9.2

I will hook it up to ipfs next.

BigLep commented 1 year ago

@stbrody : you bet. Given you're not in an active crisis situation, we are just going to include this in 0.19 (RC is planned for tomorrow). Because of dependency updates, it is unfortunately more challenging to backport to a 0.18.2. We can do it if necessary but want to avoid the legwork if it isn't needed. (Doing a patch release also involves some manual work so we'll save there as well.)

Concerning architecture review, let's connect after Denver. Feel free to reach out on FIL Slack when you're back and settled.

stbrody commented 1 year ago

FYI @vyzo @BigLep - this is happening to us on our mainnet again. It seems to have started about 28 hours ago, when everyone on our team was traveling back from EthDenver.

BigLep commented 1 year ago

@stbrody : ACK. @Jorropo is working to get CI to pass so https://github.com/ipfs/kubo/pull/9684 can be merged.

BigLep commented 1 year ago

@stbrody : we've run into some snags here. @Jorropo will get these moved to a top-level issue tomorrow (2023-03-10). Assuming you're all back from EthDenver, I want to get the ball rolling on understanding your needs and how we unblock you since per above we aren't going to merge the PR fix into master.

Can you please invite @BigLep , @Jorropo , and @lidel into a collaboration channel in Filecoin Slack to help with the coordination? (Or if you want me to create a new channel I can do that.) (I can't currently access https://filecoinproject.slack.com/archives/C01V5AWPF97 - I assume it's private.)

BigLep commented 1 year ago

Slack access has been granted and I have moved the conversation there for scheduling.

ipfs / kubo

Pubsub flood due to same message propagated multiple times #9665
