Multiple collators started on same machine conflict on prometheus exporter default port 9616

iulianbarbu commented 2 months ago

Is there an existing issue?

[X] I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

[X] This is not a support question.

Description of bug

Description

I started zombienet natively with cumulus/zombienet/tests/0006_rpc-collator-builds-blocks.toml. The network topology has two collators with minimal relay chain nodes which try to setup prometheus by creating a thin hyper server that responds to /metrics requests to localhost:9616. The hyper server is separate from the similar hyper server for the parachain part of the same collator. Both collators will race towards binding to 9616, resulting in a failure for one of them, which is not necessarily obvious when looking at the logs of the collator which can not bind. The relaychain minimal node's prometheus exporter is somewhat hidden and can be found in the logs if successful when setting it up, or through open ports exploration, but either way, not very well documented and straightfoward to refer to.

To consider for a possible fix (maybe others too)

~~1. Expose the relay chain prometheus exporter ports through a CLI arg to be able to set it to specific unused ports. This is relevant for: a. testing with multiple collators on the same machine. b. node operators that want to set the relaychain prometheus exporter port to something else than 9616. c. node operators/devs who don't know the relaychain prometheus exporter exists and find it by reading the code (or by reading the logs if its setup succeeds). If this is documented a bit more clearly in public docs this point can be ignored.~~ (We have support to pass a --prometheus-port flag to the relaychain part of the args received by a polkadot-parachain node already, which makes this point irrelevant).

2. We could also bind to a random unused port, don't act on 1 and let devs/operators rely on code reading and logs. It isn't very DX friendly. I would avoid this considering the first point would fix this the good way and it isn't very complex either. (First point is not relevant and this one doesn't matter anymore).

Emit an error when the prometheus exporter listener binding fails with port in use. This is trivial and can be done here: https://github.com/paritytech/polkadot-sdk/pull/5572. I will strike-through this part once we merge the PR.

Steps to reproduce

This will start only the relay chain minimal node's prometheus exporter of dave collator, while eve exporter fails with no way of telling why (and more problematic if not knowing that it should be started, given it can be found out only if reading the code* for the relay chain rpc interface setup).

zombienet -l text --dir 9 -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml

* I am not completely sure if the relaychain rpc interface's prometheus exporter startup is documented clearly outside of code though. (there is a way to pass relaychain args including the prometheus port when starting a polkadot-parachain node)

pepoviola commented 2 months ago

Hi @iulianbarbu, I think you can also pass the --prometheus-port flag to the relaychain part of the arguments (I will fix that in the native provider of zombienet). But also printing the error would be good. Thx!

iulianbarbu commented 2 months ago

Hi @pepoviola ! Thanks for jumping in. Would be great if you can reference this issue on the associated issue/PR on zombienet. I can also take a look/help with it once my bandwidth allows it, if not fixed by then. It is not urgent to be fixed in zombienet either, I've opened this ticket to clarify what's the best way to go around this default 9616 for prometheus exporter of the relay chain rpc interface (when managed through zombienet initially, and to a bigger extent if it applies to other usecases).

Separately, I think I just understood the codepaths related to configuring the prometheus params for the relay chain part. Somehow the relay chain args make their way up to PrometheusParams, where based on another prometheus_port flag the prometheus config can be given a custom port specifically for the relay chain, which is different from the one used by the node for the parachain part. I will update the issue by removing the first point because I think this should be quite obvious in parachain nodes deployments for node devs/operators.

What should remain on this ticket is the the error logging, and opening an issue on zombienet side. Do you agree?

iulianbarbu commented 2 months ago

What should remain on this ticket is the the error logging, and opening an issue on zombienet side. Do you agree?

Seems you already did it. Just the error logging then.

paritytech / polkadot-sdk