Self test tooling to validate cloud storage

andrwng commented 1 year ago

Who is this for and what problem do they have today?

Anyone trying to deploy Redpanda may find it difficult to validate that they've set up their cloud storage configs/accounts properly.

What are the success criteria?

We should add a test to the existing self-test framework that runs a simple cloud storage workload on every node that exercises everything Redpanda will need to do: put, get, head, list, delete, delete multiple. The output should show for each category of operation whether it was successful, any error code, other interesting metrics (e.g. latency), etc.
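A minimal sketch of what such a workload could look like, written in Python for brevity rather than in Redpanda's own C++. Everything here is hypothetical: `InMemoryStore` stands in for a real object store client, and `OpResult` and `run_cloud_storage_check` are illustrative names, not part of the self-test framework. It runs each category of operation once and records success, an error name, and latency.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


class InMemoryStore:
    """Hypothetical stand-in for a real object store client."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]  # raises KeyError if the key is missing

    def head(self, key: str) -> int:
        return len(self._objects[key])

    def list(self, prefix: str) -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key: str) -> None:
        del self._objects[key]

    def delete_multiple(self, keys: List[str]) -> None:
        for k in keys:
            self._objects.pop(k, None)


@dataclass
class OpResult:
    operation: str
    ok: bool
    error: Optional[str]  # exception name here; a real client would report a status code
    latency_ms: float


def run_cloud_storage_check(store, prefix: str = "self-test/") -> List[OpResult]:
    """Run each category of operation once and record the outcome."""
    payload = b"redpanda-self-test"
    keys = [f"{prefix}obj-{i}" for i in range(3)]
    steps = [
        ("put", lambda: [store.put(k, payload) for k in keys]),
        ("get", lambda: store.get(keys[0])),
        ("head", lambda: store.head(keys[0])),
        ("list", lambda: store.list(prefix)),
        ("delete", lambda: store.delete(keys[0])),
        ("delete_multiple", lambda: store.delete_multiple(keys[1:])),
    ]
    results = []
    for name, fn in steps:
        start = time.perf_counter()
        try:
            fn()
            ok, error = True, None
        except Exception as exc:  # keep going so every category gets reported
            ok, error = False, type(exc).__name__
        latency_ms = (time.perf_counter() - start) * 1000.0
        results.append(OpResult(name, ok, error, latency_ms))
    return results
```

A failure in one category does not stop the run, so a bucket that allows reads but denies writes would still produce a full per-operation report.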

Why is solving this problem impactful?

Today, the only ways to verify the credentials in a cluster work as expected are to:

create a tiered storage topic and watch the logs/metrics for failed uploads
use external tooling per cloud provider that emulates exactly what Redpanda does

Both of these require some specific knowledge of Redpanda, or of the underlying cloud provider, to get right.

Additional notes

To start, this doesn't need to be high throughput or high volume at all, though perhaps those should be tunable -- at a minimum, the goal should be to make it quick and painless to verify that a cluster's credentials are configured properly.

JIRA Link: CORE-1198

graphcareful commented 1 year ago

The existing self-test infrastructure would be great for this kind of task. It was designed to run independent tasks, with no inter-dependencies, on all nodes. The existing disk test works the same way.

To integrate this with what exists today you'd need to:

Write the logic described above in cluster/self_tests
Create a new test-type, and hook up the logic here
Hook up remaining portions to admin-api / rpk.

The self-test-request type is a struct that contains lists of well-defined test parameters, so you'd just need to add another entry there and have rpk, or another admin API client, pass the inputs to it.
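As a rough illustration of "a struct that contains lists of test parameters", here is a Python sketch. The real type lives in Redpanda's C++ code and its fields are not shown in this thread, so every name below (`DiskTestOpts`, `NetworkTestOpts`, `CloudStorageTestOpts`, `SelfTestRequest`, `planned_tests`) and every default value is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiskTestOpts:
    duration_ms: int = 30_000  # illustrative default, not Redpanda's


@dataclass
class NetworkTestOpts:
    duration_ms: int = 30_000  # illustrative default, not Redpanda's


@dataclass
class CloudStorageTestOpts:
    # Hypothetical new entry carrying the cloud storage test's parameters.
    timeout_ms: int = 10_000


@dataclass
class SelfTestRequest:
    disk_tests: List[DiskTestOpts] = field(default_factory=list)
    network_tests: List[NetworkTestOpts] = field(default_factory=list)
    # The "another entry" suggested above: one more list of parameters.
    cloud_storage_tests: List[CloudStorageTestOpts] = field(default_factory=list)


def planned_tests(request: SelfTestRequest) -> List[str]:
    """Return the test types a node would run for this request, in order."""
    plan: List[str] = []
    plan += ["disk"] * len(request.disk_tests)
    plan += ["network"] * len(request.network_tests)
    plan += ["cloud_storage"] * len(request.cloud_storage_tests)
    return plan
```

With this shape, a client only has to populate the new list; requests that leave it empty behave exactly as before.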

daisukebe commented 1 year ago

Let me clarify: does "We should add a test to the existing self-test framework" mean we're supposed to create a tool for internal use at Redpanda? Or is it going to be a kind of application that any user can run before deploying TS?

jcsp commented 1 year ago

"Let me clarify; does 'We should add a test to the existing self-test framework' mean we're supposed to create a tool for an internal purpose at Redpanda?"

The framework is the new self-test that's built into Redpanda clusters since 23.1 internal-wiki://..... spaces/CORE/pages/332300353/Redpanda+Self+Test -- currently it checks network + local disk, this would extend it to also test cloud storage.

jcsp commented 1 year ago

:+1:

This is a great idea. My only extra thought is that it needs a readonly mode and a read/write mode, as a read replica system would usually be configured without write access to a bucket.
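One way to picture the two modes is as a filter over the operations the test attempts. This is only a sketch: the `Mode` enum, the operation names, and `operations_for` are hypothetical and not taken from the self-test code.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    READ_ONLY = "read_only"
    READ_WRITE = "read_write"


# Operations that never modify the bucket.
READ_OPS = ["get", "head", "list"]


def operations_for(mode: Mode) -> List[str]:
    """Pick the operations to exercise for the given mode."""
    if mode is Mode.READ_ONLY:
        # A read replica without write access should only attempt reads.
        return list(READ_OPS)
    # Read/write: upload first so the reads have something to fetch, then clean up.
    return ["put"] + READ_OPS + ["delete", "delete_multiple"]
```

In read-only mode the test cannot upload its own object, so `get` and `head` would need to target a key that already exists in the bucket; how to choose that key is left open here.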

daisukebe commented 1 year ago

"The framework is the new self-test that's built into Redpanda clusters since 23.1"

Aha, that's great! Linking to the public doc for reference: https://docs.redpanda.com/docs/reference/rpk/rpk-cluster/rpk-cluster-self-test/

mattschumpert commented 1 year ago

Great idea indeed. When the test runs, we should indicate exactly which credential/principal is being used and from what source (e.g. access keys/secrets vs. an IAM role found in instance metadata). This would help users debug permissions issues, as well as confusion from accidentally configuring both: access keys in the cluster configs and an IAM role when instance metadata is the credentials source.
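A sketch of the kind of report this suggests, in Python. The config keys (`cloud_storage_credentials_source`, `cloud_storage_access_key`, `cloud_storage_secret_key`) and the `config_file` / `aws_instance_metadata` values follow Redpanda's cluster properties as I understand them, but treat them as assumptions; `mask` and `describe_credentials` are hypothetical helpers. The secret key is never printed.

```python
from typing import Dict, List, Optional


def mask(value: str) -> str:
    """Hide all but the last four characters of an identifier."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def describe_credentials(config: Dict[str, Optional[str]]) -> List[str]:
    """Summarize which credentials a cloud storage test would use, and flag conflicts."""
    # Assumed default when the source is unset.
    source = config.get("cloud_storage_credentials_source") or "config_file"
    access_key = config.get("cloud_storage_access_key")
    has_static_keys = bool(access_key) and bool(config.get("cloud_storage_secret_key"))

    lines = [f"credentials source: {source}"]
    if source == "config_file":
        if has_static_keys:
            lines.append(f"principal: access key {mask(access_key)}")
        else:
            lines.append("warning: source is config_file but no access key/secret is set")
    else:
        lines.append("principal: resolved at runtime from the credentials source")
        if has_static_keys:
            # The accidental "both configured" case described above.
            lines.append("warning: static access keys are also configured but will not be used")
    return lines


if __name__ == "__main__":
    example = {
        "cloud_storage_credentials_source": "aws_instance_metadata",
        "cloud_storage_access_key": "AKIAEXAMPLE1234",
        "cloud_storage_secret_key": "not-a-real-secret",
    }
    for line in describe_credentials(example):
        print(line)
```

Printing this summary ahead of the per-operation results would let a user see at a glance whether the failing principal is the one they expected.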

mattschumpert commented 1 year ago

@dotnwat sizing please

andrwng commented 6 months ago

Validation without actually testing operations against the service will only get us so far. For instance, in the incident that led to this issue, the problem ended up being a storage account not having the correct permissions, and it took a great deal of time to conclude that it wasn't an issue with Redpanda's cluster configs.

That isn't to say validation of the credential configs doesn't also make sense. What specifically do you have in mind?

emaxerrno commented 6 months ago

@githubexplorer38237213271 this feels like an AI bot.

my recommendation is to use BLUF - bottom line up front - when doing online comms.

Your comment has no specific insights and is probably not a fit for the core team.

andrwng commented 6 months ago

@githubexplorer38237213271 thanks for spending the time going through the code base. Note that this issue describes improvements to a diagnostic tool, not adding unit/integration tests. To Alex's point, this ticket should focus on that, rather than debating its utility.

Please channel job inquiries through the appropriate channels.

emaxerrno commented 6 months ago

@emaxerrno

My comment was not written by an AI bot. Is your comment generated by an AI bot? I have five other past GH profiles also.

Apologies if something in my comment offended you.

I spent two plus hours going through all of the test/product code manually.

I was suggested to contribute here as a data point for the team by another team member.

If contributions are unwanted, apologies.

I have also studied computer/data science previously, can share 1 certificate presently if needed.

Outside of this I have lots of past work experience. Is there another team that needs more help outside of the core team?

Yes. Your comment has no insight and reads like it was generated by ChatGPT. My recommendation to BLUF stands.

I’d like to keep the rest of this ticket technical.
