nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

Explore profiling tools for pipeline tracking and metrics gathering #2422

Open viniciusdc opened 4 months ago

viniciusdc commented 4 months ago

Research possible frameworks and tools for CI and overall Nebari deployment to gather future insights into how Nebari currently performs.

This task can be worked on in parallel, and we expect to compare notes on the trade-offs and implications of each framework:

Possible options so far (can be extended further):

viniciusdc commented 4 months ago

So, we will start exploring some tools and brainstorming ideas; keeping source code changes to a minimum and preserving maintainability would be nice. (So that our future selves will be happy :smile: )

viniciusdc commented 4 months ago

Generate an outline of our findings:

marcelovilla commented 4 months ago

This issue is the first step towards addressing #2413. If we know what stages/services/resources are taking the longest time to deploy and destroy, we can identify current bottlenecks and work on solutions to improve our CI feedback time.

At a high level, Nebari uses Terraform under the hood to deploy all the infrastructure required to run a Kubernetes cluster, and then uses the Helm provider to deploy and configure different services inside the cluster (e.g., Keycloak, JupyterHub, Dask). At the moment, we only have a rough idea of how long the complete deployment and destruction steps take. Ideally, we should be able to get detailed information about each component involved.
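For illustration, a per-stage breakdown could start as simply as wrapping each stage's deploy/destroy call with a timer. A minimal sketch, where the `record_stage` helper and the stage names are hypothetical and not existing Nebari APIs:

```python
import time
from contextlib import contextmanager

durations: dict[str, float] = {}

@contextmanager
def record_stage(name: str):
    """Record the wall-clock duration of a single stage under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[name] = time.perf_counter() - start

# Hypothetical usage around a stage's deploy step:
# with record_stage("02-infrastructure"):
#     stage.deploy(config)
```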

Here are some relevant considerations before deciding what approach we will implement:

Here are some alternatives I see on how to implement this:

viniciusdc commented 3 months ago

That is a wonderful summary, @marcelovilla, and I completely agree with the alternative approach you came up with. I'll also bring up some considerations regarding your questions, at least in my opinion:

Should this run inside our CI workflows or should we extend it so users can get detailed information on their deployment/destruction duration?

I would like us to start with things that are easily attachable to our code (such as plugins or extensions) and focus the initial work on CI. I found this article about Pyroscope that may help us generate meaningful profiling data without interfering with the code itself: https://pyroscope.io/blog/ci-profiling-with-pyroscope/
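As a rough sketch of that idea (not Pyroscope itself), a CI step could attach an external sampling profiler such as py-spy to the deploy command without touching Nebari's code base. The config path and output file below are placeholders, and since most of the wall-clock time is spent waiting on Terraform subprocesses, per-stage timing may still end up being the more useful signal:

```python
import subprocess

# Profile `nebari deploy` externally; no changes to Nebari's code are needed.
subprocess.run(
    [
        "py-spy", "record",
        "--format", "speedscope",          # flame data viewable at speedscope.app
        "--output", "deploy-profile.json",
        "--",                              # everything after this is the profiled command
        "nebari", "deploy", "-c", "nebari-config.yaml",
    ],
    check=True,
)
```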

What kind of data granularity are we striving for? Is a per-stage breakdown enough or do we want to be able to identify up to individual resources?

That's an excellent question, and I don't think we will have an answer for it right away; I would instead replace it with: What do we expect to get from these tests? Is there any level of information that is valuable for us right now? If yes, how can we measure it?

Do we want to store profiling information over time? Will it be just available for a particular run, or will it be part of some kind of internal one-time exercise?

The only reason to store it that comes to mind would be reporting, which would be nice for presentations... but until we get a good grasp of how to interpret such data, this looks like a low priority to me.

Also, I was expecting simple reporting, for example the execution time of each stage in a final report after the deploy. This has its own value and helps a lot when describing how long to wait.
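To make the "final report" idea concrete, a sketch of what that summary could look like, using made-up stage names and durations purely for illustration:

```python
# Illustrative data only: stage names and durations are placeholders.
durations = {
    "01-terraform-state": 12.4,
    "02-infrastructure": 845.7,
    "03-kubernetes-initialize": 95.2,
    "07-kubernetes-services": 480.9,
}

print(f"{'Stage':<30}{'Duration':>12}")
for name, seconds in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{name:<30}{seconds:>11.1f}s")
print(f"{'Total':<30}{sum(durations.values()):>11.1f}s")
```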

marcelovilla commented 3 months ago

@viniciusdc and I met to further discuss this and see what the next steps might be.

We both agree that the fewer code base changes and dependencies we introduce (at least for now) the better. With this in mind, we decided to leverage the fact that Terraform can produce plain-text output with information about the duration of each stage's creation/destruction. We'll work on making sure we can keep these files in a temporary folder when deploying/destroying Nebari so we can parse them (using tf-profile or our own custom parser) at the end of the Terraform apply process. We'll implement this logic inside a custom plugin using Nebari's extension system. This will allow us to keep both things apart, without introducing changes to Nebari's code base. We still need to decide whether this will run inside our local integration workflow, or also in the cloud provider deployment workflows.
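For reference, a custom parser could be quite small. A minimal sketch, assuming Terraform's usual `<address>: Creation complete after 1m30s [id=...]` log lines (exact wording can vary across Terraform versions, and ANSI color codes may need to be stripped first; tf-profile already covers this more thoroughly):

```python
import re
from collections import defaultdict

# Matches lines such as:
#   module.kubernetes.aws_eks_cluster.main: Creation complete after 9m12s [id=...]
#   aws_s3_bucket.terraform-state: Destruction complete after 2s
LINE_RE = re.compile(
    r"^(?P<resource>\S+): (?P<action>Creation|Destruction|Modifications) complete after (?P<duration>\S+)"
)

def parse_duration(text: str) -> float:
    """Convert Terraform's '1h2m3s'-style durations to seconds."""
    return sum(
        int(value) * {"h": 3600, "m": 60, "s": 1}[unit]
        for value, unit in re.findall(r"(\d+)([hms])", text)
    )

def parse_apply_log(path: str) -> dict[str, float]:
    """Return per-resource durations parsed from a saved terraform apply/destroy log."""
    durations: dict[str, float] = defaultdict(float)
    with open(path) as f:
        for line in f:
            match = LINE_RE.search(line.strip())
            if match:
                durations[match.group("resource")] += parse_duration(match.group("duration"))
    return dict(durations)
```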

Action items:

viniciusdc commented 3 months ago

Plugin work repo https://github.com/nebari-dev/nebari-tf-profile-plugin (POC)