go-vela / community

Community Information for Vela (Target's official Pipeline Automation Framework)
https://go-vela.github.io/docs/
Apache License 2.0

add official support for pipelines #460

Closed: jbrockopp closed this issue 2 years ago

jbrockopp commented 2 years ago

Description

I propose that pipelines become a "first class citizen" in Vela. The concept is simple in nature, but the implementation would be substantial.

I can't pretend to know all the required steps to accomplish this but I'll list what I can think of:

  1. add a pipelines table to the database with a number of fields:

    • org - (string) the SCM org the Vela pipeline was obtained from
    • repo - (string) the SCM repo in the org the Vela pipeline was obtained from

    NOTE: The above fields could be replaced with repo_id like we do for the builds table

  2. add a pipeline_id field (int64) to the builds table that matches the id field from the pipelines table

  3. add a Pipeline{} type to the github.com/go-vela/types/database package

  4. add a Pipeline{} type to the github.com/go-vela/types/library package

  5. update existing API endpoints (/api/v1/pipelines) to support CRUD for pipelines

  6. update existing CLI commands (vela <action> pipeline) to integrate with the API endpoints

  7. update existing API (/webhook) endpoint to create pipeline for the ref from the repo

  8. update existing API (/api/v1/repos/:org/:repo/builds) endpoint to create pipeline for the ref from the repo

Value

  • less API calls to the SCM
  • enable auto-scaling for workers based off pipeline metrics
  • enable data collection and retention on pipelines
  • enable Vela admins to determine what customers are using the version YAML key
  • enable Vela admins to determine what customers are using the worker YAML key
  • enable Vela admins to determine what customers are using the templates YAML key

Definition of Done

Vela handles pipelines like it does other data objects, i.e. builds, repos, etc.

Effort (Optional)

2 weeks

Impacted Personas (Optional)

everyone

wass3r commented 2 years ago

should this be a proposal?

jbrockopp commented 2 years ago

@wass3r could you elaborate on that?

I ask because we don't really have any standards or guides on when we should or shouldn't write a proposal.

In the past, I've written proposals when decisions need to be made that impact the implementation for the functionality.

I didn't start with a proposal for this because AFAICT, there aren't really options to pick from for this work.

It's more about writing all the code necessary to deliver on this effort.

jbrockopp commented 2 years ago

A PR has been opened for this:

https://github.com/go-vela/types/pull/223

wass3r commented 2 years ago

I agree that the readme in the proposals directory is a bit vague with "... record of all potential features and modifications to be made to Vela." and by that definition it should have a ton more content. We can, and probably should, be more specific.

I've always viewed Proposals as RFCs for "bigger" features (which includes making decisions but isn't limited to that), so maybe that's my misinterpretation and a bit of word-smithing on what Proposals are and when they're needed is helpful.

This just seems like a substantial addition/modification without any prior discussion that I'm aware of (and I'm sorry if I missed it).

We can have that discussion here though for now. I have questions around the value proposition. Specifically, how the following value statements will be realized by this addition:

jbrockopp commented 2 years ago

| I agree that the readme in the proposals directory is a bit vague with "... record of all potential features and modifications to be made to Vela." and by that definition it should have a ton more content. We can, and probably should, be more specific.

| I've always viewed Proposals as RFCs for "bigger" features (which includes making decisions but isn't limited to that), so maybe that's my misinterpretation and a bit of word-smithing on what Proposals are and when they're needed is helpful.

Yeah, I think if that's the path we want to go down then more details should be added on proposals πŸ‘

To date, it hasn't been like that and has been more about the decision making process and providing options.

| This just seems like a substantial addition/modification without any prior discussion that I'm aware of (and I'm sorry if I missed it).

AFAIK, there hasn't been any external discussions on this feature so you didn't miss anything πŸ˜ƒ

However, we've had discussions internally on this and it finally bubbled up to this story.

FWIW:

In the past, there have been major modifications made to Vela without proposals 🀷

| We can have that discussion here though for now. I have questions around the value proposition.

Yeah that's a good call and is in the spirit of being an open source product πŸŽ‰

I'm happy to answer your questions πŸ˜„

  | • less API calls to the SCM

  1. Today, when someone restarts a build, this leads to a few calls made to the SCM:

    - fetch the changeset for the build
    - fetch the Vela pipeline config for the build - can be multiple calls if an error is returned from the SCM

    If we stored pipelines in the database, I believe we could remove these calls to the SCM πŸ‘

  2. Additionally, today Vela supports a suite of API endpoints for pipelines.

    Most, if not all, of these endpoints send calls to the SCM to fetch the pipeline configuration.

    If we stored pipelines in the database, I believe we could remove these calls to the SCM πŸ‘

Unfortunately, I don't have any data or analytics to support how many calls to the SCM we'll be removing 😞

  | • enable auto-scaling for workers based off pipeline metrics

Internally, we use Vela's build queue and metrics API endpoints for autoscaling workers.

Recently, we've started to need specialized worker configurations internally for various reasons πŸ˜…

Vela's worker YAML key works well to accomplish this use case for the users of the product.

However, we don't have a great way of identifying what builds are pinning to these workers without scraping the SCM.

This makes it very difficult for us to properly autoscale these customized workers until this information is available.

If we stored pipelines in the database, I believe we could easily identify what builds are pinning to these workers πŸ‘

This would lead to us being able to properly autoscale workers with the customized configuration πŸ˜ƒ

wass3r commented 2 years ago

We'll try and clear up any potential confusion about proposals and try to be better on the consistency aspect.

Re:

| 1. Today, when someone restarts a build, this leads to a few calls made to the SCM:
|     - fetch the changeset for the build
|     - fetch the Vela pipeline config for the build - can be multiple calls if an error is returned from the SCM
|
| If we stored pipelines in the database, I believe we could remove these calls to the SCM πŸ‘
|
| 2. Additionally, today Vela supports a suite of API endpoints for pipelines.
|
| Most, if not all, of these endpoints send calls to the SCM to fetch the pipeline configuration.
|
| If we stored pipelines in the database, I believe we could remove these calls to the SCM πŸ‘

I'm all for going easy on the SCM. Without data to back me up (like you said as well), I'd venture to guess that the impact for these scenarios would be minimal.

It seems like the core driver then is related to autoscaling.

| If we stored pipelines in the database, I believe we could easily identify what builds are pinning to these workers πŸ‘

Couple of questions/concerns there:

  1. Is the idea to continuously poll and parse pipelines and react to them somehow?
  2. Any concern about "data rot"? In most cases only the latest version of the pipeline is needed and most previous versions are more or less useless.
  3. What alternatives were considered?
     a. could we amend the Build data structure with a route or worker field?
     b. or have vela send webhooks to a configurable, separate endpoint that deals with scaling when new work comes in?
     c. others?

I know there are many ways to do proactive and reactive autoscaling, but I think they typically involve looking at worker idle count (and/or scale factor) in some way (which could be done outside of Vela) and the queue state, which it sounds like you are more or less utilizing. You could get a little more info on the queue items with 3a. above, which seems much more lightweight than the proposed solution, but may not hit all your checkboxes?

Hope the questions make sense. Just trying to understand better.

jbrockopp commented 2 years ago

| I'm all for going easy on the SCM. Without data to back me up (like you said as well), I'd venture to guess that the impact for these scenarios would be minimal.

I suppose it's all relative to the company's Vela installation and how it's used 🀷

Restarting a build is a very common need in Vela based off the users I've seen using it πŸ˜…

Also, we've been progressively getting more feature-rich with pipelines and support for them πŸ‘

As our needs and use cases with pipelines continue to grow, so will the cost to the SCM.

| It seems like the core driver then is related to autoscaling.

I'm not sure that's accurate based off the points under the Value section in the original comment πŸ˜…

At least, I feel that this functionality provides more value to the product than solely for autoscaling purposes.

| 1. Is the idea to continuously poll and parse pipelines and react to them somehow?

It's potentially too soon to tell since we haven't fully implemented that functionality yet 😞

That being said, the idea we tossed around was being able to query the Vela database.

Ideally, we can get a list of running/pending builds and the worker pipeline info for these builds πŸ˜ƒ

If this was backed by a pipelines table, it would involve a JOIN between the tables to gather this information.

Then, we could use that information accordingly to know which pools of workers we need to autoscale πŸ˜„

| 2. Any concern about "data rot"? In most cases only the latest version of the pipeline is needed and most previous versions are more or less useless.

Always 😞 Unfortunately, that's a problem for most, if not all, resources that we haven't addressed yet.

You can apply these concerns to almost all of the resources we currently store and maintain today.

| 3. What alternatives were considered?

We came up with a couple different options and ultimately ended up on this one πŸ‘

| a. could we amend the Build data structure with a route or worker field?

Yes, that could be done and would help solve our needs for autoscaling πŸ˜ƒ

It was one of the options we discussed, but it doesn't address all the value statements this functionality delivers on.

| b. or have vela send webhooks to a configurable, separate endpoint that deals with scaling when new work comes in?

We never really talked about this one but I suppose it's certainly possible πŸ‘

Although, I'm not excited about the idea of requiring another external service to send webhooks to for Vela.

| c. others?

The other option we discussed was to tax the SCM and lean into the existing functionality we have today for autoscaling.

Since we have the ref from the build, we could use that with the org and repo to fetch the pipeline from the SCM.

However, this felt like the wrong approach since it would incur even more API calls to the SCM 😞

| I know there are many ways to do proactive and reactive autoscaling, but I think they typically involve looking at worker idle count (and/or scale factor) in some way (which could be done outside of Vela) and the queue state, which it sounds like you are more or less utilizing. You could get a little more info on the queue items with 3a. above, which seems much more lightweight than the proposed solution, but may not hit all your checkboxes?

Correct, 3a does not provide all of the value that we're looking to deliver on with this feature.

wass3r commented 2 years ago

Sorry for delayed follow-up. I was out for a bit.

| b. or have vela send webhooks to a configurable, separate endpoint that deals with scaling when new work comes in?

| We never really talked about this one but I suppose it's certainly possible πŸ‘
|
| Although, I'm not excited about the idea of requiring another external service to send webhooks to for Vela.

That must be specific to your installation? I'm not aware of Vela sending any hooks by default? I was also thinking that this would not be enabled by default since there are other ways to scale workers.

| The other option we discussed was to tax the SCM and lean into the existing functionality we have today for autoscaling.
|
| Since we have the ref from the build, we could use that with the org and repo to fetch the pipeline from the SCM.
|
| However, this felt like the wrong approach since it would incur even more API calls to the SCM 😞

Agreed.

Re:

| It seems like the core driver then is related to autoscaling.

| I'm not sure that's accurate based off the points under the Value section in the original comment πŸ˜…

And:

| a. could we amend the Build data structure with a route or worker field?

| Yes, that could be done and would help solve our needs for autoscaling πŸ˜ƒ
|
| It was one of the options we discussed, but it doesn't address all the value statements this functionality delivers on.

I haven't brought the other stated value points up thus far. Mostly because it's not clear to me how they add value the way they are presented (or maybe I, and maybe others, just need extra help):

It's sort of saying (using the version key as an example): we will collect the version key and the value is we now have the version key. What is the value that provides? Maybe there was/is more discussion or context around these that clarifies?

That's partially why I focused on the other points where the intended value proposition is more clearly spelled out. Specifically auto-scaling as I still tend to think that reducing SCM calls in the ways described will have a relatively minor impact (but I could get onboard with "every little thing helps").

Consequently, for mostly providing easier ways to auto-scale, the proposed changes seem disproportionately complex. Not trying to be a stick in the mud, but making sure I, and maybe others, understand what this change will provide since it is a "substantial" addition.

One other clarifying question, and I may know the answer. data would contain the expanded pipeline?

jbrockopp commented 2 years ago

| b. or have vela send webhooks to a configurable, separate endpoint that deals with scaling when new work comes in? We never really talked about this one but I suppose it's certainly possible πŸ‘ Although, I'm not excited about the idea of requiring another external service to send webhooks to for Vela.

| That must be specific to your installation? I'm not aware of Vela sending any hooks by default? I was also thinking that this would not be enabled by default since there are other ways to scale workers.

Yeah, I meant the external modification service which is embedded in the compiler πŸ˜…:

https://go-vela.github.io/docs/administration/server/reference/compiler/

I misspoke when I said "send webhooks to"... it's more accurate to say we send webhook data to that service πŸ‘

Re:

| It seems like the core driver then is related to autoscaling. I'm not sure that's accurate based off the points under the Value section in the original comment πŸ˜…

And:

| a. could we amend the Build data structure with a route or worker field? Yes, that could be done and would help solve our needs for autoscaling πŸ˜ƒ It was one of the options we discussed, but it doesn't address all the value statements this functionality delivers on.

| I haven't brought the other stated value points up thus far. Mostly because it's not clear to me how they add value the way they are presented (or maybe I, and maybe others, just need extra help):

  | • enable data collection and retention on pipelines
  | • enable Vela admins to determine what customers are using the version YAML key
  | • enable Vela admins to determine what customers are using the worker YAML key
  | • enable Vela admins to determine what customers are using the templates YAML key

| It's sort of saying (using the version key as an example): we will collect the version key and the value is we now have the version key. What is the value that provides? Maybe there was/is more discussion or context around these that clarifies?

Data collection and retention is mainly an overarching point of the other pieces, but I'll attempt to elaborate πŸ˜ƒ

Retention

The value behind retention of pipelines is more obscure in nature if we ignore the reduction of API calls to the SCM.

Barring that, I've only seen it valuable in scenarios related to security events πŸ˜…

To summarize, there have been times where the Security org for an enterprise has requested to be provided with the exact pipeline that was executed for a specific build. At that time, no such functionality was present in Vela, so we were unable to grant the request directly. We had to explore alternatives, which worked, but it certainly wasn't a great solution.

Collection

On the other hand, data collection feels easier to illustrate because discussions have occurred in the past on this πŸ˜ƒ

By collecting pieces of info on pipelines, I might know where to best spend my efforts as an admin of an installation πŸ‘

For example, if no repos are using templates in their pipelines then perhaps that's a technical gap I can attempt to solve?

The methods used to solve that gap will vary in nature (e.g. code contribution, docs, enterprise provided templates etc.)

But, by knowing what users are and aren't using, I can prioritize my time based off my installation's usage πŸ˜„

This same approach can be reused for other resources we're collecting data on (e.g. services, stages etc.)

The value of knowing what pipelines are using the worker key actually came up in my previous role πŸ˜…

The two companies I've seen running Vela as an enterprise offering have set up their clusters pretty similarly:

Both companies have the specialized worker pool be significantly smaller due to cost and resource demand πŸ‘

Unfortunately, Vela doesn't offer any kind of mechanism to prevent repos from using the worker key.

Additionally, Vela doesn't offer any mechanism to even show what repos are using the worker key.

This leads to a bad experience for both admins and users if we can't see this information (i.e. not enough workers available).

I'll admit the version key doesn't provide much value today since everyone should be using 1 😞

However, when we attempt to explore adding a version 2 syntax I would see this as valuable.

I can see a new version being required when something like https://github.com/go-vela/community/issues/9 gets implemented again πŸ˜…

| One other clarifying question, and I may know the answer. data would contain the expanded pipeline?

Making the assumption that when you say expanded, you are referring to templates, the answer is no 😞

The raw pipeline, meaning whatever raw file that's fetched from the source provider, will be stored in the data field πŸ‘

This means the data field will include the contents of one of the following files (depends on configuration):

kneal commented 2 years ago

I think all these discussions are super valuable, but I'm a little curious what the actual concern is with the feature? Reading the thread, the concern sounds like just general complexity? But I can't quite pinpoint from the thread why we wouldn't want this feature.

| enable Vela admins to determine what customers are using the X YAML key

I do think a lot, if not most, of the value for this point is to allow us, admins, to improve the experience for users via a metrics approach, because we finally have a simplified way of getting that data. How admins choose to get that data, whether it's through the metrics endpoint or a new admin one, I don't think matters. The fact is that they finally can get the data that can be valuable for runtime or feature decision-making around prioritization. Also, the data is not exclusively valuable for seeing what YAML tags are popular but even for seeing data patterns in the values at those tags. For example, maybe the default value we picked for pull_policy is actually what customers think is popular or valuable. Or maybe even it shows a customer education point we never could have known about before.

| enable auto-scaling for workers based off pipeline metrics

I honestly just see this as a data point to the above details. So, I don't really think it's something to focus on IMO. I think if @jbrockopp's installation chooses to focus on that data that's valuable for them but maybe not for others since they don't autoscale.

| less API calls to the SCM

I think this just logically makes sense, and we've had conversations in the past with teams about use cases around pipelines that don't even involve an SCM. While I think those use cases might be a bit further down the road, this type of design could really set us up for success in this area when it comes to accomplishing those workflows.

Would also be curious how we could improve the /api/v1/pipeline endpoints with native data. Those also depend on SCM calls.

jbrockopp commented 2 years ago

A PR has been opened for this:

https://github.com/go-vela/server/pull/574

wass3r commented 2 years ago

@jbrockopp thanks for the additional context and patience. Super helpful. I think there was a lot hidden in those few bullet points. Glad some of it was spelled out more.

| One other clarifying question, and I may know the answer. data would contain the expanded pipeline?
|
| Making the assumption that when you say expanded, you are referring to templates then the answer is no

Is this because templates have a life of their own, more or less, and unless we are so lucky that a user pinned it to a specific commit, it wouldn't make sense to cache it? Or is there another reason? In the spirit of saving calls to the SCM, this seems like a good additional win. I'm definitely aware of numerous pipelines with multiple templates in play. Might need some prerequisite work for it to work that way though. 😞

If we ever wanted to go down that path, maybe it makes sense to add a field now... something like expanded (boolean), so we know we have to fetch templates. I guess we could only set that to true if all templates were pinned somehow?

Another question with regard to the data field and I would be ok with: we will deal with it when we get there (if we do). There was a feature request (I couldn't find it on the github.com board though) to support being able to define your pipeline through more than one file, sort of like what GitHub Actions and other CIs allow you to do. Would you envision this design changing if that were to be in place?

@kneal

| I think all these discussions are super valuable but I'm a little curious what the actual concern is with the feature? [..] But I can't quite pinpoint why we wouldn't want this feature

I can see how it might look like there is concern with the overall feature, though I don't think I ever said I don't want the feature.

When a "substantial" feature is being pushed through out of the blue, without much context or history, we would be doing a disservice if we didn't give it some scrutiny, especially when the value proposition is not apparent or seemingly off balance with the amount of work proposed. In this case, I think the value-add was maybe just undersold and I'm thankful that both Jordan and you added some color. It was helpful to see some examples on how how the data might be used. It helps strengthen the value proposition. Also appreciated the other thoughts and context that was driving the feature in the first place.

I will admit - I'm not super excited about creating a file repository in the DB, although it will be "small fries" compared to logs. (however, I have seen some lengthy pipelines that some might mistake for logs πŸ˜†). I don't know if there is a great alternative at the moment, however. Getting at the data en masse, especially from the data field, seems a little cumbersome, so having some select data bubble up to be more easily query-able is nice πŸ‘πŸΌ

Speaking of that, @jbrockopp - what about adding a boolean on whether the pipeline uses secret plugin(s)? I wouldn't do it for plugins generally, but secrets are a little special, I think. If for nothing else, it would be nice to understand usage for security purposes and be helpful for filtering out pipelines to further investigate etc.

| Would also be curious how we could improve the /api/v1/pipeline endpoints with native data. Those also depend on SCM calls.

Good call πŸ‘πŸΌ I imagine we would leverage this "cache" of pipelines wherever possible.

jbrockopp commented 2 years ago
| One other clarifying question, and I may know the answer. data would contain the expanded pipeline?
|
| Making the assumption that when you say expanded, you are referring to templates then the answer is no
|
| Is this because templates have life of their own, more or less, and unless we are so lucky that a user pinned it to specific commit, it wouldn't make sense to cache it? Or is there another reason? In the spirit of saving calls to the SCM, this seems like a good additional win. I'm definitely aware of numerous pipelines with multiple templates in play. Might need some prerequisite work for it to work that way though. 😞
|
| If we ever wanted to go down that path, maybe it makes sense to add a field now.. something like expanded (boolean), so we know we have to fetch templates. I guess we could only set that to true if all templates were pinned somehow?

Yeah, templates having a lifecycle of their own is part of the reason why we don't store them in the data field πŸ˜…

Another reason has to do with the modification service which is embedded in the compiler 😞

Today, when a build is restarted in Vela, we fetch the pipeline from the repo and compile it πŸ‘

During this, the templates are fetched and expanded, and the pipeline is sent to the modification service.

The problem is that both the templates and the modification service could have changed after that build ran 😬

So imagine running a build, then changing either the templates or modification service and restarting that build.

You can end up with different behavior which may not be great, but it is the way we have things setup today πŸ˜ƒ

Like you said, we could add more complex logic to determine when we do or don't store the pipeline expanded under data.

However, this functionality feels out of scope for MVP and probably better suited as an enhancement πŸ˜…

| Another question with regard to the data field and I would be ok with: we will deal with it when we get there (if we do). There was a feature request (I couldn't find it on the github.com board though) to support being able to define your pipeline through more than one file, sort of like what GitHub Actions and other CIs allow you to do. Would you envision this design changing if that were to be in place?

Unfortunately, I'm not sure I have a great answer at this time, since the implementation could change the behavior 😞

For example, let's assume we allow multiple pipeline files to be stored in a .vela folder in the root of the repo.

Then, I see two potential options on how we could implement this with the pipelines table:

  • each file content would be a unique row with a data entry in the pipelines table
  • all files would be "merged" into a single document and that file is stored as the data entry in the pipelines table

That being said, I'm sure there are more options we could utilize and explore for this so I'm not sure πŸ˜…

| I will admit - I'm not super excited about creating a file repository in the DB, although it will be "small fries" compared to logs. (however, I have seen some lengthy pipelines that some might mistake for logs πŸ˜†). I don't know if there is a great alternative at the moment, however. Getting at the data en masse, especially from the data field, seems a little cumbersome, so having some select data bubble up to be more easily query-able is nice πŸ‘πŸΌ

Yeah, since we'll be compressing the data field before storing it in the database, it should be fine πŸ‘

The largest pipeline I've ever seen was ~ 15k lines which would be a drop in the bucket compared to logs I've seen πŸ˜…

| Speaking of that, @jbrockopp - what about adding a boolean on whether the pipeline uses secret plugin(s)? I wouldn't do it for plugins generally, but secrets are a little special, I think. If for nothing else, it would be nice to understand usage for security purposes and be helpful for filtering out pipelines to further investigate etc.

So the idea is to add a field to the pipelines table that tracks if someone uses the origin key under secrets? πŸ‘

If so, I don't have any concerns with adding that in πŸ˜„ honestly I'd probably end up adding fields to track both:

  • internal_secrets - (boolean) - indicates if the pipeline contains internal secrets
  • external_secrets - (boolean) - indicates if the pipeline contains external secrets

wass3r commented 2 years ago

| Another reason has to do with the modification service ...

Ah yes, of course. Forgot about that 🀦🏼

| Like you said, we could add more complex logic to determine when we do or don't store the pipeline expanded under data.
|
| However, this functionality feels out of scope for MVP and probably better suited as an enhancement πŸ˜…

Fair enough. Was just trying to squeeze out more SCM API call savings.

| Then, I see two potential options on how we could implement this with the pipelines table:
|
|   • each file content would be a unique row with a data entry in the pipelines table
|   • all files would be "merged" into a single document and that file is stored as the data entry in the pipelines table
|
| That being said, I'm sure there are more options we could utilize and explore for this so I'm not sure πŸ˜…

Sure, I was mostly curious to see if we can avoid a major overhaul that might be more painful to do after the fact. Something to keep in mind I guess. I agree that even if the multiple file scenario was in place, we should be able to accommodate with this design.

| If so, I don't have any concerns with adding that in πŸ˜„ honestly I'd probably end up adding fields to track both:

  | • internal_secrets - (boolean) - indicates if the pipeline contains internal secrets
  | • external_secrets - (boolean) - indicates if the pipeline contains external secrets

I like it. Thanks!

jbrockopp commented 2 years ago

A PR has been opened for this:

https://github.com/go-vela/server/pull/615/

wass3r commented 2 years ago

Could definitely be an add-on enhancement, but mentioning it here in case we wanted to incorporate a piece of it in the DB now.

Not sure if there's been discussion on this, but what about capturing and storing the ETag for the configuration from the SCM (for those that support it)? We could then make a conditional request (in GitHub, for example: https://docs.github.com/en/rest/guides/getting-started-with-the-rest-api#conditional-requests) to see if the pipeline changed and, if not, use the previous pipeline's data.

It wouldn't be efficient in all scenarios, but arguably a common scenario is: many initial pipeline changes that taper off into mostly non-pipeline changes. It doesn't seem to make sense to keep asking for the same pipeline when we already have it.

wass3r commented 2 years ago

Has all the work for this been completed then? Can we close this?

jbrockopp commented 2 years ago

@wass3r I left it open because I haven't completed the documentation for this yet.

However, I could be on board with the idea of closing this since the actual functionality is there.

I have some docs changes locally that I'm working on so I hope to get a PR submitted this week for that.

jbrockopp commented 2 years ago

@wass3r https://github.com/go-vela/docs/pull/300 should close this issue when merged πŸ‘