
Feature: pipeline configuration from source control #1133

Closed - tomzo closed this issue 8 years ago

tomzo commented 9 years ago

This feature made it! The documentation in GoCD is here.

I have collected the most notable information from the comments below so that no one has to read it all again to get an idea of what has changed in the system and what is expected from it.

Overview of the feature

I have prepared example config repositories. In order of complexity:

  1. https://github.com/tomzo/gocd-main-config contains the main cruise XML configuration, the one stored in /etc/go/cruise-config.xml
  2. https://github.com/tomzo/gocd-indep-config-part - XML configuration part with no external references.
  3. https://github.com/tomzo/gocd-refmain-config-part - XML configuration part that refers to pipelines from main.
  4. https://github.com/tomzo/gocd-refpart-config-part - XML configuration part with references to other configuration part repository
  5. https://github.com/tomzo/gocd-json-config-example - JSON configuration part

In the main config https://github.com/tomzo/gocd-main-config there is a config-repos branch with config-repo sections to import elements from the other repositories.

Domain and concepts

Configuration repository

A configuration repository is a source control repository of any kind that holds part of the GoCD configuration. So far we have referred to this as a config-repo or a partial. However, 'partial' should really be reserved for the configuration object, while the repository is the remote code, yet to be fetched and parsed.

ConfigOrigin

Tells where some part of the configuration comes from. It was necessary to add because some services now need this extra information to operate. There are 3 types of configuration origins:

There are 2 scopes of configuration:

These are important at the system level because we consider validity twice: first at the base scope, then at the merged scope.

Behavior and assumptions

When a pipeline is defined in a configuration repository, there are always 2 cases which define how the Go server should behave.

When the configuration repository that defines the pipeline is the same as one of its materials

In automated builds we expect that when a pipeline is triggered with material at revision C1, the configuration of the pipeline will be from the same commit - C1. There is a small (unavoidable) inconsistency here: when a few quick commits (C1, C2, C3) are made that change the pipeline configuration, Go may pick them up faster than it finishes already-running builds (e.g. the configuration has been updated to C2 while stages on C1 are still running). This may lead to failing a build that would have passed if the commits had been slower. However, IMO this is good after all; quick commits are usually made because somebody wanted to fix the previous configuration. There is no way to avoid it because only one pipeline configuration can exist at a time.

In manually triggered builds Go always fetches materials first, which may change the configuration of the pipeline that we just triggered.

In timer-triggered builds Go also fetches materials first, which may change the configuration of the pipeline that is being triggered.

When the configuration repository is not one of the pipeline's materials

This case is much less complex. Go is always polling for changes in configuration repositories and tries to merge them into the current configuration. The rules are the same as if the incoming changes were made from the UI.

Failures

Hung material

What happens when one material polling gets hung:

When the plugin fails, or the configuration has an invalid format, or migration fails in the configuration repo checkout, then the material update completes but the config partial remains old.

How to handle merging configuration parts and main configuration?

  1. Merges are done at object level. (Meaning first the main XML and all repositories are parsed to create BasicCruiseConfig and PartialConfig instances, then an aggregate object is created - a BasicCruiseConfig with a merge strategy. See the sketch at the end of this section.)
  2. According to the rules written below

    Environments

    Pipelines in environment

Most liberal approach possible:

Most liberal approach possible:

There could be optional overrides but we can consider it future work.

Pipelines

Authorization can only be in the main XML, so it cannot conflict when merging.
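
To illustrate the object-level merge described above, here is a minimal sketch. Everything except the BasicCruiseConfig and PartialConfig names is invented for illustration; the real model is much richer.

import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the real config classes, for illustration only.
class PipelineGroup {
    String name;
    PipelineGroup(String name) { this.name = name; }
}

class PartialConfig {
    List<PipelineGroup> groups = new ArrayList<>();
}

class BasicCruiseConfig {
    List<PipelineGroup> groups = new ArrayList<>();
}

class ConfigMerger {
    // First the main XML and every config repository are parsed into
    // BasicCruiseConfig and PartialConfig instances; only then is the
    // aggregate assembled and validated at the merged scope.
    static List<PipelineGroup> merge(BasicCruiseConfig main, List<PartialConfig> partials) {
        List<PipelineGroup> merged = new ArrayList<>(main.groups);
        for (PartialConfig partial : partials) {
            merged.addAll(partial.groups);
        }
        validateMergedScope(merged); // cross-part references checked here
        return merged;
    }

    static void validateMergedScope(List<PipelineGroup> merged) {
        // base-scope validity was already checked per part; the merged
        // scope checks name uniqueness and references between parts
    }
}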

System

Some notes about changes in how Go services work and what happens when configuration repositories are present.

Services

Here is a summary of new services layout:

Below GoConfigService

The best analogy to get the whole point here is that MergeGoConfig has replaced the old CachedGoConfig. CachedGoConfig used to hold 2 instances of the configuration in memory (config for edit and current config). Now MergeGoConfig holds these two, but the main difference is that MergeGoConfig may return a merged configuration as the current config or as the config for edit. If there are no extra configuration parts then it returns the main configuration.
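
A rough sketch of that layout, with invented method names (only MergeGoConfig, CachedGoConfig and the two-instances idea come from the text above):

import java.util.List;

// Illustrative stand-ins; the real GoCD classes are much richer.
interface CruiseConfig {}
class PartialConfig {}

class MergeGoConfig {
    private CruiseConfig main;            // loaded from cruise-config.xml
    private List<PartialConfig> partials; // parsed from config repositories

    CruiseConfig currentConfig() {
        // merged view when partials exist, plain main config otherwise
        return partials.isEmpty() ? main : merged();
    }

    CruiseConfig configForEdit() {
        // the same rule applies to the config for edit
        return partials.isEmpty() ? main : merged();
    }

    private CruiseConfig merged() {
        // assemble the aggregate (main + partials) described above
        return main; // placeholder
    }
}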

Above GoConfigService

This is implemented mostly as we discussed here.

New material update queue

Added a new component - ConfigMaterialUpdater - which listens on the config-material-update-completed topic. So when MDU is done, ConfigMaterialUpdater gets its chance to work with the material being updated:

The checkouts (in pipelines/flyweight) are NOT done/updated by the standard material pollers when updating the database (MDU).

But now there is a new type of poller that creates a full checkout on each update. These directories are then read and parsed by configrepo plugins.

Handling edits

The merged cruise config is returned for edits. When some service is editing the config, it does not know whether the config is merged or not. It does not have to know.

Adding

When a method to add a pipeline or environment is called, it reaches the merged cruise config at some point. The merged config is aware that we meant to add to the main part and changes the main config instance (held inside the merged cruise config instance).

Removing

Removing is like adding: we can localize where to remove from. If a user tries to remove a remote element, it fails; usually it would fail in the cruise config code.

Modifications

Modifications get complex because there are many ways in which they are introduced. This is where there is a real benefit from returning a merged config instance: changes are made on the config instance in the full merged context, so that anything invalid that is attempted will throw, e.g. trying to change the name of a pipeline group defined remotely.

Saving changes

Each config edit ends with an attempt to save some config-for-edit instance (or a deep clone of it, or a clone of a clone, etc.). To deal with that, the magical writer is aware that a merged config might be passed in to be serialized. If so, it takes out only the locally defined configuration elements. The actual extraction of local elements is implemented in config-api and is very easy, because we keep and maintain the main configuration instance inside the merged config anyway.
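
A minimal sketch of that save path, assuming hypothetical method names isLocal() and localPartsOnly() (only the idea of extracting the local elements comes from the text above):

// Illustrative only; the real writer is reflection-based ('magical').
interface CruiseConfig {
    boolean isLocal();             // true when nothing came from config repos
    CruiseConfig localPartsOnly(); // the main config instance kept inside the merge
}

class ConfigSaver {
    void save(CruiseConfig config) {
        // a merged config may arrive here from any edit flow; only the
        // locally defined elements may reach cruise-config.xml on disk
        CruiseConfig toSerialize = config.isLocal() ? config : config.localPartsOnly();
        writeXml(toSerialize);
    }

    private void writeXml(CruiseConfig config) {
        // serialize the local configuration to cruise-config.xml
    }
}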

Pull requests

These are either merged or planned pull requests to make all of the above work:

Being a big fan of keeping all project-related code in its source code repository, I would really like to be able to declare pipeline configuration in the source code of each individual project instead of in the global cruise-config.xml. Many people will agree that each project's code should know how to build and test itself. Following this concept, it should also know how to CI itself.

Problem

Currently, when all Go configuration is in a global configuration file on the server, we basically end up with 2 sources of project configuration - one being the git repository, the other a file on the Go server. There are lots of scenarios where new changes in the git repo will cause the build to break because they expected a different pipeline configuration - or rather, the pipeline configuration expected the older git repo contents.

Concept

In order to avoid such conflicts, the <pipeline> section should probably never be in the global cruise-config.xml; instead the go-server should configure pipelines after polling from source repositories.

Final notes
tomzo commented 9 years ago

Thank you for feedback and answers. I'll have to think on those migrations some more. [I'll respond to the rest of questions in the morning]

jyotisingh commented 9 years ago

Long thread! Apologies for being really late to the party. At this point I am just trying to understand the proposed solution and have a few questions regarding the implementation - just my two cents.

Haven't you noticed that updating pipeline configuration triggers a build just like it would in case of a new commit in a material? I think this is a hidden symptom of a not entirely true domain model.

Not entirely true. A new build would be triggered only when a change is made to the materials that the pipeline depends upon. Adding a new stage or editing an existing job does not trigger a new build. Making the config a material for the pipeline would cause the pipeline to trigger for all possible changes in the pipeline config, unless you also intend to implement some sort of filter/blacklist for config-materials. This is a deviation from what happens today.

Also, a pipeline can have 'n' number of materials defined. Now that we treat the config-repo as another material for the pipeline, there is no way to dictate which material's update happens first. For example, consider a pipeline that depends on a config-repo (say git) and a TFS repo. Let's say I committed to the config-repo and also made a commit to the TFS repo. If the TFS repo is picked for an update first, the pipeline would be scheduled with that commit. In the next cycle the config-repo gets updated, and the pipeline triggers again. There is a wasted build here. If we change the code to not build again using the same revision, then the pipeline would not run with the updated config. Did I make any sense there?

A rerun of a stage or job will re-use the config for that time. … Either way tying code to config in the same repository should solve the problem we were talking about (scm-consistency).

This was mentioned in the context of stage/job rerun. What happens in case of trigger-with-options? Do we use the version associated with the chosen revision or would it be the latest config? What happens in case of timer trigger? Does the build run with latest config or the one associated with the revision with which the build would be triggered? The latter could have unintended repercussions.

Anyway my recommendation to using this feature is to keep all inter-dependent pipelines in single config repository so that they can be fully validated before pushing.

Would all such pipelines be stored as separate files in the repo, or do they all go in the same file? Inter-dependencies can get massive in larger setups, and @mrmanc is not alone in having ~700 inter-dependent pipelines.

What we can do, from a Go config migration side is to try and provide good defaults, so that a config does not necessarily become invalid automatically. That's the best I can see us being able to do

I am not sure about what can be considered as a good default for a material or stage config, or a fetch-artifact for that matter.

In general, I tend to side with @mrmanc on maintaining the integrity of Go's config, as not doing so will lead to many cases to be handled, such as scheduling, fan-in/fan-out, VSM, APIs etc. Also, I am wary of the config-repo-as-material concept, or even switching between config versions based on the selected revision, as I think it will introduce many unnecessary scenarios which do not concern the original goal. Externalising the config for a pipeline, be it to a file-system or an external repo, seems like something that would benefit many. If a repo is used to store the config, in my opinion this repo shouldn't be considered a regular material for the pipeline. To make life less complicated, I would stick to using the latest config for auto/manual/timer triggers or even re-runs. Since the config now sits in a repo that users can manage themselves, they have the liberty of reverting to an older config when required.

mrmanc commented 9 years ago

@jyotisingh said:

Making the config a material for the pipeline would cause the pipeline to trigger for all possible changes in the pipeline config, unless you also intend to implement some sort of filter/blacklist for config-materials. This is a deviation from what happens today. … Also, I am wary of the config-repo-as-material concept, or even switching between config versions based on the selected revision, as I think it will introduce many unnecessary scenarios which do not concern the original goal.

I think that in the special case that the external config only defines what the (single) pipeline should do with the code contained in the same repo, then you absolutely want a change to the pipeline config to trigger the build, so that you fulfil the goal of failing fast. I don't want to later discover that someone else's pipeline change has caused the pipeline to break when I commit a change to the application code.

For example, if I have some pipeline instructions checked in alongside my project, and my pipeline only has this repo as a material, then adding a task to the pipeline instructions to match some new application code I've introduced would cause a build to happen, which I would want.

In this case the pipeline change and the application code would be atomic in the same commit of the same repo. This special case seems simpler to implement, and typical of the use case of externalising the pipeline definition. This seems to me to make it a good candidate to inform our choices about how anything more complicated should work, but again I wonder if we need anything else.

This scenario as a minimum viable product should fulfil what I (possibly incorrectly) see the goal of this work as: allowing people to edit the pipeline upstream of Go itself so that they don't have to interact with the slow GUI pipeline screens.

Am I correct that that is the goal? Only it can't be to improve performance, because as @tomzo has pointed out, the config will still need serialising and persisting all together on the Go server.

tomzo commented 9 years ago

I am really sorry that this thread is so long. But the problem is just big, and it seems there are as many opinions as users. Thank you all for participating.

Many of you oppose the feature of re-running using an older configuration. I'd like to note that so far I thought this is what everybody would want, so this is a little surprise for me. I will consider all the points mentioned above, since they are all very good.

I also realized that by proposing the config material and enforcing 'scm-config consistency' even in reruns, I am forcing a new behavior in Go, which is not desired by some users. From what I understand, there is a group of users that would only want to keep parts of the config in separate repositories, optionally in another format, while maintaining the concept of a single, latest source of truth.

Implementing configuration repositories with support only for the latest configuration would be much easier than what I have planned so far. I need to keep the concept of the latest, valid, global configuration anyway. So I think I will make this the default behavior - as @jyotisingh said:

I would stick to using the latest config for auto/manual/timer triggers or even re-runs.

Would it satisfy those users that opted for using latest configuration?

But I would make it optional (globally or per config-repo) to behave otherwise, according to the rules I posted already (reruns with old config, pipeline config as a configuration material). Which leaves the problem of migration - in the case mentioned by @arvindsv:

If version 75 introduces a new field in the config called "blah", then the API between the config repo plugin and the Go server will include "blah" as a field. Now, if the user reruns a version 70 build, then the plugin has to somehow migrate the config from version 70 to 75, and put in a "blah" field in the response.

IMO the most reasonable solution is:

If the user is desperate to re-run that stage then he/she will need to update the plugin so that it handles migration better. After all, this case happens only when Go was upgraded, right? So it is not that frequent. My gut is that there won't be that many breaking changes. The plugin is expected to return just pipeline and environment configurations, so any other change in the main XML schema will not cause plugins to break. The plugin-Go server API part is true though.

Replies

This causes a behavioral inconsistency. If you rerun a pipeline which is defined in the main config, then it will rerun with latest config. However, if you rerun a pipeline defined in a repository, it will rerun with old config.

I was aware of that. My thinking was that it is acceptable.

What we can do, from a Go config migration side is to try and provide good defaults, so that a config does not necessarily become invalid automatically. That's the best I can see us being able to do. What do you all think?

I agree. But I would add to that: whenever a new field is added, it should be marked how bad the lack of it is. E.g. an additional optional field should not result even in a warning, while adding a mandatory field should result in an invalid config if there is no sensible default.

@arvindsv regarding the config-api changes: these will be the first thing I do. I'll send you a link to review.

How is an old version of the config on the filesystem handled by the plugin, when asking for a specified version?

The plugin is only responsible for the format of a single checkout. It has no knowledge of the SCM used. There is a ConfigMaterialService which performs a checkout on demand, then asks the plugin to parse it into a configuration part instance. Since that can be slow if there are a lot of requests for older configuration versions, there might be an object cache of the latest configuration parts.

I assume that the ones in the main config are the initial list of pipelines, groups and environments into which the ones from the repos are added (if they're valid).

Yes. But please confirm that you understood that the model is not appending to CruiseConfig. Instead it is:

  1. create CruiseConfig from xml (it is valid by itself), by current Go configuration system.
  2. create PartialConfig from each config repository, by plugin.
  3. create MergeCruiseConfig from CruiseConfig and many PartialConfig instances.

class names may change

How are we handling atomicity? You mention that a wrong pipeline config from the repo can cause the config to be invalid. Will environments be added already? I'd assume a whole new version of the repo will either be valid and added, or invalid and not merged into the main config?

atomicity:

All merging and validation is done internally in ConfigMergeService. When a new merged configuration is created, its instance replaces the current configuration; then there is a single 'event' when the configuration is updated (propagated to the config changed listeners).
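
A compact sketch of that rule - validate first, swap once, notify once. The field and wiring are illustrative; only ConfigMergeService and ConfigChangedListener appear in the discussion.

import java.util.ArrayList;
import java.util.List;

interface CruiseConfig { boolean isValid(); }
interface ConfigChangedListener { void onConfigChange(CruiseConfig newCruiseConfig); }

class ConfigMergeService {
    private volatile CruiseConfig current; // replaced wholesale, never mutated in place
    private final List<ConfigChangedListener> listeners = new ArrayList<>();

    void onPartialsChanged(CruiseConfig mergedCandidate) {
        if (!mergedCandidate.isValid()) {
            return; // an invalid merge never becomes the current config
        }
        current = mergedCandidate; // single atomic replacement
        for (ConfigChangedListener listener : listeners) {
            listener.onConfigChange(current); // one 'event' per successful merge
        }
    }
}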

Will environments be added already?

Yes, since they seem simpler than pipeline configs I might get some feedback from them faster while coding. But I consider them optional. I may drop them if something goes wrong.

I'd assume a whole new version of the repo will either be valid and added, or invalid and not merged into the main config?

I consider this part of Missing valid config problem and I just don't know yet. Rejecting conflicting configuration part seems like the easiest solution.

However, there is a bit of a corner case: Everything is fine. Invalid commit into config-repo. Server is restarted.

I noticed this case as well. I have a new idea on that: instead of saving the entire merged config each time a new valid version is created, we just store the sources needed to reconstruct it at server start. The server then knows the URLs of the repositories and the revision in each of them, to build a merged configuration again. On start it could check out each part at the specified revision and do the merging.

I see it not showing the pipelines, if the config is invalid when it (re)starts. That's ok, IMHO.

But I would consider this as enhancement for later at this point.

Actually, adding to what I said just now, I see @mrmanc had asked whether allowing definition of pipelines and dependencies across config repos is necessary. It's a good question. I wouldn't mind not allowing it. It would reduce some complexity. What do you think?

It looks to me that a single configuration repo could be then validated independently of other parts. Which I would consider as a huge benefit. But maybe I am not seeing something.

@jyotisingh

Also, a pipeline can have 'n' number of materials defined. Now that we treat config-repo as another material for the pipeline, there is no way to dictate update for which material happens first. For example, consider a pipeline depends on a config-repo(say git), and a TFS repo. Lets say I committed to the config-repo, and also made a commit to the TFS repo. If TFS repo is picked for an update first, and the pipeline would be scheduled with the commit. In next cycle the config-repo gets updated, and then the pipeline triggers again. There is a wasted build here. If we change the code to not build again using the same revision, then the pipeline would not run with the updated config. Did I make any sense there?

This is discussed above a few times; you can search for 'scm-config consistency'. You can also see in my first comment that it is what I wanted to address in the first place - currently, updating the code first to expect something in Go's configuration which is not there yet may lead to a wasted build.

Also, a pipeline can have 'n' number of materials defined. Now that we treat config-repo as another material for the pipeline, there is no way to dictate update for which material happens first.

I am taking care of that. Updates still happen in parallel, but it is actually the waiting code that needs to be modified, so that we wait for the configuration materials first.

consider a pipeline depends on a config-repo(say git), and a TFS repo.

It is not recommended to keep the configuration in a repository other than the one the pipeline builds from. But you will have such flexibility anyway. Also mentioned above in https://github.com/gocd/gocd/issues/1133#issuecomment-109008209

Would all such pipelines be stored as separate files in the repo and do they all go to the same file?

There is an extension point for that; you can do as you like.

tomzo commented 9 years ago

This is an alternative solution to what I have proposed so far. Please tell me what you think of this approach/opinion:

Pipelines which build configuration

By proposing configuration in repositories we are shifting the work-style to CI configuration as code, making each repository a Go configuration software project. It is just like any other software project: it must be written (in the config repo), built (parsing), tested (validation) and released (accepted into Go's single source of truth). Taking this further, we could have pipelines building the Go configuration, the final pipeline being the one that merges the parts. The benefits of that would be:

My inspiration for such an approach is that if somebody actually has 700 pipeline configurations, then this is a project with its own life - but it is slowly becoming unmaintainable. Another place where it could work is rapid development, when a new project is evolving quickly and pipelines are changing very often.

And the amount of work needed to make pipelines build configurations may actually be smaller than what I have proposed so far:

But:

It would also work well with current concept of latest, valid global configuration instance.

There would be new task type - ImportConfigurationPart, so existing task extension point can be used.

I can elaborate on this approach more if there is any interest.

arvindsv commented 9 years ago
Regarding migration within the config repo:

After all, this case happens only when Go was upgraded right? So this is not that often.

Correct. Even then, only sometimes.

My gut is that there won't be that many breaking changes.

Usually, yes. The tendency has been to make backwards-compatible, largely optional changes. But, don't know about the future.

... additional optional field should not result even in warning, adding a mandatory field should result in invalid config if there is no sensible default

Right, that happens today, as well. We should try to keep it that way.

... There is ConfigMaterialService which performs a checkout, on demand, then asks the plugin to parse it into configuration part instance.

Since a repo contains multiple pipeline definitions, a checkout to a specific revision will bring along the whole directory, and it will be given to the plugin. From its response, this service will need to get and use only the specific pipeline which needed a rerun. Will have to consider what happens to dependencies. In version 70, pipeline A (the one being rerun on version 70) might have been dependent on pipeline B. In version 75, it is dependent on pipeline C. In some cases, pipeline B might not exist any more. So, it will need to be validated and build failed (or a message shown). If the default is to use latest config, then it doesn't matter for most users, since this checkout won't need to be done there.

Modeling

But please confirm that you understood that model is not appending to CruiseConfig ...

Yes, I understand that the initial config is from the XML, and might ignore the partials (<config-repo> sections) or might just ensure that they follow the schema, but not check the contents. Then, the partials are brought in, merged and a MergeCruiseConfig object now acts like the full config and adheres to the CruiseConfig "interface". From then on, everything that needs the "latest, valid config" uses that. It'll have extra behavior to get its partials, etc. as needed. That's what I understand.

... Rejecting conflicting configuration part seems like the easiest solution.

When you say "part", you mean <pipeline>? Not the whole <config-repo>, I think? I thought it'd be easier to ignore the whole repo.

So that server can know URLs of repositories and revisions in each of them to build a merged configuration again

I think there's no getting away from the possibility of a wrong config though. If the repository itself is unavailable, then it will have to handle this case during a restart and do something about it (even if it has the URL and the revisions).

[pipelines not showing up on dashboard] But I would consider this as enhancement for later at this point.

I think it's something that will come up, and cannot be handled later. If it is invalid, it is invalid. How that validity is handled decides whether the pipelines can show in the dashboard or not, I feel. But, it is not extra effort. If the pipelines are not in the current config, because the repo is invalid, then the pipelines won't show, that's all.

It looks to me that a single configuration repo could be then validated independently of other parts. Which I would consider as a huge benefit. But maybe I am not seeing something.

Right. I think I'm arguing for the same. I don't know. From the earlier proposal, it looked like this C->A below is valid, and D->B is invalid:

<cruise>
  <pipeline name="A">...</pipeline>
  <config-repo url="abc">
      <!-- Defines <pipeline name="B"> -->
  </config-repo>

  <config-repo url="def">
      <!-- Has a <pipeline name="C">, which depends on pipeline A -->
      <!-- Has a <pipeline name="D">, which depends on pipeline B -->
  </config-repo>
</cruise>

I was thinking that not allowing C to reference A would be good. That means that the config repo is independently validatable. It would force each repo to define all of its dependent pipelines. But, I can see how it might be limiting, and might cause teams to not have the ability to talk to each other and fetch artifacts from each other's pipelines, if each team used their own <config-repo>.

arvindsv commented 9 years ago

About the alternate proposal: I am smiling about the recursive nature of it, and it does seem to have the advantages you mention. I probably need a little more time to think through this. Some very early-stage thoughts:

I think, it might be easier to have some dummy example repos setup with main config and config-repo parts, for everyone to see how the dependencies would work, and how the dashboard would show, for every commit. It's becoming a little difficult to keep all of this in my head. For me at least. :)

tomzo commented 9 years ago

When you say "part", you mean ? Not the whole , I think? I thought it'd be easier to ignore the whole repo.

I meant ignoring whole config repository - <config-repo>.

I will comment on the second proposal later. I thought of it yesterday and it's just too early.

I think, it might be easier to have some dummy example repos setup with main config and config-repo parts, for everyone to see how the dependencies would work, and how the dashboard would show, for every commit. It's becoming a little difficult to keep all of this in my head. For me at least. :)

OK, good idea. I will prepare something like that. It fits as example to both ideas.

tomzo commented 9 years ago

I have prepared example config repositories. In order of complexity:

  1. https://github.com/tomzo/gocd-main-config contains the main cruise configuration, the one stored in /etc/go/cruise-config.xml
  2. https://github.com/tomzo/gocd-indep-config-part - configuration part with no external references. At least this must be supported.
  3. https://github.com/tomzo/gocd-refmain-config-part - configuration part that refers to pipelines from main.
  4. https://github.com/tomzo/gocd-refpart-config-part - configuration part with references to other configuration part repository

In the main config https://github.com/tomzo/gocd-main-config there is a config-repos branch with config-repo sections to import elements from the other repositories. There is also an equivalent configuration XML and some screenshots with a beautiful painting of what would be different in the UI.

I will add a branch for second proposal soon.

arvindsv commented 9 years ago

Thanks. That is helpful. The stage tag can have an <authorization> section, I think. Will that be allowed? I guess so.

I know this is a plugin implementation decision, but I'd just use .go.xml or .gocd.xml, instead of .cruise.xml, since cruise is the old name, and might confuse some people.

... some screenshots with a beautiful painting of what would be different in UI.

:) I thought it was nice.

tomzo commented 9 years ago

The stage tag can have an <authorization> section, I think. Will that be allowed? I guess so.

I forgot that. It should be allowed. I forbid the authorization section for an entire pipeline group to avoid conflicts (and it makes more sense anyway).

I know this is a plugin implementation decision, but I'd just use .go.xml or .gocd.xml, instead of .cruise.xml, since cruise is the old name, and might confuse some people.

Will do.

tomzo commented 9 years ago

..whoops

I have drafted how I imagine pipelines building pipelines. This is just an example with single, independent configuration part repository.

There are vsm graphs of cases that we discussed.

Would it work well with the concept of code repo + config repo being the same one?

It is on pipe1 - https://github.com/tomzo/gocd-main-config/tree/config-pipes#pipeline-1. The dependencies of pipe1 are config-builder and the scm repository. Because it is the same repo that config-builder uses, there is the standard Go dependency resolution goodness.

tomzo commented 9 years ago

Security: Sending (current config + config-repo) or equivalent to a possibly unknown agent. Can be mitigated a bit with resources to decide agent. Unless current config is not needed, to validate the repo (is this the case?).

Here I was thinking of making each config-repo totally independent, so it would not need the main config to validate itself. But it could be sent to an agent. As you propose, resources and proper scheduling would do it.

User flow: Does the user add a <config-repo> and Go automatically creates a pipeline for it? Or does the user create a pipeline for it, and Go sees that it is a special pipeline? I guess the former.

I'd say full control to the user. The plugin would provide extra tasks (parse, validate, import). The merge task would be embedded in Go. I think the pipeline should not be special, if possible.

As you said, merging is still needed. I don't understand why tracking is not needed.

I meant extra tracking code in UI, because there are pipelines, and vsm, and diffs all helping to tell what is being used in each pipeline.

If there are three quick commits to the repo, then there's a chance that Go will bunch commits into one run. So, there might not be a run of the pipeline whose artifacts are those of the middle commit, for instance. Right?

You'll have to explain this some more. I know how bunching commits together works, but which pipeline run do you refer to: the one building the config or the dynamically created one? I don't see a problem in not importing some configuration or running fewer pipeline runs.

Migration is still a problem. However, earlier, you could at least have asked the plugin for help giving you the latest pipeline object. However, now, Go is left with some (presumably JSON) artifacts representing an old config, and has to figure out how to upgrade it to the current version, right?

Yes, I was thinking JSON. And migration is basically the same problem, plus there is a middle format (JSON) in the system. However, we could benefit from that - because now we always have each config part at the last point in JSON, which is imported on the server side anyway, and we can take care of migrating its contents, e.g. if a field is missing.

Un-availability of agent: If the agent is busy doing something else, then a config change will not be seen by Go, since the pipeline won't run.

That will be the user's responsibility - to allocate enough agents for configs. We could recommend dedicating some agent for that. Or these pipelines could actually run on the server; a single service would do it, I think.

I also have some (hopefully) final thoughts on using old configurations. I will post these soon.

tomzo commented 9 years ago

1 Rerun on old config problem

Assumptions:

Problem: if we try to rerun on an old config, it might happen that an old dependency is now missing, or renamed, or has other stages, etc.

The immediate solution is to drop such a feature and just always use the latest config. An invalid config cannot be introduced, so dependencies never get broken. This is what I will do now.

2 Run old pipeline problem

When running an old pipeline on the latest config the same problem exists (try running a pipeline from a year ago): the latest configuration is different - other stages, other dependencies.

True reason and some design thoughts

The first problem is just a symptom of trying to attribute a configuration to a sole pipeline, while the reality is that a valid configuration is an attribute of a larger context.

For a rerun on an old configuration to work, the pipeline rerun would have to execute within that larger context (where the configuration was valid and consistent, and all dependencies existed).

So in more abstract language: when I try to rerun a pipeline on an old configuration, I am trying to force part of that old context to exist simultaneously with the latest context. This is currently impossible because of how GoConfigService is designed now:

We spoke previously about validation scopes being a nice gesture towards the future of Go. In my opinion, here they are - the larger contexts mentioned above.

I believe this is very important:

Identifying this and actually modeling the internal Go services to operate on contexts could have huge benefits to what Go would be capable of:

I know it may sound like future-speak now. But with config repositories I will be operating on a domain that obviously has such contexts, while the services in Go do not. The considerations about allowing dependencies between configuration repos, the main config, reruns, etc. are just examples of the abstraction above.

My questions

tomzo commented 9 years ago

I have added the critical changes in config-api that were already discussed previously.

All changes can be reviewed at https://github.com/gocd/gocd/compare/master...tomzo:1133-api

I will be working on top of that in the next branch, so I would appreciate any complaints sooner rather than later.

My general impression is that it will work. I have already managed to implement the merge classes and even route some update calls to the main configuration. Making CruiseConfig, EnvironmentConfig and PipelineConfigs into interfaces required updates mainly in tests, so that there is new BasicCruiseConfig() instead of new CruiseConfig(). I have also initially updated Rails to check that this did not break anything - https://github.com/gocd/gocd/compare/master...tomzo:1133-rails

tomzo commented 9 years ago

Also, this interface seems to listen for too much:

public interface ConfigChangedListener {
    void onConfigChange(CruiseConfig newCruiseConfig);
}

This listens for any update of the configuration. There is (almost) no real listener that consumes the entire CruiseConfig. It should be replaced by smaller listeners like PipelineConfigsChangedListener and EnvironmentsConfigChangedListener.
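
The narrower listeners could look something like this (the names come from the sentence above; the signatures are assumptions):

// Hypothetical signatures, referring to the existing PipelineConfigs and
// EnvironmentsConfig model classes.
interface PipelineConfigsChangedListener {
    void onPipelineConfigsChange(PipelineConfigs changedGroup);
}

interface EnvironmentsConfigChangedListener {
    void onEnvironmentsConfigChange(EnvironmentsConfig newEnvironments);
}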

arvindsv commented 9 years ago

This should be replaced by smaller listeners like PipelineConfigsChangedListener, EnvironmentsConfigChangedListener.

I agree that it's the right thing to do. I guess the issue with doing that is a full XML edit: when that is done, there are too many corner cases in doing a diff between the old and new config to figure out what changed and which listeners to notify.

I've ranted enough about mistakes related to allowing the config to be edited, in #838. So, I won't start again.

arvindsv commented 9 years ago

Ok, reviewing that took a long time. 25 commits. But, the good commit messages and the fact that they did one thing, usually, helped.

My overall thought regarding the modeling is similar to what I said here - regarding inheritance vs. just extension. The more I look at it, the smaller the difference between MergeCruiseConfig and BasicCruiseConfig, and between MergePipelineConfigs and BasicPipelineConfigs. Most of the logic has to be repeated. The biggest part that changes is how they are created. Arguably that can be done externally, and using ConfigOrigin and maybe a strategy, all of this can be achieved without cleaving the whole existing model and having a parallel hierarchy.

The fact that the tests are all duplicated (with extra ones added) and that there are quite a few class-based conditions adds to that feeling. It tells me that the two are nearly the same and so shouldn't be in an inheritance relationship. Does what I say make sense? Let me know.

Putting that apart for now, my thoughts on the changes themselves:

List interface of PipelineConfigs

I wonder if it makes sense to change PipelineConfigs from extends List<PipelineConfig> to extends Iterable<PipelineConfig>? It would save you from having to implement a bunch of irrelevant methods in MergePipelineConfigs. We'd need to put a couple of methods such as size(), contains(), etc. on the interface, but that is better than bringing the whole big List interface, I think.
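
Something like this sketch, where the exact set of methods kept on the interface is a guess:

// Keep iteration, but expose only the collection methods actually needed.
public interface PipelineConfigs extends Iterable<PipelineConfig> {
    int size();
    boolean contains(PipelineConfig pipeline);
    PipelineConfig get(int index);
    boolean add(PipelineConfig pipeline);
    // ...plus the existing domain methods (getGroup(), validate(), etc.)
}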

Duplication in MergePipelineConfigs

I'd remove a bit of duplication in MergePipelineConfigs. I'd change the initial constructor parts to something like this:

// Assumes the field is initialized, e.g.:
// private List<PipelineConfigs> parts = new ArrayList<>();

public MergePipelineConfigs(PipelineConfigs... parts) {
    this.parts.addAll(Arrays.asList(parts));
    validateGroupNameUniqueness(this.parts);
}

public MergePipelineConfigs(List<PipelineConfigs> parts) {
    this.parts = parts;
    validateGroupNameUniqueness(this.parts);
}

private void validateGroupNameUniqueness(List<PipelineConfigs> parts) {
    String name = parts.get(0).getGroup();
    for (PipelineConfigs part : parts) {
        String otherName = part.getGroup();
        if (!StringUtils.equals(otherName, name))
            throw new IllegalArgumentException("Group names must be the same in merge");
    }
}

But, as I said earlier, there is a lot of duplication across classes and that makes me a little anxious.

Unimplemented methods

There are many parts, such as MergePipelineConfigs#removeAll, which return false since they're not yet implemented (as you mention in e41e3f7). I think it would be better to replace these with throw new RuntimeException("TODO: Not implemented yet"); so that they fail if used from somewhere. If not, we need to be careful that they are all implemented before everything is done.
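
For instance, a hypothetical override (removeAll is just one of the inherited List methods in question):

@Override
public boolean removeAll(Collection<?> c) {
    // fail loudly instead of silently returning false
    throw new RuntimeException("TODO: Not implemented yet");
}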

Class-based conditions

Do you think it makes sense to change usages of code like this.configOrigin instanceof FileConfigOrigin (in BasicPipelineConfigs#validate, for instance) to something like this.configOrigin.isLocal()? I think it makes it more easily understandable that authorization can only be local, etc., and refactoring is easier.
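
A minimal sketch of that change. RepoConfigOrigin stands in here for a remote origin and is an assumption; FileConfigOrigin and isLocal() come from the comment above.

interface ConfigOrigin {
    boolean isLocal();
}

class FileConfigOrigin implements ConfigOrigin {
    public boolean isLocal() { return true; }   // main XML on the server
}

class RepoConfigOrigin implements ConfigOrigin {
    public boolean isLocal() { return false; }  // fetched from a config repository
}

// Before: if (this.configOrigin instanceof FileConfigOrigin) { ... }
// After:  if (this.configOrigin.isLocal()) { ... }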

Didn't understand test

Can you explain shouldValidateThatAuthorizationCannotBeInRemoteRepository in BasicPipelineConfigsTest? Why is it that an origin should not be set on BasicPipelineConfigs? More importantly, how does that relate to authorization? Especially since there is no setAuthorization call in that test.

Test name confusion

Can you change tests like shouldReturnFalseIfPipelineNotExistIn2Parts to shouldReturnFalseIfPipelineNotExist_WhenIn2Parts or something like that? It took me a while to understand that it is not about the pipeline not existing in two parts, but about a pipeline existence check when there are two parts.

tomzo commented 9 years ago

duplication

I do not like it either, and I am open to a working solution that does not require this much duplication. (Below is what I have planned so far.)

The fact that the tests are all duplicated (and extra ones added) and there are quite a few class-based conditions adds to that feeling. It tells me that the two are nearly the same, and so, shouldn't be an inheritance relationship. Does what I say make sense? Let me know.

My plan was to pull an abstract class out under MergeCruiseConfig and BasicCruiseConfig with the common implementation. A similar thing can be done in the unit tests. I wanted to leave it for later. They look similar - some methods are copy-paste, but some have just minor differences in 1 line out of 10 - so it requires some gymnastics to figure out what should be abstract. There is some code yet to be implemented (mostly for updates) that will not be the same as in the BasicCruiseConfig implementation. There is also missing validation of references between parts, which does not apply to BasicCruiseConfig either. I wanted to write it first and then refactor.

There are 2 reasons why I went for MergeCruiseConfig and CruiseConfig being an interface:

  1. I have full control of the MergeCruiseConfig implementation (which initially I thought would be much different than BasicCruiseConfig).
  2. I think CruiseConfig is too big. Now that it is an interface it can be split into smaller interfaces. Then the rest of the system can start using just parts of it, and there can be implementations smaller than CruiseConfig at once.

But that was very long ago (4 days) and now I see that they are similar.

The biggest part that changes is how they are created. Arguably that can be done externally, and using ConfigOrigin and maybe a strategy, all of this can be achieved without cleaving the whole existing model and having a parallel hierarchy.

You would have to be more specific; I just don't see it. There are about 200 references to CruiseConfig that expect to be able to create it directly with a constructor. They also have no knowledge of ConfigOrigin.

I wonder if it makes sense to change PipelineConfigs from extends List<PipelineConfig> to extends Iterable<PipelineConfig>? It would save you from having to implement a bunch of irrelevant methods in MergePipelineConfigs. We'd need to put a couple of methods such as size(), contains(), etc. on the interface, but that is better than bringing the whole big List interface, I think.

OK, I do not want to implement them either. I wasn't sure how much of it is used higher up, because the old PipelineConfigs was in fact a list.

Unimplemented methods

I will be throwing now.

Class-based conditions

OK. It does make more sense.

shouldValidateThatAuthorizationCannotBeInRemoteRepository

You made me realize that this test has 2 errors. I will fix it. I wanted to validate that the Authorization section is defined only locally.

Test name confusion

Fine. I will put more effort into test names. Sorry about that.

Ok, reviewing that took a long time. 25 commits. But, the good commit messages and the fact that they did one thing, usually, helped.

Thanks for the review. :)

arvindsv commented 9 years ago

My plan was to pull abstract class under MergeCruiseConfig and BasicCruiseConfig with common implementation. Similar thing can be done in unit tests. I wanted to leave it for later. They look similar, some methods are copy-paste, but some have just minor differences in 1 line out of 10 so it requires some gymnastics to figure out what should be abstract.

Yes, I was thinking of that as well, abstract for both. But, once I realized that most of it would be in the abstract class, I felt that it was wrong to introduce that hierarchy just for that.

There is some code yet to be implemented (mostly for updates) and that will not be the same as in BasicCruiseConfig implementation. There is also missing validation of references between parts, which does not apply in BasicCruiseConfig as well.

That's where I felt a strategy based on ConfigOrigin might help. I've tried to explain this below.

There are 2 reasons why I went for MergeCruiseConfig and CruiseConfig being an interface: I have full control of MergeCruiseConfig implementation. (Which initially I thought would be much different than BasicCruiseConfig). I think CruiseConfig is too big. Now when it is an interface it can be split into smaller interfaces. Then the rest of the system can start using just parts of it. Then there can be implementations smaller than CruiseConfig at once.

Yes, if they were very different, it would make sense to me. However, even if there were only one class (and MergeCruiseConfig did not exist), pulling it out into an interface makes sense, to do all that you said. We don't need two implementations of the CruiseConfig interface for it to be an interface. What you say about the size of it is spot on.

You would have to be more specific, I just don't see it. There is like 200 references to CruiseConfig that expect to be able to create it directly with a constructor. They also have no knowledge of ConfigOrigin.

What I mean is, if you ignore test usages for now, there should only be one real usage of the CruiseConfig constructor - somewhere around here, where it is converted from the XML to the object. All that needs to happen there is to have a first pass nearly ignoring the <config-repo> sections and creating a config (BasicCruiseConfig) and then, on a second pass talking to the plugins, to get their config-related representations (PipelineConfig, EnvironmentConfig) of whatever they have in their repo, and putting the correct ConfigOrigin object inside them. At this point, we can inject a strategy object into them for behavior differences between when they're loaded locally or remotely. Or, the ConfigOrigin itself can do that (but I'd still think it's better to move that out of the nice and small ConfigOrigin object).
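
A rough sketch of that strategy idea, under the assumption that the strategy gates edit behavior (the strategy class names are invented; only ConfigOrigin and PipelineConfig come from the discussion):

// One config model; local-vs-remote differences injected as a strategy.
interface ConfigModificationStrategy {
    boolean canEditPipeline(PipelineConfig pipeline);
}

class LocalConfigStrategy implements ConfigModificationStrategy {
    public boolean canEditPipeline(PipelineConfig pipeline) {
        return true;  // defined in the main XML, editable as today
    }
}

class RemoteConfigStrategy implements ConfigModificationStrategy {
    public boolean canEditPipeline(PipelineConfig pipeline) {
        return false; // defined in a config repo, read-only on the server
    }
}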

If you agree with that, then changing the test usages of the constructor should not be hard, since none of the existing tests would have any <config-repo> sections.

Once this is done, everything continues to work, since the config object has not changed at all. More importantly, work is reduced, I feel.

tomzo commented 9 years ago

I get what you mean now. Avoiding 2 classes of CruiseConfig sounds reasonable.

I am a little worried though about this:

All that needs to happen there is to have a first pass nearly ignoring the sections and creating a config (BasicCruiseConfig) and then, on a second pass talking to the plugins, to get their config-related representations (PipelineConfig, EnvironmentConfig) of whatever they have in their repo, and putting the correct ConfigOrigin object inside them.

The second pass that you mention here is done much later (after materials get polled and updated on db for the first time), much higher in services.

Or, the ConfigOrigin itself can do that (but I'd still think it's better to move that out of the nice and small ConfigOrigin object).

I agree. Let ConfigOrigin be just an indicator of where config is from. The strategy classes would be separate.

I will continue to work on the services now, since that will shed more light on what is actually expected from (Merge)CruiseConfig. I will then get back to refactoring the API.

arvindsv commented 9 years ago

I am a little worried though about this: [remove part related to first and second pass] The second pass that you mention here is done much later (after materials get polled and updated on db for the first time), much higher in the services.

Ah, yes. You're right. With or without MergeCruiseConfig, that's an issue that needs to be dealt with. Upon startup, the partials might not be available. It is up to us to figure out what to do at that time. My gut says: ignore those partials till polling has happened and the config needs to be updated. The partials cannot cause the main config to be invalid (till they're present, that is). So, the main config can be considered a valid subset of whatever the full config will be, once polling is done.

So, you're right that the second pass won't happen immediately (or maybe ever). It'll only be the first pass, where the ConfigRepoConfig objects are created, but the full config, with the pipelines defined in them are not yet available. As the repos get polled, and become available, the main config will get those pipelines made available to it, potentially changing how the dashboard looks to the user.

During that startup time, we should show a warning that those parts are not yet available, I guess.

tomzo commented 9 years ago

My gut says, ignore those partials till polling has happened and the config needs to be updated.

This is what I am going for now. The remote parts will not be visible at all until first polling completes.

tomzo commented 9 years ago

I am trying to design the big picture of all the new services and how each would operate. I have to understand all the current services well, so please correct me if I am wrong anywhere below.

My understanding of material polling

Config holder and config for edit

Questions

Is the config for edit ever rendered in the UI? Is it used anywhere outside of the Java core services? Do you think it would make any sense to return the merged cruise config as the config for edit, knowing that it is mostly uneditable? Do you think the system (the dashboard being most affected, I guess) would behave nicely when we return the merged cruise config as the current config, but the smaller, main cruise config as the config for edit?

Assumptions and goals

This is a specification of what I am currently going for in how the services would operate. So if there is some point you don't like, please let me know.

Behavior and features

Hung material

What happens when one material polling gets hung:

Failed parsing

When the plugin fails, or the format is bad, or migration fails in the configuration repo checkout, then the material update completes but the config partial remains old.

reuse pollers directories

We should reuse the checkouts (in pipelines/flyweight) done by the standard material pollers when updating the database. It implies that it must be done as part of polling, i.e. between MaterialUpdateMessage and MaterialUpdateCompletedMessage. Is this acceptable?

cruise config with strategy pattern

If we go down the strategy pattern instead of merged config, then would it matter to the magical XML writer which strategy is currently injected into CruiseConfig? I guess not, if it serializes based on the reflected object graph only. It does not call any methods on CruiseConfig, or does it? I am asking about this because I would not want to return MergedCruiseConfig in configForEdit. From the perspective of assembling a new merged config, I do not think it matters whether we inject a new strategy or create a new MergedCruiseConfig class. But it seems to me now that the work around the config for edit would be reduced if we use a strategy.

[I will post description of services I have designed so far tomorrow, they are drafted in code already, no tests]

tomzo commented 9 years ago

Services

There are more comments than code now; it does not work yet, but it looks to me that if implemented it might.

I do not think that ScmCheckoutService should work the way it does now. It is not really a service with real dependencies; instead it relies on being called from the top so that it can do its job. I think it is OK to have it like that initially, but in the end it should be that MaterialDatabaseUpdater and GoRepoConfigDataSource both use ScmCheckoutService to get their clean checkout. What do you think?

@arvindsv Please comment on the other services as well; I will then start to implement them and cover them with tests.

arvindsv commented 9 years ago

[I'll reply in parts. That means there will be a lot of replies. Sorry people who get notifications through email]

My understanding of material polling

Everything you said about this is correct. The trackingId is just for performance logging and nothing else; it can be ignored when analyzing this. About the slow and fast pollers: yes, if there's a slow update, then the other materials won't be picked up for checking as quickly (there are 10 material polling threads, which pick up MaterialUpdateMessage messages from the queue). If 2 are slow, the other 8 will pick up the load. Again, trackingId is irrelevant. If you look, the inProgress "queue" holds materials only, and the material that comes back in the MaterialUpdateCompletedMessage is the one that is used for the queue (to prevent the same material from being picked up for polling again).
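
A simplified sketch of that mechanism (class and method names other than MaterialUpdateMessage and MaterialUpdateCompletedMessage are invented):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// A material already being polled is not queued again until its
// MaterialUpdateCompletedMessage comes back.
class MaterialUpdateCoordinator {
    private final Set<String> inProgress = ConcurrentHashMap.newKeySet();

    void requestUpdate(String materialFingerprint) {
        if (inProgress.add(materialFingerprint)) {
            // post a MaterialUpdateMessage; one of the 10 polling threads picks it up
        }
    }

    void onCompleted(String materialFingerprint) {
        inProgress.remove(materialFingerprint); // allow the next poll for this material
    }
}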

Config holder and config for edit

Again, correct. UpdateConfigCommand is used from the Rails parts a lot (and that's not very nicely done, IMO), but the concept is not bad. XmlPartialSaver is probably used just in tests and maybe when editing the whole XML. Either way, it should be deprecated and maybe removed soon.

Questions

Is config on edit ever rendered in UI?

GoConfigHolder has configForEdit (the config as seen by the user in the filesystem and in the "Config XML" section) and config (the real config used by most of the system, after preprocessing). So, the configForEdit will be edited in the UI, or loaded from the file system.

Is it used anywhere outside of java core services?

I think so. If you search for "getConfigForEditing", you'll see some usages in Rails.

Do you think it would make any sense to return the merged cruise config as config for edit? Knowing that it is mostly uneditable?

I don't think so. For instance, the config for edit is used to show the XML in the "Config XML" section in Admin. Seeing pipelines from a pipeline repo would be confusing.

Do you think system (dashboard being most affected I guess) behave nicely when we return merged cruise config as current config, but smaller, main cruise config as config for edit?

I think so. Dashboard shouldn't really care about config for edit. It uses the GoConfigService to get a list of pipelines and then works on it.

arvindsv commented 9 years ago

Behavior and features

  • references allowed from partial configurations to each other or to main. I still do not see anything blocking that; it is just a matter of a last validation before accepting the merged cruise config

Right. As long as everything is valid at the end, it doesn't matter. However, if pipeline-repo-1 is dependent on pipeline-repo-2, then the unavailability of pipeline-repo-2 for some reason (invalid or not reachable) means that pipeline-repo-1's changes will not be accepted as well.

  • when the server starts it loads only the main configuration from XML. Then we wait for material updates of the defined config-repos, then the configuration merging kicks in and fires the config-changed event a few times, as each partial gets successfully merged into the current config.

Ok. Makes sense. Again, because of the decision to allow inter-pipeline-repo dependencies, if pipeline-repo-1 comes in first and depends on pipeline-repo-2 whose config change event is not yet fired, what happens? When pipeline-repo-2's config change event comes in, will the "latest" of pipeline-repo-1 be considered? Basically, how is order decided, if there's no "gong" (timer) which says I want the latest of all partial pipeline repos.

as material updates are done, so are the parsings of each partial configuration; then an event is fired that a new merged config must be assembled.

Continuation of previous point. Inter-pipeline-repo dependencies might not allow all partial configs to be valid.

there is never a situation when invalid merged config is considered as current config

:+1: Very important, in my view. Once this is adhered to, the rest of the system can continue depending on the current config as a valid one.

Hung material

I think what you say makes sense.

Failed parsing

  • when the config repo and the pipeline material are the same - if the pipeline were to schedule, it would use the old configuration with the new commit, violating config-SCM consistency. It will not be allowed to schedule until the partial is fixed.

Right. That makes sense, but it needs a change in the scheduler, where it needs to check the ConfigOrigin (and revision) of the config of a pipeline, before deciding to schedule it. Might be worth working with @jyotisingh or @srinivasupadhya, who know that much better than I do. I'd live with this inconsistency for now, and get to it at the end, since it's a different part of the system.

  • when the config repo and the pipeline material are different - same as in the hung case. (The latest partial is then old; if the pipeline were to schedule, it would use configuration from an old commit in the config repo with a new commit from the SCM repo.)

Ok, makes sense.

reuse pollers directories

Yes, what you're saying is acceptable. I think they can be reused (and since we're talking about scheduling, etc., they'll need to be reused). The only thing I'd consider during polling is to try and prioritize config polling over other material polling. We'd have to worry about starvation, etc., but having config polling wait in the queue because of slow material pollers might be ... iffy. Again, we can use the default queue and the same priority for now, but we might need to consider changing it later.

cruise config with strategy pattern

If we go down the strategy-pattern route instead of a merged config, would it matter to the magical XML writer which strategy is currently injected into CruiseConfig? I guess not, if it serializes based on the reflected object graph only. It does not call any methods on CruiseConfig, or does it?

Only a few. It calls methods annotated with PostConstruct from here. Apart from that, nothing else that I know of.

I am asking about this because I would not want to return MergedCruiseConfig in configForEdit. From the perspective of assembling a new merged config, I do not think it matters whether we inject a new strategy or create a new MergedCruiseConfig class. But it seems to me now that the work around config for edit would be reduced if we use the strategy.

I'm not sure I understand this. I understand that you don't want to (and shouldn't) return the merged config in config for edit. But, I don't understand the rest. It might help if you give me an example of what you think will be in the strategy. For instance, I thought something like "canPipelineBeEdited" might be in the strategy. I don't know what that should return in a config for edit, where it might not have any idea about the pipeline (if it is in a repo).

[I'll need a little more time to review the "Services" comment and the related code. Hope that's ok. I'll try and do it later today, or early tomorrow morning.]

tomzo commented 9 years ago

As long as everything is valid at the end, it doesn't matter. However, if pipeline-repo-1 is dependent on pipeline-repo-2, then the unavailability of pipeline-repo-2 for some reason (invalid or not reachable) means that pipeline-repo-1's changes will not be accepted as well.

and

Again, because of the decision to allow inter-pipeline-repo dependencies, if pipeline-repo-1 comes in first and depends on pipeline-repo-2 whose config change event is not yet fired, what happens? When pipeline-repo-2's config change event comes in, will the "latest" of pipeline-repo-1 be considered? Basically, how is order decided, if there's no "gong" (timer) which says I want the latest of all partial pipeline repos.

I should have mentioned that I was planning to tolerate such a case. Let's say pipeline-repo-1 has a new reference to a new element in pipeline-repo-2, and it gets pushed to the config repo.

  1. pipeline-repo-1 gets parsed
  2. pipeline-repo-2 is not yet parsed (because it is slower or not pushed yet), so the attempt to merge pipeline-repo-1 into the current configuration fails validation. This merged config does not get accepted; it is wasted. We stop scheduling pipelines from repo-1.
  3. Lazy pipeline-repo-2 comes in and gets parsed. The attempt to merge all parts again succeeds this time, enabling repo-1 and repo-2.

You will also see that in the code I have written so far.

I was also considering waiting for all/some configuration repositories to finish polling before assembling a new merged config, but it got too complex, so I decided to take this shorter, simpler solution.
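A minimal sketch of that tolerant merge loop, with hypothetical names (Partial, Merged, Merger are stand-ins, not the real classes):

import java.util.List;

class TolerantMergeLoop {
    interface Partial {}
    interface Merged { boolean isValid(); }
    interface Merger { Merged merge(List<Partial> partials); }

    private final Merger merger;
    private volatile Merged currentValid; // last merged config that passed validation

    TolerantMergeLoop(Merger merger, Merged initial) {
        this.merger = merger;
        this.currentValid = initial;
    }

    // Called whenever any partial finishes parsing, with the latest of ALL partials.
    void onPartialParsed(List<Partial> latestPartials) {
        Merged candidate = merger.merge(latestPartials);
        if (candidate.isValid()) {
            currentValid = candidate; // step 3: repo-1 and repo-2 both become live
        }
        // else: the candidate is wasted (step 2), and scheduling of affected
        // pipelines stays stopped until a later parse makes the merge valid.
    }

    Merged current() { return currentValid; }
}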

Best solution?

At some point it seemed to me that the best (by quality) solution would be that each config-repo section must declare extra information about which references are allowed, e.g. repo-1 declares 'I may have references to repo-2'. Then, in the services, this allows two things:

What do you think of this? I decided against it only because it was too much effort.

cruise config with strategy pattern

I'm not sure I understand this. I understand that you don't want to (and shouldn't) return the merged config in config for edit. But, I don't understand the rest. It might help if you give me an example of what you think will be in the strategy. For instance, I thought something like "canPipelineBeEdited" might be in the strategy. I don't know what that should return in a config for edit, where it might not have any idea about the pipeline (if it is in a repo).

I think the strategy should have the partial configs as members. It would be aware of the ConfigOrigin in each config instance and interpret those to answer things like canPipelineBeEdited. What I meant by reduced work was that with the strategy we could easily have 3 strategies - basic, merged, merged for edit. The third one would be aware of all configs from repos, so that it can return false in the case you mention:

I thought something like "canPipelineBeEdited" might be in the strategy. I don't know what that should return in a config for edit, where it might not have any idea about the pipeline (if it is in a repo).

But it would still make the cruise config look like the basic, editable XML only. Basically it would hide the config-repo parts and use them only when necessary.
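A sketch of what those strategies could look like - all names here are hypothetical, not the actual config-api classes:

import java.util.List;

interface ConfigStrategy {
    boolean canPipelineBeEdited(String pipelineName);
}

// basic: everything comes from the local XML file, so all pipelines are editable
class BasicStrategy implements ConfigStrategy {
    public boolean canPipelineBeEdited(String pipelineName) { return true; }
}

// merged (and merged-for-edit): knows which pipelines came from config repos
class MergedStrategy implements ConfigStrategy {
    private final List<String> remotePipelines; // pipelines defined in config repos

    MergedStrategy(List<String> remotePipelines) { this.remotePipelines = remotePipelines; }

    public boolean canPipelineBeEdited(String pipelineName) {
        return !remotePipelines.contains(pipelineName); // remote parts are read-only
    }
}

A "merged for edit" strategy would look like MergedStrategy, but hide the remote parts from serialization while still consulting them for answers like the one above.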

arvindsv commented 9 years ago

Services

I can't claim to have looked closely at every change, but I've read all that you wrote above and I've tried to understand it. I agree with most of it. I've written down some thoughts below:

About this line: I understand the rationale for stopping scheduling of the existing pipelines from that repo (since the latest of that repo is invalid). But is it what users would expect? I like stopping it, actually, but I'm thinking about others.

Here, doesn't it need to know the material revision? If it doesn't, it'll always need to work with the latest revision. I suppose since it is in the update flow, it will always be the latest that it sees anyway. The callback means that the whole config update is now directly tied to the material update. If a config update is slow, then one of the ten material update listener threads will be busy. Worse, if there are many partials, then since it's a config update, a lock will be obtained somewhere and multiple listener threads will be busy, bringing the rest of the pollers down. What do you think? Do you think it makes sense to give pipeline repo materials their own thread or two to do their update?

I'm not sure CachedGoConfig is the right place to try and merge a config. From its name, I'd expect it to be aware of the latest config (cached) so that it's easily/quickly available. I feel something else should do it, and on a successful merge, should tell the cache to update itself. Not very particular about this, but it seemed like the wrong place.

GoConfigDao is already quite big (maybe only because of the many UpdateConfigCommand implementations in it). I wonder how adding behavior related to the partials will expand it. Will the actual addition of a partial, etc. be the kind of things it knows about? Anything else?

I see that you're using the tracking ID in PartialConfigStatusService. The tracking ID is just an ID for performance logging. It has no guarantees. Don't use that. You can directly use the material itself, or its fingerprint or something else. What are you trying to use that tracking ID for?

I do not think that ScmCheckoutService should work the way it does now. It is not really a service with real dependencies; instead it relies on being called from the top so that it can do its job. I think it is OK to have it like that initially, but in the end MaterialDatabaseUpdater and GoRepoConfigDataSource should both use ScmCheckoutService to get their clean checkout. What do you think?

Tell me more about this. MaterialDatabaseUpdater knows how to update a material picked up off of the queue for polling. ScmMaterialCheckoutService says it "Provides a clean checkout of SCM materials", but it just seems to listen to the event from the ScmMaterialUpdater that it is done processing a material. Are you saying that the ScmMaterialCheckoutService is the one that should be doing what ScmMaterialUpdater is doing and then both MaterialDatabaseUpdater and GoRepoConfigDataSource can use it? If so, shouldn't GoRepoConfigDataSource just use ScmMaterialUpdater directly? They're both inside server module. It just looks like ScmMaterialUpdater is not a public class, and so it cannot yet be used inside GoRepoConfigDataSource till that is changed.

MaterialUpdateListener has a GoRepoConfigDataSource field, which is unused. What is the plan for that?

arvindsv commented 9 years ago

Lazy pipeline-repo-2 comes in and gets parsed. The attempt to merge all parts again succeeds this time, enabling repo-1 and repo-2. You will also see that in the code I have written so far.

Yes, I see it. That makes sense. My worry was that the latest of repo-1 would be tossed. But, it is kept and used. So, that's good and will work.

[Idea about partials declaring dependencies] What do you think of this? I decided against it only because it was too much effort.

I think it is too much effort, and might be too confusing for users. The code can take care of it, the way you've done it now, I feel.

But it would still make the cruise config look like the basic, editable XML only. Basically it would hide the config-repo parts and use them only when necessary.

Alright. Yes, that's the whole deal with the strategy. As long as it does that, I think it's fine. I think you're saying that a "basic" strategy can be injected into the config if it has no partials, and a couple of "merged" ones for the other two cases. They all decide what to return based on what they know. That makes sense.

arvindsv commented 9 years ago

To me, another way of handling the partial is to not change MaterialUpdateListener and MaterialDatabaseUpdater at all. The "pipeline" for a pipeline repo poll should be something like:

In or around MaterialUpdateService:

  1. Is pipeline-repo-1 "in process" right now?
    1. No? Good. Mark it as being processed. Go to step 2.
    2. Yes? Don't do anything. Wait for next poll interval.
  2. Put it on the queue to be checked for changes.
  3. Once it is checked, the onMessage method is called with a MaterialUpdateCompletedMessage. That should tell the MaterialUpdateService whether a change has occurred.
  4. Has a change occurred?
    1. No? Mark pipeline-repo-1 as not being processed. Wait for next poll interval.
    2. Yes? Then, fire the right listeners so that GoRepoConfigDataSource#onCheckoutComplete gets called, and the partial is processed. During this time, the "is being processed" flag is still on, and that repo will not be polled.
  5. Once that is done, the "is being processed" flag will be unset, and we go back to wait for next poll interval.

This means that the material polling is not affected by a slow config save.
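A sketch of that flow, with hypothetical names (the real services differ). The key point is that the in-progress flag stays set across both the poll and the parse, so the same repo is never polled while its partial is being processed:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ConfigRepoPollFlow {
    interface UpdateQueue { void post(String materialFingerprint); }
    interface RepoConfigDataSource { void onCheckoutComplete(String materialFingerprint); }

    private final Set<String> inProgress = ConcurrentHashMap.newKeySet();
    private final UpdateQueue queue;
    private final RepoConfigDataSource repoConfigDataSource;

    ConfigRepoPollFlow(UpdateQueue queue, RepoConfigDataSource repoConfigDataSource) {
        this.queue = queue;
        this.repoConfigDataSource = repoConfigDataSource;
    }

    // Steps 1-2: only enqueue a check if this material is not already in process.
    void onPollInterval(String fingerprint) {
        if (inProgress.add(fingerprint)) {
            queue.post(fingerprint);
        }
    }

    // Steps 3-5: a completion message arrives; parse the partial only on a change.
    // The flag is cleared only after parsing, so the repo is not re-polled meanwhile.
    void onMessage(String fingerprint, boolean changed) {
        try {
            if (changed) {
                repoConfigDataSource.onCheckoutComplete(fingerprint);
            }
        } finally {
            inProgress.remove(fingerprint);
        }
    }
}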

tomzo commented 9 years ago

I can't claim to have looked closely at every change, but I've read all that you wrote above and I've tried to understand it. I agree with most of it. I've written down some thoughts below:

I do not expect you to read line by line; the general discussion is very helpful for me.

I understand the rationale for stopping scheduling of the existing pipelines from that repo (since the latest of that repo is invalid). But, is it what users would expect? I like stopping it actually, but thinking about others.

I planned to do that because, as a user, I would not want an old pipeline configuration built against a new code revision. But if you worry that it is not what all users would want, then maybe we should make it optional. After all, it should not be hard to make it part of the config-repo configuration. Is that what you suggest?

...The callback means that the whole config updation is now directly tied to the material update. If config update is slow, then one of the ten material update listener threads will be busy. Worse, if there are many partials, then since it's a config update, there will be a lock obtained somewhere and multiple listener threads will be busy, bringing the rest of the pollers down. What do you think? Do you think it makes sense to give pipeline repo materials their own thread or two, to do their update?

Yes, this is a little horror case if we consider high load, slow polling, etc. This is why I mentioned the hack in ScmMaterialCheckoutService. Currently I just added this fake service to get the config-repo code running, but now the entire parsing, validation, and even final merging is done in the polling threads. So in the end it would have to be different. And yes, I guess that dedicating a few threads + not actually parsing in the poller threads should do it. There is just quite a lot of work to build that, so I made the fake service for now.

I'm not sure CachedGoConfig is the right place to try and merge a config. From its name, I'd expect it to be aware of the latest config (cached) so that it's easily/quickly available

OK :) The name is confusing, but it is just temporarily wrong; I will rename it to MergeConfigService or something like that. I left the name because all the higher services used to talk to CachedGoConfig, which was a quite small class whose public methods I left almost untouched, while replacing the implementation with a partials-aware one. It still does the same job that CachedGoConfig did. The only difference is that now it talks to partial-related services as well as the old file-related services.

GoConfigDao is already quite big (maybe only because of the many UpdateConfigCommand implementations in it). I wonder how adding behavior related to the partials will expand it. Will the actual addition of a partial, etc. be the kind of things it knows about? Anything else?

Probably it will not require any changes in existing methods. I rather expected it to only add a command for modifying the list of config-repos, but that should not be different from modifying the list of environments. It looks good at the moment. I guess I will have to see during implementation.

I see that you're using the tracking ID in PartialConfigStatusService. The tracking ID is just an ID for performance logging. It has no guarantees. Don't use that. You can directly use the material itself, or its fingerprint or something else. What are you trying to use that tracking ID for?

OK, I will stop. There is another way to get it done, just by listening to material update messages. This was written before you told me that these are for metrics only.

Tell me more about this. MaterialDatabaseUpdater knows how to update a material picked up off of the queue for polling. ScmMaterialCheckoutService says it "Provides a clean checkout of SCM materials", but it just seems to listen to the event from the ScmMaterialUpdater that it is done processing a material. Are you saying that the ScmMaterialCheckoutService is the one that should be doing what ScmMaterialUpdater is doing and then both MaterialDatabaseUpdater and GoRepoConfigDataSource can use it? If so, shouldn't GoRepoConfigDataSource just use ScmMaterialUpdater directly? They're both inside server module. It just looks like ScmMaterialUpdater is not a public class, and so it cannot yet be used inside GoRepoConfigDataSource till that is changed.

IMO MaterialDatabaseUpdater does too much. It even decides where the checkout directory is. Currently MDU is called from the top to update the db, and it creates a checkout for itself using ScmMaterialUpdater or the pluggable updater. All this is done as part of an absolutely private procedure, so it is very hard to plug into the middle of it. Now, with config-repos, there are 2 services that need the most recent checkout when an SCM material is updated. This is a big change in services. I suggest changing a little how MDU operates, so that it would listen for a checkout event, just like GoRepoConfigDataSource would. Each would have its turn to look at the checkout and work with the poller. When they are both done - material update completed. Probably some new service will be needed to listen for the material update request, do the checkout only, and then fire an event for these 2 consumers. I guess that would be ScmCheckoutService. (I do remember that MDU updates non-SCM materials also, so the change would affect only SCMs.)
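A sketch of that split, with hypothetical names: one service does the checkout once, then each registered consumer (the db updater, the config data source) gets its turn before the update counts as completed:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

class CheckoutService {
    interface CheckoutListener {
        void onCheckoutComplete(String materialFingerprint, File checkoutDir);
    }

    private final List<CheckoutListener> listeners = new ArrayList<>();

    void register(CheckoutListener listener) { listeners.add(listener); }

    // One checkout per update request; both consumers take their turn,
    // then the material update completes.
    void update(String materialFingerprint, File checkoutDir) {
        // ... fetch/checkout the SCM material into checkoutDir here ...
        for (CheckoutListener listener : listeners) {
            listener.onCheckoutComplete(materialFingerprint, checkoutDir);
        }
    }
}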

MaterialUpdateListener has a GoRepoConfigDataSource field, which is unused. What is the plan for that?

It is a left-over from my previous hacking. There won't be any changes there.

tomzo commented 9 years ago

I like your idea from the second comment. I will think on this some more. But there is one issue that immediately worries me - so far I concluded that the configuration of some repo-1 must be parsed, validated, and merged before the MaterialUpdateCompletedMessage of that repo-1 is fired. Otherwise some higher services may start to act on the MaterialUpdateCompletedMessage, call something like getPipelineConfig, and get the old configuration.

arvindsv commented 9 years ago

[About stopping pipelines from repo scheduling if latest is invalid] ... After all it should not be hard to make it part of config-repo configuration. Is that what you suggest?

No, on second thoughts, let's leave it that way. We can always change it if the feedback is that it is not intuitive. I think it makes sense to show that the current config is invalid, to the user.

There is just quite a lot of work to build that so I made fake service now.

That's fine. It's better to prove it out, before going down that path.

IMO MaterialDatabaseUpdater does too much ... Probably some new service will be needed to listen for material update request and do checkout only, then fire event for these 2 consumers. I guess that would be ScmCheckoutService

Ah, I understand now. Yes, I agree.

But there is one issue that immediately worries me - so far I concluded that the configuration of some repo-1 must be parsed, validated, and merged before the MaterialUpdateCompletedMessage of that repo-1 is fired. Otherwise some higher services may start to act on the MaterialUpdateCompletedMessage, call something like getPipelineConfig, and get the old configuration.

The only other usage of MaterialUpdateCompletedMessage that I know of is in the scheduler of a pipeline which has been triggered manually. When someone clicks the trigger button of a pipeline, it goes and schedules all materials for a poll and uses the latest of all.

In that case, if it sees the older pipeline config, then that's what it will try to use. That might fail, because the config does not match the material (as we talked about earlier). Otherwise, the code repo and config repo are different and it doesn't matter that it builds (once) with the old config and new code, right? That could have happened anyway, if the config material polled slowly, etc.

I mean, during scheduling, we will take care of scheduling it properly (SCM revision-wise) and even if an upstream service sees the commit before it is in the config, it should not matter. Especially since the only upstream service using it is scheduler.

tomzo commented 9 years ago

Changed the proposed XML schema:

Example:

<cruise>
  <config-repos>
    <!-- plugin name is optional -->
    <config-repo>
      <git url="https://github.com/tomzo/gocd-indep-config-part.git" />
    </config-repo>
    <config-repo plugin="gocd-xml">
      <git url="https://github.com/tomzo/gocd-refmain-config-part.git" />
    </config-repo>
    <config-repo plugin="gocd-xml">
      <git url="https://github.com/tomzo/gocd-refpart-config-part.git" />
      <configuration>
        <property>
          <key>pattern</key>
          <value>*.gocd.xml</value>
        </property>
      </configuration>
    </config-repo>
  </config-repos>
  <!-- the rest of cruise config, nothing new -->
</cruise>

Implemented and tested in d08193a9aa65789724f7b108bfccd45efb7d5384

arvindsv commented 9 years ago

I keep forgetting it is <config-repo> and not <pipeline-repo>. Looks good to me. However, in that commit, there are a couple of missing pieces: the schema version needs to change, and an XSL needs to be present to convert existing configs to this version, etc. Here's a document which talks about it. I just looked at this commit; I don't know if it's done elsewhere.

tomzo commented 9 years ago

I know about migration, but I will do it near the end. It would probably be schema version 76, while you still have 75 to release yet.

arvindsv commented 9 years ago

Ok. You said plugin name is optional. What happens if it is omitted?

tomzo commented 9 years ago

You said plugin name is optional. What happens if it is omitted?

The default embedded plugin is used - the one that expects the current XML style to be in the config repo, like here. I will write it using the magical XML loader, probably using the fromXmlPartial methods.

tomzo commented 9 years ago

I have made many changes recently. I am posting here so that it doesn't get messy and hard to review later. I do not need immediate feedback on this, but feel free to comment if you have some thoughts.

Most of these changes reflect what I drafted last time. Now there is some real test coverage for them.

Commits

There are long messages in almost every commit, so that should help to read through them.

The schema-changes branch is merged into the services branch because I needed it in some tests.

Refactoring in config-api

MergeCruiseConfig is gone, along with its duplicated tests. Here is how the strategy looks today. I also fixed all the other things you mentioned in the first review.

Merged config for edits

Initially I did not return the merged configuration for edit, and I did not allow the merged configuration to be serialized by the magical XML writer. But soon, when doing the DAO tests, I found that each pipeline or environment modification must be passed to the merged configuration, because it must be validated at merged scope. An example:

A critical change that does this - https://github.com/tomzo/gocd/commit/85422ee37419ef65fa3b81dcf41f75cd34d4fc70#diff-56efae43574f3988ca374a9bbe6237cfR85

Now there is a new method on CruiseConfig to get just the local content to be serialized.

CachedGoConfig

I struggled with this service a little bit to make it work. So it is spread across many commits. There is a commit message which summarizes changes well - https://github.com/tomzo/gocd/commit/80706b39783d286184028c6ab3bc673b38f67bf3

GoConfigDao

It is almost untouched; all the work is done below it. So, unlike what I stated previously, it does not need to know about config parts. I still have to add an update command for config-repos, but this is a minor change.

Abstract tests

A notable change in ConfigDao and CachedGoConfig tests:

  1. Make the current test class abstract and rename it to something like BaseTest.
  2. Create a 'basic' test class which extends BaseTest.
  3. Add setup statements so that all cases execute the same as before.
  4. Add a new 'merged' test class where we run all the same cases against merged configurations.

Commits that did this:

IMO these are good tests: mostly because they immediately failed when I added the merged cases, and kept failing until I got the services implemented right; secondly because our expectations towards ConfigDao and CachedGoConfig did not change - we still want them to operate as they did when the config was only in a file, even now that the configuration datastore is file + config repos.
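A sketch of that test layout (JUnit 4 style, hypothetical names; the stub Config interface stands in for the real classes):

import org.junit.Test;
import static org.junit.Assert.assertNotNull;

abstract class CachedGoConfigBaseTest {
    interface Config { Object pipelineNamed(String name); }

    // each subclass decides what configuration datastore backs the test
    protected abstract Config setupConfig();

    @Test
    public void shouldFindPipelineByName() {
        assertNotNull(setupConfig().pipelineNamed("pipe1"));
    }
}

class BasicCachedGoConfigTest extends CachedGoConfigBaseTest {
    protected Config setupConfig() {
        return name -> new Object(); // config from file only, as before
    }
}

class MergedCachedGoConfigTest extends CachedGoConfigBaseTest {
    protected Config setupConfig() {
        return name -> new Object(); // file + config-repo partials merged in
    }
}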

tomzo commented 9 years ago

@arvindsv Initially you mentioned that this feature should be merged in parts:

[...] it'll be easier to see progress and maybe merge smaller bits of code than the whole thing at once.

Do you still think it would be best to merge a few branches over some time - e.g. api first, then schema, then some lower services, then higher, then rails, etc.? Or, since I am merging regularly from master anyway, maybe it could all be merged at once in the end?

I am asking because I do not know if I should still put effort into maintaining many branches.

arvindsv commented 9 years ago

Hi @tomzo. If it is possible, keep it separate. If it's getting too hard, then merge them all in and let's handle it later. If it is separate, it'll be easier to concentrate on smaller parts for reviewing, but I can understand that it can be hard.

tomzo commented 9 years ago

OK. I think I can get it in 3-4 pieces.

tomzo commented 9 years ago

Little announcement

Yesterday I started the 'Go development server' for the first time with a merged configuration accepted as the current, valid config. Pipelines 1, 2, and 3 are actually polled and parsed from https://github.com/tomzo/gocd-indep-config-part. The others are defined in XML.

I have poked around, and there is nothing major broken in this server. Basically, a cruise config with the merged strategy works just fine. There are minor things - e.g. the UI lets you start editing these remote configs, which makes no sense. On the good side, it is also already possible to edit local pipelines, with changes being saved to disk as before.

A little preview:

[screenshot: pipesfromrepo]

Further decisions

Now that I know the general concept is possible at all, I'd like to get this feature finished at a good quality level. There are many issues that we have talked about so far, and I think I already know how to deal with most of them. However, I have a few considerations left and I'd like to hear some opinions on them.

SCM-consistency

This will be the first thing to deal with now. I need to make sure that when a pipeline is scheduled, it has the right revision. First I will definitely extend the logging messages so that we can see the revision and source of a pipeline's configuration. We have already listed a few conditions under which pipelines should not be scheduled; I will not repeat them here. I can implement a service able to tell that, but please help me identify which services I should glue it to. From my perspective, implementing this comes down to one question - where is the right place to say NO when a pipeline is about to be scheduled? Is it SchedulingCheckerService? It seems like there are already a few reasons there not to schedule; I could just add a few more. @jyotisingh maybe you could point out a few lines.
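A sketch of the kind of check this could be, wherever it ends up living - all names here are hypothetical:

class ConfigScmConsistencyCheck {
    interface PipelineFromRepo {
        boolean definedInConfigRepo();
        boolean latestPartialIsValid();
        String configOriginRevision(); // revision the pipeline's config was parsed at
    }

    // Returns false (the "NO") when scheduling would violate config-SCM consistency.
    boolean canSchedule(PipelineFromRepo pipeline, String revisionToBuild,
                        boolean configRepoIsAlsoPipelineMaterial) {
        if (!pipeline.definedInConfigRepo()) {
            return true;                          // plain XML pipeline, as before
        }
        if (!pipeline.latestPartialIsValid()) {
            return false;                         // failed parsing: hold scheduling
        }
        if (configRepoIsAlsoPipelineMaterial) {
            // config must come from the same commit that is about to be built
            return pipeline.configOriginRevision().equals(revisionToBuild);
        }
        return true;
    }
}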

Merge on timer

@arvindsv what do you think about this: so far there has been a timer in CachedGoConfig, and when there were changes on disk an attempt was made to import them. We could extend this approach - let's just periodically check if there is any change in GoRepoConfigDataSource. There would be some threads (from MDU or dedicated) that would independently parse each repository, leaving the result in GoRepoConfigDataSource. This would significantly reduce the number of attempts to merge a new configuration, as opposed to any event-driven implementation where each partial config update causes another merge attempt.
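A sketch of that timer-driven approach, with hypothetical names and an assumed 10-second interval:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class MergeOnTimer {
    interface RepoConfigDataSource { boolean hasChangesSinceLastMerge(); }
    interface Merger { void attemptMergeOfAllKnownPartials(); }

    void start(RepoConfigDataSource dataSource, Merger merger) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(() -> {
            // one merge attempt covers however many partials changed since last tick
            if (dataSource.hasChangesSinceLastMerge()) {
                merger.attemptMergeOfAllKnownPartials();
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}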

Threads to poll, parse and merge

Some facts

My current favorite approach is to:

I also like the idea that parsing should be done in the MaterialUpdateService, in the onMessage method. But that would mean only one thread to parse the config repositories. (Is it like that - does one listener have just one thread taking MaterialUpdateCompletedMessage from the queue?)

Which one do you prefer?

Pull requests

I have config-api changes almost ready. I'd like to do some more implementation higher to be sure that this is all what is needed. Then I will start making many PRs with what I consider to be working and worth looking at.

arvindsv commented 9 years ago

Little announcement

He he. That's a "little" understatement. I think it's a huge achievement, especially having seen some of the work that has gone into it. Congratulations!

Question about SCM consistency and scheduling

I think @jyotisingh and @srinivasupadhya can help better. But, I think it'll have to be around BuildCauseProducerService (after simplifying a method there) or around ScheduleService. My guess is around here, after simplifying that method.

Merge on timer

Yes, I think that should work. In a sense, it is the same as the MDU polling timer. We still need to handle long-running updates, etc. But I like the separation of the code materials and config materials (code material polling not slowing down config polling). Since the materials can be the same, though, there needs to be synchronization between the two: we cannot be polling material 1 "for code" while the config poller is polling the same material for config.

Some facts

The poller thread does not have to parse the config directory during the material update. It can finish like now, leaving a clean checkout for use. Then it is only a matter of waiting a moment in the MaterialUpdateService before taking the material out of the InProgress state.

I was thinking about this ... and there might be a problem if it takes too long.

Time T1: Config poller polls
Time T2: Config poller is done, leaves in clean state
Time T3: Commit happens
Time T4: Config is being parsed (a little slowly)
Time T5: Material poller polls same material
Time T6: Material poller updates material on filesystem (same one being parsed)

That could be an issue, right?

However, I see that you say "Then it is only a matter of waiting a moment in the MaterialUpdateService before taking the material out of the InProgress state". From that, I understand you're saying that this should happen in onMessage (as you mention later). I think that should work. I don't think there's only one thread for it, but you should check, I feel.

tomzo commented 9 years ago

He he. That's a "little" understatement. I think it's a huge achievement, especially having seen some of the lots of work that has gone into it. Congratulations!

Thanks. I'm glad you like it.

line in BuildCauseProducerService

I have already seen this method in BuildCauseProducerService. I am most confused about how the build cause producers, scheduler, and checker services interact with each other. There was a nice doc on the MDU, but no such thing for these.

it is the same as the MDU polling timer.

Good point. I was thinking about separate timer, but this sounds much better.

But, I like the separation of the code materials and config material (code material polling not slowing down config polling). Since the materials can be same though, there needs to be synchronization between the two. Cannot be polling material 1 "for code", while the config poller is polling the same material for config.

I definitely agree with the last sentence - config pollers and code pollers cannot both be responsible for the same repo. I still plan to solve that by just adding config repos next to the standard SCM materials, so that they participate in the material update queue and the whole MDU process. But this makes them equal in terms of speed. I don't feel that strongly about material polling slowing down new configs - as a user I have never seen Go performance statistics, or had so many SCM materials that they overloaded the server. I'll take your word that we need config polling to work faster than plain code polling.

So we should have some sort of prioritization of materials, so that config repos are polled first and fast. Obviously, if they were all in the same queue and all workers were the same, we could do nothing about it. Perhaps this would help:

Do you think that would be enough? This might not help with config pollers being faster (the unloading thread can still be slow).

Another option would be something like this - let's create 25 workers instead of the current 10, but 15 of them can unload only materials with config. If they have no config repo to unload, they sleep.

As far as I understand, the code around here is the place where the 10 listener threads and MDU workers are allocated to unload the material update queue. So that would become something more specialized from now on.

That could be an issue, right?

I meant waiting a while until the configuration parsing is done - not just sleeping - so that the material update is in progress until we have the partial config object. So no such issue, IMO. It is based on what you proposed here.

arvindsv commented 9 years ago

I have seen this method in BuildCauseProducerService already. I am most confused about how these build cause producers, scheduler and checker services interact with each other. There was a nice doc on the MDU, but no such thing for them.

I'll write about this over the weekend and put it here. Will try to do it later today, itself.

... I'll take your word for this that we need config polling to work faster than just plain code polling.

Ideally:

  1. 10 material poller threads are available for polling
  2. When a config material is being polled, and until it is parsed (but not necessarily merged), another poll for the same material should not happen - either as config or code. That material should not be removed from the inProgress queue till the plugin is given a chance to parse it.
  3. Ideally, while parsing is happening, one material poller thread is not used. Meaning, 10 are still available. But, the point above still applies.

Do you agree with that? If so, then some kind of queue priority might be in order.

About your idea of 15 config-only threads: that means a different queue, that's all, right? There is no relation to the material queue any more. A material can only be in one of the queues; each has its own threads and its own onMessage handler. I think this approach is better than the other one, where the listeners have to decide which queue to unload first - that one can also cause starvation of the materials queue if the config queue is always full.
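A sketch of that two-queue layout, with hypothetical names:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class SeparateQueues {
    // a material lives in exactly one of these; each queue gets its own
    // worker threads and its own onMessage handler
    final BlockingQueue<String> materialQueue = new LinkedBlockingQueue<>();       // code
    final BlockingQueue<String> configMaterialQueue = new LinkedBlockingQueue<>(); // config repos

    void enqueue(String materialFingerprint, boolean isConfigRepo) {
        (isConfigRepo ? configMaterialQueue : materialQueue).offer(materialFingerprint);
    }
}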

arvindsv commented 9 years ago

Just for the record, we talked about my previous post here.

tomzo commented 9 years ago

I already have a quite sensible implementation of all services and behaviors, with some integration tests. I think this might help people reviewing the code in all the pull requests.

config-api changes are described in #1276

Here is a summary of new services layout:

Below GoConfigService

Services which I added but removed in the end (please ignore them):

How does it work?

These services, on top of the new API, make it possible to return a merged cruise config to higher services, mostly without them knowing.

Main point

The best analogy to get the whole point here is that MergeGoConfig has replaced the old CachedGoConfig. It used to be that CachedGoConfig had 2 instances of the configuration in memory (for edit, and the current config). Now MergeGoConfig has these two, but the main difference is that MergeGoConfig may return a merged configuration as the current config or for edit. If there are no extra configuration parts, it returns the main configuration.
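A sketch of that analogy, with hypothetical names (Config, Partial, Merger are stand-ins):

import java.util.List;

class MergeGoConfigSketch {
    interface Config {}
    interface Partial {}
    interface Merger { Config merge(Config main, List<Partial> partials); }

    private final Config mainConfig;      // what CachedGoConfig used to hold
    private final List<Partial> partials; // parsed config-repo parts
    private final Merger merger;

    MergeGoConfigSketch(Config mainConfig, List<Partial> partials, Merger merger) {
        this.mainConfig = mainConfig;
        this.partials = partials;
        this.merger = merger;
    }

    Config currentConfig() {
        return partials.isEmpty() ? mainConfig : merger.merge(mainConfig, partials);
    }

    Config configForEdit() {
        // merged when partials exist, so edits are validated at merged scope
        return currentConfig();
    }
}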

Handling edits

A merged cruise config is returned for edits. When some service is editing the config, it does not know whether the config is merged or not. It does not have to know.

Adding

When a call to add a pipeline or environment is made, it reaches the merged cruise config at some point. The merged config is aware that we meant to add to the main part, and it changes the main config instance (inside the merged cruise config instance).

Removing

Removing is like adding: we can localize where to remove from. If a user tries to remove a remote element, it fails - usually in the cruise config code.

Modifications

Modifications get complex because there are many ways in which they are introduced. This is where there is a real benefit from returning the merged config instance: changes are made on the config instance in the full merged context, so that anything invalid that is attempted will throw - e.g. trying to change the name of a pipeline group defined remotely.

Saving changes

Each config edit ends with an attempt to save some config-for-edit instance (or a deep clone of it, or a clone of a clone, etc.). To deal with that, the magical writer is aware of the possibility that a merged config might be passed in to be serialized. If so, it takes out only the locally defined configuration elements. The actual extraction of local elements is implemented in config-api, and it is very easy because we keep and maintain the main configuration instance inside the merged config anyway.
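A sketch of that save path, with hypothetical names (localPartOnly stands in for the new method on CruiseConfig mentioned above):

class LocalOnlySerialization {
    interface Config {
        boolean isMerged();
        Config localPartOnly(); // extract just the locally defined elements
    }

    String serialize(Config config) {
        Config toWrite = config.isMerged() ? config.localPartOnly() : config;
        return toXml(toWrite);
    }

    private String toXml(Config config) {
        return "<cruise>...</cruise>"; // stand-in for the real XML serialization
    }
}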

Commits with low services

Here is a comparison where the new schema and low services are implemented - https://github.com/tomzo/gocd/compare/1133-api-validations-rails-pr...tomzo:1133-services (the new schema is irrelevant to this, but that's the best comparison at the moment).

Work status

I have well-covered test cases for backwards compatibility - meaning all test cases that ran on the old services now pass on the new services. But these cases have no remote configuration repositories. I have added (too) few cases where a configuration repository is defined.

TODO

Tests, tests, tests... We should have unit and integration tests where there are both remote and local configuration sources and both are populated. There may be references between parts, and from part -> main. In such conditions we should assert things like

[higher services will be next comment]

tomzo commented 9 years ago

Second materials queue and higher services

This is implemented mostly as we discussed here.

New material update queue

Unloading queues

ConfigMaterialUpdater - new component

Added a new component - ConfigMaterialUpdater - which listens on the config-material-update-completed topic. So when MDU is done, ConfigMaterialUpdater gets its chance to work with the material being updated:

Commits

https://github.com/tomzo/gocd/compare/1133-minimal...tomzo:1133-pollers

Work status

This is well implemented, with unit and integration tests.