grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0
4.11k stars 526 forks source link

Alertmanager local storage templates support #5352

Open antonio-tolentino opened 1 year ago

antonio-tolentino commented 1 year ago

Alertmanager local storage implementation does not support external template files.

I have included some lines of code for pkg/alertmanager/alertstore/local/store.go to support this feature. Please find the code suggested marked by "BEGIN CUSTOM" and "END CUSTOM" comments

func (f *Store) reloadConfigs() (map[string]alertspb.AlertConfigDesc, error) {

    configs := map[string]alertspb.AlertConfigDesc{}
    err := filepath.Walk(f.cfg.Path, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return errors.Wrapf(err, "unable to walk file path at %s", path)
        }

        // Ignore files that are directories or not yaml files
        ext := filepath.Ext(info.Name())
        if info.IsDir() || (ext != ".yml" && ext != ".yaml") {
            return nil
        }

        // Ensure the file is a valid Alertmanager Config.
        _, err = config.LoadFile(path)
        if err != nil {
            return errors.Wrapf(err, "unable to load alertmanager config %s", path)
        }

        // ############## BEGIN CUSTOM ##############
        // Load all template files from user templates dir (DATA_DIR/anonymous/templates)
        templates := []*alertspb.TemplateDesc{}
        templatesDir := strings.TrimSuffix(path, info.Name()) + "templates"
        templateFiles, err := ioutil.ReadDir(templatesDir)
        if err != nil {
            return errors.Wrapf(err, "unable to read alertmanager templates from directory error: %s", err)
        }

        for _, f := range templateFiles {
            templateFileName := templatesDir + "/" + f.Name()
            templateContent, err := os.ReadFile(templateFileName)
            if err != nil {
                return errors.Wrapf(err, "unable to read alertmanager template file %s", templatesDir+"/"+f.Name())
            }
            templateDesc := alertspb.TemplateDesc{
                Filename: f.Name(),
                Body:     string(templateContent),
            }

            templates = append(templates, &templateDesc)
        }
        // ############## END CUSTOM ##############

        // Load the file to be returned by the store.
        content, err := os.ReadFile(path)
        if err != nil {
            return errors.Wrapf(err, "unable to read alertmanager config %s", path)
        }

        // The file name must correspond to the user tenant ID
        user := strings.TrimSuffix(info.Name(), ext)

        configs[user] = alertspb.AlertConfigDesc{
            User:      user,
            RawConfig: string(content),
            // ############## BEGIN CUSTOM ##############
            Templates: templates,
            // ############## END CUSTOM ##############
        }
        return nil
    })

    return configs, err
}

As alternative I am using inline templates, but when the templates are a bit more complex it is quite difficult to maintain them.

Code tested on top of version 2.9.0

antonio-tolentino commented 1 year ago

Hi @dimitarvdimitrov,

Do you think it is possible to include these pieces of code to support templates when Alertmanager is configured to use local Storage ?

dimitarvdimitrov commented 1 year ago

That makes sense to me. But someone from @grafana/mimir-ruler-and-alertmanager-maintainers would be better positioned to make this call. Steve, Josh, can you take a look at this proposal?

stevesg commented 1 year ago

I see the motivation for it, if constrained to the "local" store type. It would be a nice usability win for new users who want to test Mimir in a development or single-tenant environment.

That being said, I'm in two minds about it. My concerns would be:

  1. We have considered removing the local store type, as it goes a little against what the Mimir Alertmanager solves, that is, to be a multi-tenant Alertmanager.
    • We assume that users (tenants) do not have access to the deployment of Mimir (e.g. the filesystem).
    • Alertmanager (and other Mimir components, e.g. the ruler) have user-facing APIs to configure them.
  2. With this change, the two types of configuration file will be slightly incompatible. e.g. You could not migrate a local configuration which reads templates from files, to an object-store based configuration, which does not. However, this problem is mitigated somewhat in mimirtool, as when uploading an Alertmanager configuration, it reads templates from individual files.

I think I'd be happy with the change so long as the file structure supported by the local store type is identical to that supported by mimirtool. Therefore, you could either point Mimir directly at disk, or upload the same configuration with mimirtool.

As an alternative, what about an option to mimirtool to do the "inlining"; it would emit the full configuration file. That takes away from the ease of use aspect though.

antonio-tolentino commented 1 year ago

Hi @stevesg,

First of all, thank you very much for your clarifications about Alermanager and concerns.

Just to clarify a bit more about my environment, we are trying to adopt not only Mimir, but the whole Grafana stack.

At this moment the solution is being deployed in single tenant in all environments and all of the configurations are based in Kubernetes configmaps that are injected in pods through sidecar.

If we had this resource it would be great in order to avoid the creation of inline templates and also to inject them as volumes, while we don't have a multi-tenant environment.

Questions Do you intend to remove local storage support soon ?

stevesg commented 1 year ago

Do you intend to remove local storage support soon ?

I don't think so - there appear to be many users that do as you want, put the configuration in a ConfigMap and mount it directly into Mimir.

Would you like to attempt creating a PR for this? What you have is essentially complete, we just need to write a unit test around it. I'd be happy to help guide you on the right path. I'd suggest copying this test:

https://github.com/grafana/mimir/blob/394d9bbbd7f5b7850c15cb146c4861b56e458a4b/pkg/alertmanager/alertstore/local/store_test.go#L48-L73

And creating a new test called something like TestStore_GetAlertConfigWithExternalTemplates, for example.

antonio-tolentino commented 1 year ago

Hi Steve, That sounds great! I will right the unit test code and as soon as I have this working I will return to you and you help me with PR process.

Thanks again!