grafana / synthetic-monitoring-app

Synthetic Monitoring frontend application
https://grafana.com/docs/grafana-cloud/how-do-i/synthetic-monitoring/
GNU Affero General Public License v3.0

Tech debt: Clean up and unify the way we utilize the three necessary datasources across the plugin #846

Open ckbedwell opened 2 months ago

ckbedwell commented 2 months ago

This is a broad ticket and may be best approached by dividing it up into sub-tasks. Part of the work will be investigating the best solution and working out a manageable way to approach the problem; please feel free to create new tickets / sub-tasks as appropriate while working on a solution.

Problem

The SM plugin relies on three datasources working correctly:

  • The Synthetic Monitoring datasource itself
  • A Prometheus datasource (metrics)
  • A Loki datasource (logs)

Recently there have been two tickets related to issues caused when an instance has multiple Loki datasources set up in a particular fashion. Whilst investigating and researching the issues in the codebase it became apparent that there is no single point of reference which should definitively be used to retrieve each datasource.

Some places use api.getLogsDS, some use meta.jsonData.logs from the meta object provided by the InstanceProvider, and some utilise the logsInstanceName (also provided by the InstanceProvider), which looks to be a duplicate of meta.jsonData.logs.grafanaName; Scenes uses instance.logs.uid to pass to its query objects. This logic is confusing to follow and decipher, and I believe it is a source of the problem for the two bugs listed below.
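To illustrate the direction, a unified accessor could look something like the sketch below. This is a hypothetical getSMDatasources helper, not the existing API; the name, the uid-based lookup, and where those uids come from are all assumptions:

```ts
import { DataSourceInstanceSettings } from '@grafana/data';
import { getDataSourceSrv } from '@grafana/runtime';

// Hypothetical single point of reference for the three datasources.
// Assumes the plugin already knows the uids (e.g. from its provisioned jsonData).
export interface SMDatasources {
  sm: DataSourceInstanceSettings;
  metrics: DataSourceInstanceSettings;
  logs: DataSourceInstanceSettings;
}

export function getSMDatasources(uids: { sm: string; metrics: string; logs: string }): SMDatasources {
  const srv = getDataSourceSrv();
  const sm = srv.getInstanceSettings(uids.sm);
  const metrics = srv.getInstanceSettings(uids.metrics);
  const logs = srv.getInstanceSettings(uids.logs);

  if (!sm || !metrics || !logs) {
    throw new Error('One of the SM, metrics or logs datasources could not be found');
  }

  return { sm, metrics, logs };
}
```

Everything outside of Scenes would then consume this one function (and Scenes could be fed the same uids) rather than reaching into the InstanceProvider in different ways.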

Related tickets: https://github.com/grafana/synthetic-monitoring-app/issues/844 https://github.com/grafana/synthetic-monitoring-app/issues/845

We also have many type assertions across the codebase relating to what the InstanceProvider provides, which it would be nice to address.

Completion criteria

  1. Abstract and unify how we look up each datasource and utilize its provided API
  2. Ensure everywhere across the app consistently uses the same abstracted method (Scenes might have to continue to do its own thing)
  3. Remove the type assertions related to the InstanceProvider across the application to increase DX and developer confidence.
  4. Appropriate tests have been added

Additional context

Some groundwork might be necessary to research what solutions / assumptions have previously been made to support on-prem and Cloud customers when it comes to the three datasources.

I am especially interested in cleaning up whatever is going on in useAppInitializer with regard to manually looking up and matching datasources. The logic in this hook is very opaque and difficult to follow, and it is unclear why it is even needed if an app provisioning file is provided which explicitly references the datasources the app is to use. Is there a situation where the plugin provisioning file doesn't exist? Do we have to create a solution where users can choose which DS instances they want to initialise the SM plugin with?
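For comparison, if the provisioning file is trusted as the source of the datasource names, the lookup could be as direct as the sketch below (the jsonData shape and the function name are assumptions for illustration, not the current implementation):

```ts
import { DataSourceApi } from '@grafana/data';
import { getDataSourceSrv } from '@grafana/runtime';

// Assumed shape of the provisioned jsonData referenced above
// (meta.jsonData.logs / meta.jsonData.metrics); field names are illustrative.
interface ProvisionedInstance {
  grafanaName: string;
  hostedId: number;
}

interface SMAppJsonData {
  logs: ProvisionedInstance;
  metrics: ProvisionedInstance;
}

// Resolve the provisioned datasources by name instead of searching and
// matching manually as useAppInitializer currently does.
export async function resolveProvisionedDatasources(
  jsonData: SMAppJsonData
): Promise<{ logs: DataSourceApi; metrics: DataSourceApi }> {
  const [logs, metrics] = await Promise.all([
    getDataSourceSrv().get(jsonData.logs.grafanaName),
    getDataSourceSrv().get(jsonData.metrics.grafanaName),
  ]);

  return { logs, metrics };
}
```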

ckbedwell commented 3 weeks ago

https://github.com/grafana/synthetic-monitoring-app/pull/911

And extracting the relevant write-up:

Synthetic Monitoring relies on two supporting databases to work at full capacity: a Grafana Cloud-hosted Prometheus database and a Grafana Cloud-hosted Loki database.

A Grafana Cloud instance automatically provisions two datasource 'connectors', one connecting to each respectively, confusingly referred to as a Prometheus datasource and a Loki datasource. In theory these should remain stable, are named predictably, and are immutable.
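For reference, those connectors can be enumerated through the runtime datasource service. A minimal sketch of the 'default set-up' assumption (exactly one of each type, carrying the predictable provisioned name):

```ts
import { getDataSourceSrv } from '@grafana/runtime';

// Under the default assumption there is exactly one Loki and one Prometheus
// connector per stack, so listing by type is enough to find them.
const lokiDatasources = getDataSourceSrv().getList({ type: 'loki' });
const prometheusDatasources = getDataSourceSrv().getList({ type: 'prometheus' });
```

The rest of this write-up is about the situations where that assumption does not hold.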

The problem with the plugin is that it is based on certain assumptions which may be untrue in some situations (it is worth noting these situations are rare, but they do and will continue to exist, and some clients specifically ask to bypass these assumptions):

  • That the Synthetic Monitoring tenant is using the default set-up
    • This assumption becomes incorrect if someone updates their Synthetic Monitoring database provisioning file
  • That the provisioning jsonData for the Synthetic Monitoring datasource is the source of truth for these connections.
    • This assumption becomes incorrect if someone has utilised the SM API's /api/v1/tenant/update route to reassign which instances the tenant is associated with
    • This assumption also becomes incorrect if the default provisioned datasources have been renamed in some way (i.e. by altering their provisioning files)

The next problem the Synthetic Monitoring plugin faces is that the metrics and logs APIs are queried via their datasources within Grafana. We query datasources by their UIDs; however, the SM tenant endpoint (which is the real source of truth) does not concern itself with datasources but with Grafana Cloud-hosted database instances.

The API can provide the hostedId of the database instances it is writing to, but it has no knowledge of the datasources a Grafana Cloud instance uses to query against them.

If we do not wish to rely on the assumptions above to know which datasources we can utilise, it is currently up to the plugin to ask the SM tenant endpoint for the hostedIds and then search through the user's available datasources for matches (in this case the basicAuthUser === hostedId). It is worth noting this may not always be reliable because a user may have access to multiple datasources which share the same hostedId but have different access policies.
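A rough sketch of that matching, assuming the tenant endpoint has already told us the hostedIds (the /api/datasources call and the filtering are assumptions based on the description above, not verified code):

```ts
import { DataSourceSettings } from '@grafana/data';
import { getBackendSrv } from '@grafana/runtime';

// Find every datasource whose basicAuthUser matches a hosted instance id.
// This can return more than one match, because several datasources may share
// the same hostedId while having different access policies.
async function findDatasourcesForHostedId(hostedId: number): Promise<DataSourceSettings[]> {
  const datasources: DataSourceSettings[] = await getBackendSrv().get('/api/datasources');

  return datasources.filter((ds) => ds.basicAuthUser === String(hostedId));
}
```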

mem commented 1 week ago

(in this case the basicAuthUser === hostedId)

Note that that is not enough.

A single database (Mimir, Loki) instance can be configured more than once as a datasource, using the same user. Each datasource can be given different access permissions. The reason why you might want to do this is LBAC: if you provide each datasource with a different token, each datasource might have access to a different portion of the data.

I was looking for the ticket that mentions this, but it's eluding me. We need to add a drop-down to the SM dashboards allowing the user to select the datasources they want to use, based on what they have access to. For the majority of users this will be a single one. For a smaller subset, this might be multiple ones. In that situation the problem is how to match the correct Mimir instance with the correct Loki instance. For our internal use case we shrug it off and say "it's up to you to make sure it matches" (we have many, many dashboards where this happens).
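Since the app's dashboards are built with Scenes, that drop-down could presumably be a pair of datasource-type variables. A minimal sketch, assuming @grafana/scenes' DataSourceVariable is used (the variable names are illustrative):

```ts
import { DataSourceVariable, SceneVariableSet } from '@grafana/scenes';

// Let the user pick which Loki and Prometheus/Mimir datasources the
// SM dashboards should query, based on what they have access to.
const datasourcePickers = new SceneVariableSet({
  variables: [
    new DataSourceVariable({ name: 'logsDS', pluginId: 'loki' }),
    new DataSourceVariable({ name: 'metricsDS', pluginId: 'prometheus' }),
  ],
});
```

Keeping the two selections consistent with each other would still be up to the user, as described above.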