ckbedwell opened 4 months ago
https://github.com/grafana/synthetic-monitoring-app/pull/911
And extracting the relevant write-up:
Synthetic Monitoring relies on two supporting databases to work at full capacity: a Grafana Cloud-hosted prometheus database and a Grafana Cloud-hosted Loki database.
A Grafana Cloud instance automatically provisions two datasource 'connectors' which connect to each respectively, confusingly referred to as a Prometheus Datasource and a Loki Datasource. In theory these should all remain stable: they are named predictably and are immutable.
The problem with the plugin is that it is based on certain assumptions which may be untrue in some situations (it is worth noting these situations are rare, but they do and will continue to exist, and some clients specifically ask to bypass these assumptions):
- That the Synthetic Monitoring tenant is using the default set-up
  - This assumption becomes incorrect if someone updates their Synthetic Monitoring database provisioning file
- That the provisioning `jsonData` for the Synthetic Monitoring Datasource is the source of truth for these connections
  - This assumption becomes incorrect if someone has utilised the SM API `/api/v1/tenant/update` route to reassign the instances it associates with
  - This assumption also becomes incorrect if the default provisioned datasources have been renamed in some way (i.e. by altering their provisioning files)
The next problem the Synthetic Monitoring plugin faces is that the metrics and logs APIs are queried via their datasources within Grafana. We query datasources by their UIDs; however, the SM tenant endpoint (which is the real source of truth) does not concern itself with datasources but with Grafana Cloud-hosted database instances.
The API can provide the `hostedId` of the database instances it is writing to, but it has no knowledge of which datasources a Grafana Cloud instance has available to query against. If we do not wish to rely on the assumptions above to know which datasources we can utilise, it is currently up to the plugin to ask the SM tenant endpoint for the hostedIds and then search through a user's available datasources for matches (in this case `basicAuthUser === hostedId`). It is worth noting this may not always be reliable, because a user may have access to multiple datasources which share the same hostedId but have different access policies.
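To make the matching step above concrete, here is a minimal sketch. The shapes of the datasource settings and the tenant response are assumptions for illustration, not the real Grafana or SM API types:

```typescript
// Hypothetical, simplified shapes -- the real Grafana datasource settings
// and SM tenant response carry many more fields.
interface DataSourceSettings {
  uid: string;
  type: string; // e.g. 'prometheus' | 'loki'
  basicAuthUser?: string;
}

interface TenantInstances {
  metricsHostedId: number;
  logsHostedId: number;
}

// Search the user's available datasources for entries whose basicAuthUser
// matches a hostedId reported by the SM tenant endpoint. Arrays are
// returned because the match is not guaranteed to be unique.
function findDatasourcesByHostedId(
  datasources: DataSourceSettings[],
  tenant: TenantInstances
): { metrics: DataSourceSettings[]; logs: DataSourceSettings[] } {
  const matches = (type: string, hostedId: number) =>
    datasources.filter(
      (ds) => ds.type === type && ds.basicAuthUser === String(hostedId)
    );

  return {
    metrics: matches('prometheus', tenant.metricsHostedId),
    logs: matches('loki', tenant.logsHostedId),
  };
}
```

The arrays are deliberate: since several datasources can point at the same hosted instance, a unique match cannot be assumed.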
Note that that is not enough.
A single database (Mimir, Loki) instance can be configured more than once as a datasource, using the same user. Each datasource can be given different access permissions. The reason why you might want to do this is LBAC: if you provide each datasource with a different token, each datasource might have access to a different portion of the data.
I was looking for the ticket that mentions this, but it's eluding me. We need to add a dropdown to the SM dashboards allowing the user to select the datasources they want to use, based on what they have access to. For the majority of users this will be a single one; for a smaller subset, it might be multiple. In that situation the problem is how to match the correct Mimir instance with the correct Loki instance. For our internal use case we shrug it off and say "it's up to you to make sure it matches" (we have many, many dashboards where this happens).
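To illustrate the ambiguity, here is a sketch with hypothetical data: two Loki datasources backed by the same hosted instance (same `basicAuthUser`/hostedId), each intended for a different LBAC token scope, so a hostedId lookup alone cannot choose between them:

```typescript
// Hypothetical example data: both datasources write to the same hosted Loki
// instance, so they share a basicAuthUser (hostedId), but each would carry a
// different LBAC token and therefore see a different slice of the data.
const lokiDatasources = [
  { uid: 'loki-team-a', type: 'loki', basicAuthUser: '123456' },
  { uid: 'loki-team-b', type: 'loki', basicAuthUser: '123456' },
];

// Group candidate datasources by hostedId. Any group with more than one
// entry is ambiguous and would need user input (the proposed dropdown).
const byHostedId = new Map<string, string[]>();
for (const ds of lokiDatasources) {
  const uids = byHostedId.get(ds.basicAuthUser) ?? [];
  uids.push(ds.uid);
  byHostedId.set(ds.basicAuthUser, uids);
}

const ambiguous = [...byHostedId.values()].filter((uids) => uids.length > 1);
```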
This is a broad ticket and may be best approached by dividing it up into sub-tasks. Part of the work will be investigating the best solution and working out a manageable way to approach the problem, please feel free to create new tickets / subtasks as appropriate when working on a solution.
Problem
The SM plugin relies on three datasources to work correctly. There have recently been two tickets relating to issues caused when an instance has multiple Loki datasources set up in a particular fashion. Whilst investigating the issues in the codebase, it became apparent there is no single point of reference which should definitively be used to retrieve each datasource.
Some places use `api.getLogsDS`, some places use `meta.jsonData.logs` from the meta object provided by the `InstanceProvider`, some places utilise the `logsInstanceName` also provided by the `InstanceProvider` (which looks to be just a duplication of what we get from `meta.jsonData.logs.grafanaName`), and Scenes use `instance.logs.uid` to pass to the query objects. It is confusing logic to follow and decipher, and I believe it is a source of the problem for the two bugs listed below.

Related tickets: https://github.com/grafana/synthetic-monitoring-app/issues/844 https://github.com/grafana/synthetic-monitoring-app/issues/845
We also have many type assertions across the codebase relating to what is provided by the `InstanceProvider`, which it would be nice to address.
Completion criteria
Additional context
Some groundwork might be necessary in researching what solutions / assumptions have been made previously to support on-prem and Cloud customers when it comes to the three datasources.
I am especially interested in cleaning up whatever is going on in `useAppInitializer` with regard to manually looking up and matching datasources. The logic in this hook is very opaque and difficult to follow, and it is unclear why it is even needed if an app provisioning plugin file is provided which explicitly references the datasources the app is to use. Is there a situation where the plugin provisioning file doesn't exist? Do we have to create a solution where users can choose which DS instances they want to initialise the SM plugin with?