grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Mark all alternative stores but TSDB as deprecated #9105

Closed wardbekker closed 1 year ago

wardbekker commented 1 year ago

Describe the bug

Object storage with TSDB (also called single store) is the recommended default going forward with 2.8+. The docs still contain many references to Cassandra/Bigtable/DynamoDB/BoltDB that might start a new Grafana Loki user off on the wrong foot. I recommend explicitly marking all references to those as legacy/deprecated in the docs to remove any confusion.

periklis commented 1 year ago

My 2 cents: mark anything but boltdb-shipper as deprecated :heart:

wardbekker commented 1 year ago

@periklis just to clarify, you prefer both the TSDB index and the boltdb-shipper to be marked supported, and the rest deprecated?

wardbekker commented 1 year ago

btw. @JStickler I'm planning on creating a PR for this.

periklis commented 1 year ago

> @periklis just to clarify, you prefer both the TSDB index and the boltdb-shipper to be marked supported, and the rest deprecated?

Yes, that is my intention, since both will need to run in parallel for some installations out there. At least until 3.0, it wouldn't hurt to keep both stores supported.
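
To make the "run in parallel" point concrete: Loki's `schema_config` accepts a list of period entries, each with a `from` date, so an older boltdb-shipper period and a newer TSDB period can coexist in one configuration. The Python sketch below only illustrates that selection idea and is not Loki's implementation. The dates, the `s3` object store, the index prefixes, and the helper names `active_period` and `stores_for_range` are all assumptions made up for the example.

```python
from datetime import date

# Hypothetical period entries mirroring the shape of Loki's
# `schema_config.configs`. Dates, object store, and prefixes are
# illustration values, not recommendations.
SCHEMA_PERIODS = [
    {
        "from": date(2022, 1, 1),
        "store": "boltdb-shipper",
        "object_store": "s3",
        "schema": "v11",
        "index": {"prefix": "index_", "period": "24h"},
    },
    {
        "from": date(2023, 7, 1),
        "store": "tsdb",
        "object_store": "s3",
        "schema": "v12",
        "index": {"prefix": "tsdb_index_", "period": "24h"},
    },
]


def active_period(periods, day):
    """Return the entry with the latest `from` date that is not after `day`."""
    candidates = [p for p in periods if p["from"] <= day]
    if not candidates:
        raise ValueError(f"no schema period covers {day}")
    return max(candidates, key=lambda p: p["from"])


def stores_for_range(periods, start, end):
    """List the index stores a query over [start, end] would have to read."""
    if start > end:
        raise ValueError("start must not be after end")
    ordered = sorted(periods, key=lambda p: p["from"])
    # The period already active at `start`, plus every period that
    # begins later but still inside the queried range.
    touched = [active_period(ordered, start)]
    touched += [p for p in ordered if start < p["from"] <= end]
    return [p["store"] for p in touched]


if __name__ == "__main__":
    print(stores_for_range(SCHEMA_PERIODS, date(2023, 6, 25), date(2023, 7, 5)))
```

Under these assumed dates, a query spanning late June to early July 2023 touches both index stores, which is why both would need to stay supported until the older period ages out of retention.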

timansky commented 1 year ago

If TSDB is now the primary supported store, I couldn't find any TSDB configuration in the Tanka or Helm installations (neither in the docs nor in the config/values files).

brophyja commented 1 year ago

As a new Loki user, I can confirm that the list of documented back-end options is confusing, even if someone has decided to use S3/object storage.

For example, I have seen several comments "around the internet" that table-manager is going to be deprecated. My search to find specific details on the Loki roadmap or component life-cycle led me to this issue.

While I would like to be able to use DynamoDB for the Loki index (thus requiring table-manager) to limit the need for Persistent Volumes in a Kubernetes based deployment, I would like to know now if this combination of features is not going to be supported in the near future (we are still evaluating Loki).

If TSDB is the future, and components like table-manager are going to be deprecated, then these decisions should be clearly documented somewhere!

liguozhong commented 1 year ago

Hi all, this is a very important suggestion. In the upstream Cortex project, I found that a single tenant in blocks mode has an upper limit of about 20 million metrics, because the compactor is a bottleneck. Does our TSDB implementation have a similar problem, where the data size of a single tenant cannot exceed some threshold? Loki is designed not to have too many labels, but if there are too many, will the compactor become the bottleneck of the Loki system?

The Cortex project removed its Cassandra code before the compactor bottleneck was resolved, which prevents us from introducing Cortex into our infrastructure. I'm very worried about the same thing happening in Loki. We should slow down the removal of the Cassandra code. cc @owen-d

Currently, I am running a Loki tenant with a huge log ingest rate of about 3Gb/s, but I have no way of knowing how many labels there are. I keep the index in Cassandra. Cassandra scales well and has no single-point bottleneck, which puts me at ease when operating large log volumes.

Therefore, my suggestion is that we not delete the Cassandra index code so quickly. Various systems such as Cortex, Thanos, and Mimir have made various attempts at the compactor's single-point bottleneck problem. Can we wait until this compactor problem is completely resolved before deleting the Cassandra code?

https://aws.amazon.com/cn/blogs/opensource/scaling-cortex-with-parallel-compaction/
https://github.com/cortexproject/cortex/pull/4843

bmarinov commented 1 year ago

The situation with the outdated documentation and the feeling of a 'moving target' with regard to recommendations is unfortunate. I almost choked on my coffee reading that boltdb is now apparently deprecated.

I am running 2.7.x versions in several environments, and just today I deployed 2.7.4 yet again (the version running in prod) on staging for some extensive troubleshooting. This is a version from February this year, and at that point in time there were only a few references to TSDB. Boltdb was the way to go.

What I'm trying to communicate is a legitimate problem with the developer UX with regard to documentation, configuration and getting Loki up and running in a setup suitable for production.

Personally, I am happy to see boltdb go - since we moved to 2.7.x and configured compaction, we've had a ton of issues with queries. Turning on alerts with regular checks for the conditions exacerbated the problem greatly, making the alerts unusable:

  • attempting to query broken chunks ends up in 'database not open' and similar errors, going further back in time is OK
  • aforementioned broken chunks getting lost (gap in historical data which was queryable just a few moments ago)
  • race conditions with the compactor? index_12345/1683...400: no such file or directory
  • (Loki) error rate spikes whenever recent data is being queried (while being compacted?)
  • Error = failed to execute query A: rpc error: code = Unknown desc = database not open

And this is in the simplest possible, monolithic setup. Basically, a setup which should be easy to deploy and operate.

This is not meant as criticism, but having well-organized documentation for several common scenarios, deployment topologies and the recommended storage backend (at the time) should be of very high priority. Having to slap together the configuration from several different pages and sources is time-intensive and does not inspire much confidence.

TL;DR: sensible defaults should be preconfigured, and complete configuration examples for several common scenarios should be available, discoverable and kept up to date. Roadmap transparency would also be a big plus.

jerryjvl commented 1 year ago

I could not agree more with the sentiment above.

The hardest challenge I've found in setting up Loki in production (although, based on flipping through the docs, I expect the same to hold true for Mimir and Tempo once I get around to those) is that the documentation and examples are far from cohesive. When I try to correlate example configurations or official templates with the recommendations in the docs, to understand how to extrapolate partial setups into a full, cohesive setup for my own environment, I find I have to constantly backtrack and iterate as I discover that parts of the advice I had incorporated have been outdated by changes elsewhere that weren't clearly signposted.

I love the power of Loki, but it is extremely difficult to synthesize a functioning configuration file from the docs and examples.

And I'm certainly not averse to putting my effort where my mouth is, but based on the drift in the existing content, I'm not confident any external party could keep the docs aligned with the moving target of internal developmental changes to how key parts of the configuration function together.

liguozhong commented 1 year ago

> Personally, I am happy to see boltdb go - since we moved to 2.7.x and configured compaction, we've had a ton of issues with queries. Turning on alerts with regular checks for the conditions exacerbated the problem greatly, making the alerts unusable:
>
>   • attempting to query broken chunks ends up in 'database not open' and similar errors, going further back in time is OK
>   • aforementioned broken chunks getting lost (gap in historical data which was queryable just a few moments ago)
>   • race conditions with the compactor? index_12345/1683...400: no such file or directory
>   • (Loki) error rate spikes whenever recent data is being queried (while being compacted?)
>   • Error = failed to execute query A: rpc error: code = Unknown desc = database not open

👍 Thanks for sharing such a great account of your TSDB migration experience. I am worried about running into the problems you describe in my production Loki cluster, so I am still stuck on Cassandra. We are still studying the TSDB source code to understand more details and be better prepared to deal with the problems above.