thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Recommendations for Open Source Analytic (OLAP) system / API to mine Thanos/Prometheus data. #2682

Closed bwplotka closed 3 years ago

bwplotka commented 4 years ago

Hello Community :wave:

This topic was actually started by @maulikjs here, but I would love to reshape it around the main, key question: does anyone have a recommended open-source system or API for building reports (e.g. OLAP cubes) from various data sources?

Problem Statement

It looks like there are many use cases in our Prometheus community for leveraging Thanos long-term storage data for more advanced analytics purposes. The problem is that Prometheus (and therefore Thanos) is not designed for such use out of the box. The query API was designed for real-time queries for monitoring purposes, which as a side effect allows simple analytics and, thanks to Thanos downsampling, long-term characteristics.

PromQL, however, is too limited for general analytics purposes, and our "REST" query API is simply not designed to mine large data sets, which causes nondeterministic resource consumption for such use. The other problem is data discoverability. For analytics, it's necessary to know the "schema" you are working with, whereas the Prometheus Query API that Thanos serves does not allow discovery other than by very expensive querying (e.g. /label/<name>/values and /label/names exist but are very inefficient until this).
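For reference, a minimal Go sketch of what that discovery path looks like through the standard Prometheus HTTP API that Thanos Query serves (the host name below is a placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Fetch all metric names via the generic label-values endpoint.
	// On a large Thanos setup this fans out to every store, which is why
	// it is an expensive way to "discover the schema".
	resp, err := http.Get("http://thanos-query:9090/api/v1/label/__name__/values")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Status string   `json:"status"`
		Data   []string `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("found %d metric names\n", len(out.Data))
}
```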

Motivation

There are many systems in the wild, and this is also outside of my expertise. But since we have a huge community nowadays, I am sure someone has experience to share, or some preferences. Essentially: if you could have an integration between Thanos and some open-source warehouse/OLAP solution, what would it be?

Our personal interest: we are nowadays slowly looking for an analytics tool as well, potentially an open-source OLAP-style system that will allow us to produce meaningful, multidimensional reports, mainly from metrics. It's worth noting that metrics data would be just one data source; the actual requirement is to join data from various sources, like segment.io, Prometheus-based metrics, potentially logs, and others.

Requirements for such System:

Must-have:

Nice-to-have:

Summary

I personally only have experience with Google BigQuery, but it's not open source :hugs: Does anyone know of or recommend a system that meets those requirements and is willing to share their experience? How do you analyze data nowadays? From Segment, etc.?

cc @brancz @tomwilkie @beorn7 @kbsingh @pracucci @squat @domgreen

bwplotka commented 4 years ago

Proposed as a topic for the next Prometheus Ecosystem Community Meeting

Also shared on prometheus-user and cncf-sig-observability@lists.cncf.io

by-kpv commented 4 years ago

@bwplotka - I have explored the OLAP option for high-cardinality, multidimensional slice/dice data use cases, specifically Druid and ClickHouse. In practice, ClickHouse has comparatively less operational overhead. Another use case that inclined me toward OLAP was applying a data-cube model to integrate loosely coupled and semi-structured data sets. Happy to share more details at the community meeting, and many thanks for surfacing this concept.

raravena80 commented 4 years ago

(Sent this to the CNCF SIG-Observability ML too)

This is a great idea. Keep in mind that OLAPs are not necessarily used for monitoring and observability. For example, in the past I worked on implementing Apache Druid to collect mobile analytics. In this space, I can think of these projects: Druid, Pinot, Kylin, ClickHouse, Mondrian, Cubes (there might be others). Druid, Pinot, and Kylin are already part of the Apache Foundation, so that leaves the others that we could approach to join the CNCF.

Having said that, because OLAP systems can be quite complex, there are multiple components that may fall into the scope of other CNCF SIGs. For example, storing historical data (SIG-Storage), running your batch processor workers (SIG-Runtime), and serving your real-time and historical data (SIG-Network).

In any case, it would be great to approach the different projects so that the CNCF community is aware of how OLAPs work and foster general interest. 

robskillington commented 4 years ago

Adding my response from SIG-Observability too:

Glad to hear this topic brought up; it's something we think about a lot and have some previous experience with (running OLAP queries against monitoring and observability data).

At Uber we ETL'd subsets of data that users wanted to do large processing on into an existing platform. The data warehouse there supported Spark and Presto for interactive queries (i.e. pull raw data matching the query at query time) and HDFS (ingest raw data as it arrives via Kafka into HDFS and ETL/query there).

I'd love to see a project that does Prometheus Remote Read -> Spark for interactive or batch ETL against Prometheus data. You could also use Spark ETL to move data into other warehouses such as HDFS, Druid, Pinot, etc. Prometheus Remote Read -> Presto could be interesting as well, although Presto focuses more on interactive queries vs., say, Spark.
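For illustration, here is a rough Go sketch of the pull side of such an integration: issuing a sampled Prometheus remote-read request that an ETL job (or a Spark data source) could wrap. The endpoint URL, time range, and matcher are placeholders, and the snippet assumes the non-streamed remote-read protocol:

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// Ask for all "up" series in a fixed time window.
	req := &prompb.ReadRequest{
		Queries: []*prompb.Query{{
			StartTimestampMs: 1590000000000,
			EndTimestampMs:   1590003600000,
			Matchers: []*prompb.LabelMatcher{
				{Type: prompb.LabelMatcher_EQ, Name: "__name__", Value: "up"},
			},
		}},
	}
	data, err := proto.Marshal(req)
	if err != nil {
		panic(err)
	}

	// Remote read requests are snappy-compressed protobuf over HTTP POST.
	httpReq, err := http.NewRequest("POST", "http://prometheus:9090/api/v1/read",
		bytes.NewReader(snappy.Encode(nil, data)))
	if err != nil {
		panic(err)
	}
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Read-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	compressed, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	raw, err := snappy.Decode(nil, compressed)
	if err != nil {
		panic(err)
	}

	var readResp prompb.ReadResponse
	if err := proto.Unmarshal(raw, &readResp); err != nil {
		panic(err)
	}
	for _, result := range readResp.Results {
		for _, ts := range result.Timeseries {
			fmt.Println(ts.Labels, len(ts.Samples), "samples")
		}
	}
}
```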

The major issue with other systems in this space tends to be owning the whole data pipeline that results, e.g. Thanos/Cortex/M3/ecosystem would need to support an ongoing export of data into another stateful system such as Druid, Pinot, Clickhouse, etc. You then also have to store the data in these other warehouses with smart policies, otherwise a lot of users end up just whitelisting all of the data to be warehoused. Typically this ends up with really large datasets existing in two different systems and a significant investment to keep the pipeline flowing between the two.

That is why I think projects that support interactive and ETL workloads operating on the dataset in the Prometheus metrics store itself, and then saving results elsewhere, would be quite interesting, rather than warehousing the whole dataset themselves.

pachico commented 4 years ago

From the requirements you listed, I think ClickHouse might be the solution you are looking for. I have been using it successfully in production for several months already and I take full advantage of joining multiple data sources. Its setup is very straightforward. Things get a bit more complex when you need sharding or replication, since it relies on ZooKeeper, which means having another moving part.

bwplotka commented 4 years ago

The major issue with other systems in this space tends to be owning the whole data pipeline that results, e.g. Thanos/Cortex/M3/ecosystem would need to support an ongoing export of data into another stateful system such as Druid, Pinot, Clickhouse, etc. You then also have to store the data in these other warehouses with smart policies, otherwise a lot of users end up just whitelisting all of the data to be warehoused. Typically this ends up with really large datasets existing in two different systems and a significant investment to keep the pipeline flowing between the two. That is why I think projects that support interactive and ETL workloads operating on the dataset in the Prometheus metrics store itself, and then saving results elsewhere, would be quite interesting, rather than warehousing the whole dataset themselves.

@robskillington you touched on a very exciting part. This is actually the amazing novelty we would love to push toward as well. Instead of storing the same data in 5 places, can we keep it in just one? The idea would be to promote an efficient streaming read API rather than copying the data into different formats. This might mean more work on those Thanos/Cortex/M3/ecosystem projects, but given that we are already collaborating, it might be easier (: This is along the lines of what we try to push in the metrics/logs/tracing world, as mentioned by my team colleague @brancz: can we reuse a similar index for those three, since we collect different data... but from the same resources?

I have been using it successfully in production for several months already and I take full advantage of joining multiple data sources.

Thanks for your opinion @pachico , definitely worth taking a look :+1:

pachico commented 4 years ago

@bwplotka let me know if I can assist you with some use cases or quick setups for you to try out. Cheers

cevian commented 4 years ago

I'd also humbly suggest looking at TimescaleDB. It is based on PostgreSQL, so it supports full SQL for analytical (OLAP) or real-time queries, as well as joins with any other relational or time-series data. Several features make it appropriate for Prometheus data analysis: elastic horizontal scale-out across multiple nodes to support petabyte-sized workloads, native compression, and continuous aggregates (similar to Prometheus recording rules). Grafana natively ships with a TimescaleDB data source. Tableau, Looker, and other SQL-based viz tools also natively connect to TimescaleDB.

In particular, we just launched the Timescale-Prometheus project (design doc), which adds significant performance and usability improvements for using TimescaleDB with Prometheus data. (We would appreciate any feedback on the design doc.) We are also in the process of adding native PromQL support.

Would be great to find a way to work with Thanos. Currently, Prometheus could easily be set up to write to both Thanos and Timescale-Prometheus, and queries could be directed to either as needed, but I wonder if there is some more interesting integration we can develop.

Looking forward to discussing more at the Prometheus Ecosystem Community Meeting later today, if the opportunity arises.

ozanminez commented 4 years ago

Hello,

I have also been thinking, for a long time, about using Prometheus as the primary metrics storage in my metrics analytics/ML project, which uses Apache Spark for the main computing. Because Prometheus is not natively accessible from Spark or HDFS/Hadoop, I could not use Prometheus at that time. I know each project has roots in a different ecosystem, but I believe it would be a really powerful combination if Prometheus and Spark joined forces.

I think it would be more appropriate and relatively easy if we had an S3-compatible API to expose metrics as objects, because the HDFS/Hadoop binaries have native support for the S3 API. Spark uses the Hadoop binaries, so metrics would be easily accessible from Spark applications.

I will look into it in more detail, but with a quick search I found the MinIO object storage project; it has an S3-compatible API, it is written in Go, and it is an Apache-licensed open-source project. I looked but couldn't find how they implemented the S3 API, probably because I have very limited Go skills :)

I am also new to the Prometheus codebase, but from a quick look I see it relies heavily on Go objects and interfaces to communicate with storage. I guess it would be very hard, and most likely would have unnecessary performance effects, to implement a new API or extend the storage functions in Prometheus's actual storage operations, so I understand if we do not choose that approach. I also don't see any reason or value in doing that for Prometheus itself beyond analytics concerns.

On the other hand, if anyone has good knowledge of how the S3 API works and how it can be implemented, and who also knows how Prometheus storage works, this effort could be done in a separate branch whose aim is to provide only read-only data exposure via this API, with no or minimal effect on the project.

I will look further into the missing areas I mentioned earlier. If no one volunteers, I will try to implement a working solution, if I can manage to overcome the knowledge barrier around Go, Prometheus storage, and the S3 API. Even if I cannot, I will be around to share more detailed ideas or references to code on how it could be implemented. Please let me know if this seems interesting, or if there is a prior decision or limitation that rules it out. All kinds of feedback are welcome :)

Best regards.

ozanminez commented 4 years ago

After I read "Writing a Time Series Database from Scratch" by Fabian Reinartz, I have seen there is some logic with chunks and merge them properly at query time is already implemented on PromQL engine/query part. So instead of writing a low level api that reads directly from blocks, it will be much easier to write a wrapper on top of PromQL query results to serve them as S3 api. And file names on S3 request can be exactly the same PromQL query statement. So there is no reason to be at core of the project. It can be a separate application.

I am aware that I am drifting from the idea of long-term querying, which is the basis of this issue. My main idea has always been to share data between Apache Spark and Prometheus more natively in order to leverage Spark's ML capabilities, but that seems impossible without a radical change to how storage works on the Prometheus side. So I will continue to work separately on my project to write or find a way to expose Prometheus metrics via an S3 API, even if it is not performant or native. Best regards.

bwplotka commented 4 years ago

Thanks all for your feedback!

Looks like the general actionable starting points would be to:

  1. [ ] Pick one or two open-source OLAP systems or APIs, preferably something you already use heavily in production.
  2. [ ] Design and develop together a PoC of an easy-to-use integration point with Thanos and Prometheus.

So far in this discussion I have gathered multiple options:

We also had quite a nice discussion at the last Prometheus Ecosystem Meeting. You can find the recording here: https://twitter.com/PofkeVe/status/1268788026064416769 During the call (also in this thread), we had feedback from @robskillington, who mentioned two projects that would be a nice starting point:

What might be superior in the case of those two projects (Presto and Spark) is that, AFAIK, they process the dataset without needing to replicate the same data in a different format. On top of that, Spark already integrates with other systems (including TimescaleDB), so it might be a good "proxy" as well.

It also looks like this fits what @ozanminez is saying:

I have also been thinking, for a long time, about using Prometheus as the primary metrics storage in my metrics analytics/ML project, which uses Apache Spark for the main computing. Because Prometheus is not natively accessible from Spark or HDFS/Hadoop, I could not use Prometheus at that time. I know each project has roots in a different ecosystem, but I believe it would be a really powerful combination if Prometheus and Spark joined forces.

Yes! That sounds quite epic.

I will look into it in more detail, but with a quick search I found the MinIO object storage project; it has an S3-compatible API, it is written in Go, and it is an Apache-licensed open-source project. I looked but couldn't find how they implemented the S3 API, probably because I have very limited Go skills :)

Yup, we already use the minio client to fetch metrics from S3-compatible storages (e.g. AWS S3). Also, I am not sure why you need to "implement the S3 API"; what do you mean, and how would it help us? For Thanos, the data is already in object storage, so it's kind of in the correct place already. (: Feel free, @ozanminez, to reach me on Slack (@bwplotka); I can definitely give a hand on how to follow up on that, as we already have some ideas on how to connect to Spark. The last time I worked next to Spark was 5 years ago while contributing to Mesos... :see_no_evil: we could use some experience here!

I actually went ahead and created the #analytics channel on the CNCF Slack workspace; we can collaborate there as well. (: cc @pachico


As for TimescaleDB, @cevian, I love the idea of improving the integration with Prometheus, although there are many moving bits, and we might have some suggestions in terms of scalability! :heart: I am just curious: can you link us to some analytic uses of TimescaleDB? Since it's built on Postgres, I would see it as superior for OLTP rather than OLAP. Any sources to follow up on how it solves analytics use cases? :hugs:


To sum up:

ozanminez commented 4 years ago

@bwplotka Thanks for offering collaboration. I will reach out when I have more progress on a working solution. I am continuing to read the codebases of existing Spark integrations for products like Elasticsearch, Cassandra, Couchbase, etc.

The S3 API that I mentioned earlier was not for accessing external storage to store TSDB files. Even if TSDB files are stored with an existing S3 provider, the files are in a custom format and Spark/Hadoop is not aware of this TSDB file format. The S3 API I proposed was just a read-only serving API for metrics data stored on disk, aimed at OLAP systems or applications, because Spark/Hadoop has native support for accessing S3-compatible storage. It assumes TSDB files (chunks, etc.) are already converted to a human-readable format (e.g. CSV or Parquet files) via new code on top of the PromQL/HTTP API, so the application I have in mind could use the standard Prometheus client libraries. Its only aim is to expose ready-to-use metrics data in such a format. Of course, it is likely to be inefficient from a performance standpoint. This S3 API would just be "easy to use" for anyone who wants to integrate with another system and leverage an existing codebase of Spark jobs or Python scripts. I will personally continue to work on this idea, but I am no longer proposing it within the scope of this issue, because it has become a separate integration client application.

Yesterday I found another low-level approach that has not been mentioned before, and it is more relevant to this issue with regard to the "zero-copy" idea. It seems much more efficient, though harder to implement, but I believe it would be worth the effort. This solution involves memory-level conversion using the Apache Arrow project.

A quick Apache Arrow project summary from its website and Wikipedia: Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries. The project includes native software libraries written in C++, C# .NET, Go, Java, JavaScript, and Rust with bindings for other programming languages, such as Python, R, and Ruby. Arrow allows for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems.

The process flow of objects and code that I am proposing is something like this: Prometheus TSDB files > (existing Prometheus code) > Prometheus storage API memory format > (new code) > Apache Arrow memory format > (existing or new Apache Arrow code; according to their documentation it is easy, not validated by myself) > any of Spark/pandas/Drill/Impala/HBase/Cassandra/Kudu/Parquet file/memory format

Apache Arrow has bindings for Go: https://godoc.org/github.com/apache/arrow/go/arrow
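A minimal sketch of the middle "(new code)" step in the flow above, assuming the github.com/apache/arrow/go/arrow bindings: packing decoded (timestamp, value) samples into an Arrow record that downstream tools could consume:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	// Samples as they might come out of the Prometheus storage/query layer.
	timestamps := []int64{1590000000000, 1590000015000, 1590000030000}
	values := []float64{1, 0, 1}

	pool := memory.NewGoAllocator()
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp_ms", Type: arrow.PrimitiveTypes.Int64},
		{Name: "value", Type: arrow.PrimitiveTypes.Float64},
	}, nil)

	b := array.NewRecordBuilder(pool, schema)
	defer b.Release()

	b.Field(0).(*array.Int64Builder).AppendValues(timestamps, nil)
	b.Field(1).(*array.Float64Builder).AppendValues(values, nil)

	// A columnar record that can be shared via Arrow IPC without re-serialization.
	rec := b.NewRecord()
	defer rec.Release()

	fmt.Println(rec.NumRows(), "rows,", rec.NumCols(), "columns")
}
```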

I just wanted to mention this option before we choose one to start with, in case someone finds it exciting or has prior technical expertise at the memory level.

Best regards.

bwplotka commented 4 years ago

Oh wait, so by S3 API you mean some layer that exposes bytes in the desired format of the target system (e.g. Spark), but when the API is called we actually create those bytes on the fly? (: That's actually a neat idea; if made in a highly performant way, it might be feasible. Is that what you mean?

This solution includes memory level conversion using the Apache Arrow project.

Interesting, I will take a look as well.

Anyway, thank you! Just make sure, Ozan, that you send us a message on Slack so we can sync better, please (:

Kind Regards, Bartek


himanshpal commented 4 years ago

@bwplotka - A Prometheus connector was recently merged into the prestosql.io project; it's now available in the latest release and supports pushdown at the connector level. If we just build upon it and improve it, I guess that should also work.

A remote-write async producer coupled with the Prometheus connector can be used for serving real-time aggregated metrics, which are generally generated by Flink streaming SQL or Spark Structured Streaming SQL at a 1-minute interval. I worked on a PoC, and so far it has worked well.

More here

himanshpal commented 4 years ago

Oh wait, so by S3 API you mean some layer that exposes bytes in the desired format of the target system (e.g. Spark), but when the API is called we actually create those bytes on the fly? (: That's actually a neat idea; if made in a highly performant way, it might be feasible. Is that what you mean?

Yes, Arrow can be used for reading & writing the data in prometheus tsdb format.

Arrow Flight is another approach for serving the same data across machines over gRPC in the Arrow format, without needing serialization and deserialization.

raravena80 commented 4 years ago

Yes, Arrow can be used for reading & writing the data in prometheus tsdb format.

This would be awesome. It would also be great to find out the memory limitations depending on the amount of tsdb data that needs to be processed for a given case.

cevian commented 4 years ago

Hi @bwplotka, Let me second @ozanminez in saying that the offer of collaboration is great. This is what open-source is all about.

I'd love to hear your suggestions about scalability. If you don't want to pollute this thread, I am also on CNCF Slack (@Matvey Arye) or email (mat@timescale.com). Would really appreciate any suggestions you may have.

I am just curious: can you link us to some analytic uses of TimescaleDB? Since it's built on Postgres, I would see it as superior for OLTP rather than OLAP. Any sources to follow up on how it solves analytics use cases?

The vast majority of our use cases are analytical in nature. For example, LAIKA uses TimescaleDB for metric analysis spanning years; SAKURA Internet is another relevant use case. While Postgres was originally built for OLTP, it has always had language features for OLAP analysis, and the past few releases have added JIT support to speed up these types of queries. On top of that, TimescaleDB has added two features to speed up this type of analysis: continuous aggregates (a feature akin to recording rules) and columnar compression.

While TimescaleDB isn't built for fast full scans over all your data like some data-warehouse systems, it performs well for narrow queries (a long time range over a subset of data) as well as wide queries (a small time range over broad sets of data). It also avoids the long ingest/ETL of many data-warehouse solutions. In a time-series context, we find that this trade-off fits users' analytical needs well: users often want dashboards of what's happening right now as well as analytics over long time horizons for some subset of data dimensions and properties (e.g. specific metrics or series).

afilipchik commented 4 years ago

What about letting Spark/Presto access the chunks directly and writing a plugin for them that understands the native Prometheus file format? This would scale as long as storage scales (and S3/GCS/... are pretty scalable) and would not bottleneck on Thanos's compute.

Fetching data from OLTP systems during OLAP queries is harder to scale, as the load is not easily predictable, and scans are much faster when executed over raw storage (the Cassandra -> Spark integration is a good example).

szucsitg commented 4 years ago

I want to add two more options that haven't been considered: VictoriaMetrics (https://victoriametrics.com/), which is more mature, and QuestDB (https://questdb.io/).

bwplotka commented 4 years ago

@afilipchik Spark/Presto plugin sounds like something quite interesting indeed!

@szucsitg in what way is VictoriaMetrics more mature? (: In what respect? VM is quite nice, but it's not analytics-oriented either. Especially since, the last time we checked, there was no good long-term storage story (disk storage only). As for QuestDB, not sure; worth checking out (:

valyala commented 4 years ago

in what way is VictoriaMetrics more mature? (: In what respect? VM is quite nice, but it's not analytics-oriented either

VictoriaMetrics supports exporting raw data via /api/v1/export, so the exported data can be fed into analytics systems. The export format is JSON lines. There are plans to add the ability to export data in the Parquet file format in the future.
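For reference, a hedged Go sketch of consuming that export endpoint; each line of /api/v1/export is a standalone JSON document with the metric's labels plus parallel values/timestamps arrays (the host and the match[] selector are placeholders):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net/http"
)

// exportLine models one JSON-lines record from /api/v1/export.
type exportLine struct {
	Metric     map[string]string `json:"metric"`
	Values     []float64         `json:"values"`
	Timestamps []int64           `json:"timestamps"`
}

func main() {
	resp, err := http.Get("http://victoriametrics:8428/api/v1/export?match[]=up")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // individual lines can be large
	for sc.Scan() {
		var line exportLine
		if err := json.Unmarshal(sc.Bytes(), &line); err != nil {
			panic(err)
		}
		fmt.Printf("%v: %d samples\n", line.Metric, len(line.Values))
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```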

Especially since, the last time we checked, there was no good long-term storage story (disk storage only)

It looks like apples (object storage vs. block storage) and oranges (short-term vs. long-term) are intermixed here :) VictoriaMetrics is designed as long-term storage and works perfectly on any block storage (bare-metal disks, GCP persistent disks, EBS, etc.). It is possible to use S3 or GCS for VictoriaMetrics data via projects such as https://objectivefs.com/ .

bwplotka commented 4 years ago

VictoriaMetrics supports exporting raw data via /api/v1/export, so the exported data can be fed into analytics systems. The export format is JSON lines. There are plans to add the ability to export data in the Parquet file format in the future.

Nice, @valyala! Can you share details about the plan? What exactly would it look like? Just converting from the VM format to Parquet, right? Asking because we are discussing those bits with @ozanminez and @iNecas (see the #analytics channel on CNCF Slack). The current idea is to have a way to do this without replicating data (:

ozanminez commented 4 years ago

Hello, I have recently finished my research on this topic. This is a summary of all of it:

- Early (but no longer suggested) idea: exposing metrics as files via an S3-powered API. It is possible, but I know it is an uncommon approach. The files and directories could be CSV or JSON metric files, grouped by day or month in the file name and converted on the fly from Prometheus data.

- Preferred: I prefer a Spark + Python/pandas integration because of the existing, rich ML and anomaly-detection libraries.

- Finding: direct Spark integration via a Java/Scala library is a common approach, but it is hard because Spark internals are tightly coupled to both the Scala and Spark versions.

- Alternative suggestion: a batch job for historical data vs. streaming for newly arriving data. A simple command-line tool could cover both cases, in "the Unix way": tail-like functionality, plus a --format option to produce JSON or CSV files.

- Good to mention: there is a whole analytics platform called "Iguazio" focused on ML over Prometheus data. They have nice tools and libraries that could be used directly or just for inspiration.

- Opinion: Apache Arrow is a good and powerful serialization and data-conversion library that could be used.

Regards.

iNecas commented 4 years ago

Nice summary. Some of my $0.02

Adding --format parquet as native support (including common compression as supported in https://github.com/xitongsys/parquet-go) would be super-useful.
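A rough sketch of what such an export path could look like with xitongsys/parquet-go; the struct tags and compression option follow that library's conventions, but treat the exact tags as an assumption to verify against the version in use:

```go
package main

import (
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/writer"
)

// Sample is one exported row: a flattened series identifier plus a data point.
type Sample struct {
	Series      string  `parquet:"name=series, type=BYTE_ARRAY, convertedtype=UTF8"`
	TimestampMs int64   `parquet:"name=timestamp_ms, type=INT64"`
	Value       float64 `parquet:"name=value, type=DOUBLE"`
}

func main() {
	fw, err := local.NewLocalFileWriter("metrics.parquet")
	if err != nil {
		panic(err)
	}
	defer fw.Close()

	pw, err := writer.NewParquetWriter(fw, new(Sample), 1)
	if err != nil {
		panic(err)
	}
	pw.CompressionType = parquet.CompressionCodec_SNAPPY

	// In a real exporter these rows would come from the Store API / remote read.
	rows := []Sample{
		{Series: `up{job="prometheus"}`, TimestampMs: 1590000000000, Value: 1},
		{Series: `up{job="prometheus"}`, TimestampMs: 1590000015000, Value: 1},
	}
	for _, r := range rows {
		if err := pw.Write(r); err != nil {
			panic(err)
		}
	}
	if err := pw.WriteStop(); err != nil {
		panic(err)
	}
}
```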

When doing the extraction, one would probably need to specify the specific metric, frequency, aggregation function, and perhaps some sub-labels to select (not unlike https://github.com/v3io/v3io-tsdb#how-to-use or https://m3db.github.io/m3/integrations/graphite/ ; maybe different formats, but the principle might be similar).

I'm wondering whether a pattern similar to recording rules would work here as well, just used for exporting instead of creating new metrics. In that case, one could export based on an arbitrary PromQL query (dunno how feasible that is implementation-wise, though).

From a scale-out perspective, allowing these operations over the storage, vs. doing API requests against Thanos, seems like the more flexible solution, letting users use the best tools (storage, batch/stream processing) they have at hand to do the conversion.

bwplotka commented 4 years ago

Some opinions from @ozanminez on Slack (:

Thanks. I had seen your question, but I had no answer to it at that moment, so I thought about it a little more. After a couple of sub-questions I realized that we are avoiding file creation and file copying. We have JSON, CSV, or Parquet file formats that are common in the analytics world but not efficient in our case. Our data storage format is efficient but not common. So if we fully choose the way of the analytics world, we have to convert our uncommon format to a common file format eventually.

On the other hand, on-the-fly conversion involves in-memory work and some extra development against the internals of other analytics tools. We can use a common intermediate memory format like Apache Arrow, but eventually we have to convert the data to each tool's own internal in-memory format. pandas has an internal in-memory DataFrame format; Spark has an internal in-memory RDD format. So even if we use Apache Arrow, this means converting our uncommon format to another internal uncommon format, just via an intermediate common memory format. Simply put, it is not a common thing, but even so, every other data-source application writes its own custom integration libraries for every analytics tool. Writing a custom library is the most common method of integration. In the library/connector approach, if we decide not to use the intermediate Apache Arrow format, it is the same method: converting our uncommon data format directly to the other uncommon in-memory format of the analytics tool.

So finally, if we want a more common way, I think we should consider file creation with conversion to a common file format, not an API. In my mind, an API is mostly a method of sharing/copying data, and in that case sharing via an API is also copying data over the network in a particular format. The main focus should be the format. Almost every analytics tool can read common file formats that are accessible via the S3 API or physically, but there is no common API without common file formats. As far as I have seen, analytics tools have no common API that allows data sharing/copying without involving files or internal in-memory data objects.

My early S3-server suggestion involves file-format conversion on the fly, but avoids saving the actual file to the filesystem and makes it accessible via the S3 API as a common file format. But if we accept saving the file, it can be a simple, standard command-line tool included in our distribution. For the command-line option, the Parquet file format is more efficient than the other common file-format options. Every data-science notebook I have seen with Python code, Spark, and pandas either uses a common file format accessed via the S3 API or uses an additional custom-written library to connect specifically to another particular database or application. This is not a definitive answer, but it mostly reflects the analytics world we are in now. I hope it helps. Please contact me if you have any more questions or requests, and make sure to keep me in the loop on any updates on this topic. Thanks again.

We also briefly discussed this at the Thanos community meeting, and one question was "how will this scale?", which is a fair point (: It sounds like this method involves a bit of computation that might be reduced if we used an API straight from our format instead of converting. The question is what this overhead would be, because functionally it makes sense.

Also, this solution locks into Thanos, which is great from our side but not so great for the ecosystem community, e.g. @robskillington's m3db, or even Prometheus itself (: If instead we worked on some common remote read API and a connector to some common API (which one?), it would allow m3db, Cortex, Thanos, and even Influx to use it easily. cc @pracucci @metalmatze

valyala commented 4 years ago

Can you share details about the plan? What exactly would it look like? Just converting from the VM format to Parquet, right? Asking because we are discussing those bits with @ozanminez and @iNecas (see the #analytics channel on CNCF Slack). The current idea is to have a way to do this without replicating data (:

Yes, VictoriaMetrics would convert the requested data on the fly to the Parquet format. For example, when a client requests /api/v1/export/parquet?match[]=foo{bar="baz"}&start=...&end=..., the handler would stream the data matching foo{bar="baz"} over the given time range [start...end] in Parquet format.

It is interesting to know about other approaches though...

ozanminez commented 4 years ago

After some thinking, I have another improved/combined proposal for the process and elements needed to make metrics available to other analytics and ecosystem projects:

[Files] The Prometheus metrics storage directory and chunked metric files >>> [Our new serving/exporting application] Reads the storage directory and files directly, from the S3 API or the physical machine, and serves them via a TCP server in our new temporary in-memory format (probably protobuf, or optionally Apache Arrow in the future) >>> [Any other client application using our new library for a specific language] Connects using a client library in Python/Java/etc., reads the protobuf format, and converts it in memory into the standard language-specific object format (list/dict/hashmap, etc.) of the target language or framework >>> Result: integrated/usable/accessible metrics in any target language/framework

We can keep all the logic of reading our internal metrics storage format and converting it to the standard objects of any programming language in one application, so it can be reused in the future by all our ecosystem projects. Sample usage could look like this:

our-new-cmd-application --export-metrics-as-file --format ...
our-new-cmd-application --serve/expose-metrics-via-tcp-server --port ...

This solution can also scale by launching multiple instances of the application on different ports or on different servers/containers, and clients can pass a list of servers in the connection string to scrape all metrics data from the different servers with a single library call.
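A very rough Go sketch of the serving side of that proposal, reusing the existing prompb.TimeSeries message as the wire type and a simple length-prefixed framing; both choices are assumptions for illustration, not an agreed design:

```go
package main

import (
	"encoding/binary"
	"log"
	"net"

	"github.com/gogo/protobuf/proto"
	"github.com/prometheus/prometheus/prompb"
)

// writeFrame writes one length-prefixed protobuf message to the connection.
func writeFrame(conn net.Conn, ts *prompb.TimeSeries) error {
	payload, err := proto.Marshal(ts)
	if err != nil {
		return err
	}
	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(payload)))
	if _, err := conn.Write(lenBuf[:]); err != nil {
		return err
	}
	_, err = conn.Write(payload)
	return err
}

func main() {
	ln, err := net.Listen("tcp", ":9999")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			// In a real exporter these series would be read from the TSDB /
			// Store API instead of being hard-coded here.
			ts := &prompb.TimeSeries{
				Labels:  []prompb.Label{{Name: "__name__", Value: "up"}},
				Samples: []prompb.Sample{{Timestamp: 1590000000000, Value: 1}},
			}
			if err := writeFrame(c, ts); err != nil {
				log.Println(err)
			}
		}(conn)
	}
}
```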

bwplotka commented 4 years ago

Thanks! I think we might be going too low level here. The problem is with this:

Connects using a client library in Python/Java/etc., reads the protobuf format, and converts it in memory into the standard language-specific object format (list/dict/hashmap, etc.) of the target language or framework

In that case, [our new serving/exporting application] has to implement conversion from the "temporary memory model" to every existing format in the world, which brings us back to the same position we were in with just TSDB blocks in S3, no? (: I think the key part is to make sure we keep things simple while supporting the major analytics systems :hugs:

ozanminez commented 4 years ago

Thanks! I think we might be going too low level here. The problem is with this:

Connects using a client library in Python/Java/etc., reads the protobuf format, and converts it in memory into the standard language-specific object format (list/dict/hashmap, etc.) of the target language or framework

In that case, [our new serving/exporting application] has to implement conversion from the "temporary memory model" to every existing format in the world, which brings us back to the same position we were in with just TSDB blocks in S3, no? (: I think the key part is to make sure we keep things simple while supporting the major analytics systems 🤗

Yes, you are right, this solution is too low-level, but the goal was to be available to every framework in the target language.

At least I can summarize and point out some takeaways from this and my older proposals.

To summarize all of my comments on this issue: it seems to me there is no simple, scalable, and efficient method for making TSDB metrics accessible to most or all analytics systems. In every option that comes to my mind, at least one of those properties has to be given up.

Best Regards.

bwplotka commented 3 years ago

I did some research and found a slightly different solution.

"Recently" new Apache Arrow Flight gRPC API was introduced. This aims to essentially have and network protocol that is fast and efficient for columnar use. There are already clients e.g https://github.com/apache/arrow/blob/master/python/pyarrow/flight.py that are capable to fetch data through this and store in Arrow memory format. From this, there is support to all of the other projects like Spark, Panda, R, Parque, etc.

On top of that, there is https://github.com/apache/arrow/tree/master/go/arrow, which is slowly being built and improved by InfluxDB and might give similar capabilities (e.g. writing to a Parquet file).

One idea would be to have a proxy that:

With this, I think that if m3db (@robskillington) also exposed gRPC Flight, we would have some kind of unified API we could build the ecosystem on. Right now it looks like support is there mostly in Python, but Flight is slowly being supported in other languages as well (e.g. Java).

Thoughts?

Next steps would be to check whether gRPC Flight is as good as they say, and how hard it is to, e.g., transform remote read into it (:

bwplotka commented 3 years ago

Separately to "what format we should use" there is also performance aspect. Started discussion about batch APIs here: https://github.com/thanos-io/thanos/issues/2948

iNecas commented 3 years ago

FYI, here is some e2e code for exporting a Prometheus metric into Parquet using the Thanos Store API: https://github.com/thanos-community/obslytics/pull/3. The interfaces should be generic enough to allow adding different input types (Thanos Store API, Prometheus remote read) and output types (Parquet, Arrow, …). At the core, there is the ingestion logic that transforms the TSDB data into a tabular format through some aggregations.

There are still a lot of rough edges, but it should be enough to get something useful out of it and see which direction to push this forward.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 3 years ago

Closing for now as promised, let us know if you need this to be reopened! 🤗