thanos-community / obslytics

Tools and Services allowing seamless usage of Observability data from Prometheus, Thanos, Cortex, M3DB, Loki and more!

Evaluate use of INT64 (and setting pq schema as UINT_64) #33

Open wseaton opened 2 years ago

wseaton commented 2 years ago

The INT64 Go type is used in a lot of places, and it translates to the UINT_64 logical type in Parquet. That annotation is incompatible with Spark <= 3.1:

org.apache.spark.sql.AnalysisException: Parquet type not supported: INT64 (UINT_64)
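For anyone wanting to reproduce the incompatibility outside of obslytics, here is a minimal sketch using pyarrow (the column name `timestamp_ms` is hypothetical, not obslytics's actual schema):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny table with a single uint64 column; pyarrow stores it as the
# Parquet physical type INT64 annotated as unsigned.
table = pa.table({"timestamp_ms": pa.array([1640995200000], type=pa.uint64())})
pq.write_table(table, "/tmp/sample.parquet")

# The Parquet-level schema shows the annotation Spark <= 3.1 rejects,
# i.e. an int64 column marked Int(bitWidth=64, isSigned=false) / UINT_64.
print(pq.ParquetFile("/tmp/sample.parquet").schema)
```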

On its own this is fine, but the type seems to be used even in cases where the underlying data is not very large (say, a max value of 2048.0). Is there a way to run an aggregation on the entire series before export and downcast to the smallest suitable logical type? Or maybe even issue a Prometheus query across a wider date range to grab the max, to help prevent schema (read: type) drift? I think this may also be partly to blame for the large memory usage on export.
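To sketch what that downcast could look like, here is a pyarrow stand-in for the Go export path (`smallest_int_type` and `downcast_and_write` are hypothetical names, not obslytics APIs): scan each unsigned column's observed max, then cast to the narrowest signed type that holds it before writing.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

def smallest_int_type(max_value: int) -> pa.DataType:
    # Narrowest *signed* type that holds max_value (values assumed non-negative);
    # signed types avoid the UINT_* annotations Spark <= 3.1 cannot read.
    if max_value <= 2**31 - 1:
        return pa.int32()
    return pa.int64()  # values above 2**63 - 1 would still need special handling

def downcast_and_write(table: pa.Table, path: str) -> None:
    for i, field in enumerate(table.schema):
        if pa.types.is_unsigned_integer(field.type):
            observed_max = pc.max(table.column(i)).as_py() or 0
            target = smallest_int_type(observed_max)
            table = table.set_column(i, field.name, table.column(i).cast(target))
    pq.write_table(table, path)
```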

There's currently no great Spark workaround for the above, since we need spark.read.option("mergeSchema", "true") to account for the schema drift internal to Prometheus. The best solution is to use bleeding-edge Spark 3.2.0, which has its own problems :grimacing:
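For context, the read path we depend on looks roughly like this (the S3 path is a placeholder). On Spark <= 3.1 the read fails with the AnalysisException above before mergeSchema can even help; Spark 3.2 reportedly reads the unsigned column as decimal(20,0) instead:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("obslytics-read").getOrCreate()

# mergeSchema reconciles per-file schema drift across the exported files.
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("s3://bucket/obslytics-export/")
)
df.printSchema()
```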