kube-reporting / metering-operator

The Metering Operator is responsible for collecting metrics and other information about what's happening in a Kubernetes cluster, and providing a way to create reports on the collected data.
Apache License 2.0

Question: AWS Cost and Usage Reports & Resource IDs #439

Open moserke opened 6 years ago

moserke commented 6 years ago

Do the Resource IDs need to be included in the AWS Cost & Usage Report? I cannot find a definitive answer on this. My gut says yes, because that is how correlation will be done, but I would like to confirm.

chancez commented 6 years ago

Yes. We use the resource IDs on the EC2 cost rows for correlation. The specific part where we use them is here: https://github.com/operator-framework/operator-metering/blob/master/charts/reporting-operator/templates/custom-resources/report-queries/aws-billing.yaml#L34-L40.
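As a rough illustration of why the resource IDs matter, here is a minimal Python sketch of that kind of correlation. It is not the query from the linked file. The row layout, the function names `instance_id_from_provider_id` and `cost_per_node`, and the sample values are all made up for the example. It assumes a node's providerID on AWS ends in the EC2 instance ID, and that the cost rows carry that same ID in their resource ID column.

```python
from collections import defaultdict


def instance_id_from_provider_id(provider_id):
    """Return the trailing EC2 instance ID from an AWS providerID string.

    Assumes the form 'aws:///<zone>/<instance-id>'.
    """
    return provider_id.rsplit("/", 1)[-1]


def cost_per_node(nodes, cost_rows):
    """Sum cost rows per node by matching resource IDs to instance IDs.

    nodes: mapping of node name -> providerID (hypothetical input shape).
    cost_rows: iterable of dicts with 'resource_id' and 'cost' keys
               (hypothetical; stands in for rows of the cost report).
    Rows with an empty or unknown resource ID cannot be attributed to a
    node and are skipped.
    """
    node_by_instance = {
        instance_id_from_provider_id(pid): name for name, pid in nodes.items()
    }
    totals = defaultdict(float)
    for row in cost_rows:
        node = node_by_instance.get(row.get("resource_id"))
        if node is not None:
            totals[node] += row["cost"]
    return dict(totals)


# Made-up sample data for illustration only.
nodes = {
    "node-a": "aws:///us-east-1a/i-0aaa",
    "node-b": "aws:///us-east-1b/i-0bbb",
}
rows = [
    {"resource_id": "i-0aaa", "cost": 1.5},
    {"resource_id": "i-0aaa", "cost": 0.5},
    {"resource_id": "i-0bbb", "cost": 2.0},
    {"resource_id": "", "cost": 9.0},  # no resource ID: cannot be correlated
]
print(cost_per_node(nodes, rows))
```

The last sample row shows what is lost when the report is generated without resource IDs: the cost is still there, but nothing ties it back to a node.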

moserke commented 6 years ago

Thanks! I thought that might be the case. This makes for really large cost reports. Is there anything to keep in mind in terms of metering performance with large reports?

chancez commented 6 years ago

We already partition the table containing the cost report by month, which generally helps. It's probably best to keep the span between reportingStart and reportingEnd to about a month. You may also want to increase Presto's memory, and it may help to run some dedicated worker replicas (which we don't currently document).
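To make the "about a month" advice concrete, here is a small Python sketch that splits a longer range into calendar-month windows, each of which could serve as the reportingStart and reportingEnd of a separate, smaller report. The helper name `month_windows` and the sample dates are assumptions for illustration; nothing here is part of the operator.

```python
from datetime import datetime, timezone


def month_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end),
    split at calendar-month boundaries."""
    cursor = start
    while cursor < end:
        # Compute the first instant of the month after `cursor`.
        if cursor.month == 12:
            next_month = cursor.replace(
                year=cursor.year + 1, month=1, day=1,
                hour=0, minute=0, second=0, microsecond=0,
            )
        else:
            next_month = cursor.replace(
                month=cursor.month + 1, day=1,
                hour=0, minute=0, second=0, microsecond=0,
            )
        # Never run past the requested end of the range.
        window_end = min(next_month, end)
        yield cursor, window_end
        cursor = window_end


# Made-up range for illustration: mid-November through the end of January.
start = datetime(2018, 11, 15, tzinfo=timezone.utc)
end = datetime(2019, 2, 1, tzinfo=timezone.utc)
windows = list(month_windows(start, end))
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Splitting at month boundaries, rather than every 30 days, is a deliberate choice here: it lines each window up with one monthly partition of the cost report table.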

We're also working on making it easier to aggregate across existing reports, which will make roll-ups from many smaller reports easier.

Lastly, we're still working on getting this deployed to one of our larger environments (thousands of namespaces, 100+ nodes), but that's one of our top priorities, and we'll be looking to document anything related to scaling that we learn from that process.

moserke commented 6 years ago

Are workers something that can be configured in the Metering object? Or is this something that will have to be managed ourselves?

moserke commented 6 years ago

Is there a "tuning" documentation somewhere? I think that would be really helpful for this project. Great stuff though, really easy to get started!

chancez commented 6 years ago

It's something you can configure on the Metering object (this is the part that's undocumented). And yes, the literal goal of deploying to a larger environment is to write the tuning document you're describing. It's difficult because we expose quite a few knobs, but we don't necessarily want users to use all of them, since ideally we automate away the need to tune these things. However, there is a gap right now, so it may be useful to document the knobs we expect people are most likely to want.

Here's a snippet of a custom configuration that I use. I expect that not everything here is necessary for you, but it should give you some tunables that you can mess with. https://gist.github.com/chancez/db4e2e4e5f7bcb20e195b439e0f5acf1

The key parts: anything with the replicas field set is something you may wish to adjust. Setting resources is very common and is already documented. taskMaxWorkerThreads is also useful; it translates to task.max-worker-threads in https://prestodb.io/docs/current/admin/properties.html. We're actively working on this aspect of our documentation, so we'll keep this issue open and update you when we have more documentation on it.

moserke commented 6 years ago

Awesome, thanks so much for this! I did find the worker info by poking through the helm charts.

On a side note, I was not able to use 0.8.0-latest because it seems to be hitting https://issues.apache.org/jira/browse/HADOOP-13811. We cannot authenticate to AWS because of a mismatch in library versions. I was able to go back to 0.7.0 and get things working. (I can open a new issue for this if that's better.)

chancez commented 6 years ago

Can you open another issue with the pod logs of the components you're seeing the errors in? Also, if you can provide your configuration with the credentials replaced by fake values, that's useful too, so I can verify that everything is set correctly in all the right spots.

chancez commented 6 years ago

https://github.com/operator-framework/operator-metering/pull/442 should fix the auth issue you're experiencing. It was related to a recent refactor that accidentally removed some environment variables from a few pods.

moserke commented 6 years ago

Confirmed, things are working on 0.8.0-latest now! Thanks!