mlco2 / codecarbon

Track emissions from Compute and recommend ways to reduce their impact on the environment.
https://mlco2.github.io/codecarbon
MIT License

Monitoring of Spark emissions via Spark plugin #600

Open tvial opened 3 months ago

tvial commented 3 months ago

Hi,

I am working on a prototype of a Spark plugin to report the energy consumption of executors. The underlying logic is similar to CodeCarbon's, although the computation method differs slightly: the executors' process scheduling is sampled regularly, converted to Wh using the TDP (provided or inferred), and aggregated by the driver. The total energy is published as a Spark metric, accessible via the REST API.
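To make the conversion concrete, here is a rough Python sketch of the per-sample computation, assuming a known or inferred TDP in watts and a fixed sampling interval (names and values are illustrative, not the plugin's actual code):

```python
# Illustrative only: attribute a share of package power to the executor for
# one sampling interval, then convert W*s to Wh.

def sample_energy_wh(load_ratio: float, tdp_watts: float, interval_s: float) -> float:
    """Energy attributed to the executor over one sampling interval, in Wh."""
    power_watts = load_ratio * tdp_watts       # executor's share of the TDP
    return power_watts * interval_s / 3600.0   # W * s -> Wh

# The driver would then aggregate per-executor samples into a total, e.g.:
total_wh = sum(sample_energy_wh(r, tdp_watts=95.0, interval_s=10.0)
               for r in [0.4, 0.7, 0.2])
```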

I wanted to know if you'd be interested in integrating it with CodeCarbon, for example with a Spark cluster as a new type of resource alongside CPU, GPU, or RAM. It would let CC factor in the energy mix and cloud provider data, which can be cumbersome to access from a private Spark cluster (it's better not to assume internet connectivity). And it would benefit from CC's ease of use, which is a strong adoption factor.
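Purely as a hypothetical illustration of what that could look like on the CC side, assuming the plugin's total energy is reachable through Spark's REST API (the class name, endpoint, and metric key below are made up for the sketch, not CodeCarbon's or ccspark's actual API):

```python
# Hypothetical sketch: poll the Spark driver's REST API and sum a per-executor
# energy metric. The "energyWh" key is an assumption about what the plugin
# would publish; the real metric name depends on the ccspark implementation.
import requests

class SparkClusterEnergy:
    def __init__(self, driver_url: str, app_id: str):
        self.metrics_url = f"{driver_url}/api/v1/applications/{app_id}/executors"

    def total_energy_wh(self) -> float:
        executors = requests.get(self.metrics_url, timeout=5).json()
        return sum(e.get("energyWh", 0.0) for e in executors)
```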

In any case, it's a prototype; it needs more testing and validation, and it only handles CPU for now (but many data engineering pipelines don't use GPUs anyway). Here it is: https://github.com/tvial/ccspark (Apache 2.0 license). Note that it embeds your CPU database for the TDPs; I'm open to removing it if you think it's a bad idea :)

Let me know if it can be of any help. Thanks!

SaboniAmine commented 3 months ago

Hello Thomas, that's a great idea! Thanks for this proposal; it would be appreciated by a lot of potential users. I'll take a deeper look at the implementation, but here are some initial questions.

tvial commented 3 months ago

Hi Amine, it's been a while :) Glad to hear from you as well!

Thanks for the encouraging feedback.

It's working as it is: I tested it locally and on a small dedicated Databricks cluster on Azure, both with very simple jobs (no real-world usage yet). I see no challenge in making it work in CI or other environments, as it has no dependencies of its own.

Regarding the measurements, it does not use RAPL, for the reason you mention. I think I read somewhere that some Databricks configs let you run executors as root, but I would not make this a requirement; maybe RAPL could be offered as an alternative method? The approach here is to read scheduled jiffies from /proc/stat and /proc/$pid/stat, take the difference between two samplings, and compute the ratio as the load attributed to the process over the sampling period. It should be reviewed by someone more expert in Linux and Spark's execution model.
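For reference, here is a minimal Python sketch of that sampling idea (not the plugin's actual code, which runs on the JVM side); the function names are illustrative:

```python
# Sample system-wide and per-process jiffies twice, and take the ratio of the
# deltas as the process's share of CPU time over the interval.
import time

def total_jiffies() -> int:
    with open("/proc/stat") as f:
        # First line: "cpu  user nice system idle iowait irq softirq ..."
        fields = f.readline().split()[1:]
    return sum(int(x) for x in fields)

def process_jiffies(pid: int) -> int:
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # comm (field 2) can contain spaces, so parse from after the closing ')'.
    fields = stat.rsplit(")", 1)[1].split()
    # utime and stime are fields 14 and 15 overall, i.e. indices 11 and 12
    # after stripping the pid and comm fields.
    return int(fields[11]) + int(fields[12])

def cpu_share(pid: int, interval_s: float = 1.0) -> float:
    t0, p0 = total_jiffies(), process_jiffies(pid)
    time.sleep(interval_s)
    t1, p1 = total_jiffies(), process_jiffies(pid)
    return (p1 - p0) / (t1 - t0) if t1 > t0 else 0.0
```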