py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Is there a way to work with a structural causal model where nodes have multiple values? For use in root cause analysis. #1222

Closed ronikobrosly closed 18 hours ago

ronikobrosly commented 5 days ago

Take, for example, the microservice latency RCA demonstration in the dowhy documentation (https://www.pywhy.org/dowhy/v0.8/example_notebooks/rca_microservice_architecture.html).

It's a fantastic example, but latency is just one of the "golden signals" used to determine service health. There are also status code counts (2xx, 4xx, 5xx) and traffic counts (how many users are visiting each microservice). Oftentimes in the observability space, we have all of this data at our disposal. Is there a way to incorporate these multiple data points for each node to do a more holistic root cause analysis? And if that's not available through the dowhy API itself, do you recommend a hacky approach to handle this? Thank you!

ronikobrosly commented 5 days ago

To clarify, I mean that at each time step or unique observation, and for each node, there could be multiple values (e.g., for node #1, in the first observation, the latency was 0.5 seconds, the CPU usage was 0.80, and some other metric was 1.25).

It seems to me this could be more holistic, but I'm sure it makes the math more challenging.

bloebp commented 4 days ago

Hi, there are different ways to think about this. One could simply see a node as multivariate (i.e., it has vectors of observations instead of single values), but DoWhy does not support this yet. Another perspective is to further 'unroll' the graph by incorporating the relationships between all these metrics. For instance, a metric like 'latency' might cause the latency of the calling node to change, but another metric like 'request count' works in exactly the opposite direction: the number of requests in a calling node causes the number of requests in the child node to increase, etc. So, optimally, we have a big graph where all these metrics are connected with each other. 'Latency' in a Website might be caused by 'Request' or 'CPU usage' of that website, etc.
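To make the 'unrolling' idea concrete, here is a minimal sketch in plain Python that turns a service call graph into a graph over per-service metric nodes. The service names, the metric names (`requests`, `cpu`, `latency`), and the helper `unroll_metric_graph` are all hypothetical and not part of DoWhy; the edge directions simply follow the description above.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def unroll_metric_graph(calls: Dict[str, List[str]]) -> List[Edge]:
    """Return cause -> effect edges between per-service metric nodes.

    `calls` maps each calling service to the services it calls.
    Node names have the (assumed) form "<service>.<metric>".
    """
    services = set(calls)
    for callees in calls.values():
        services.update(callees)

    edges: List[Edge] = []
    for service in sorted(services):
        # Within a service: request count and CPU usage may drive latency.
        edges.append((f"{service}.requests", f"{service}.latency"))
        edges.append((f"{service}.cpu", f"{service}.latency"))

    for caller, callees in sorted(calls.items()):
        for callee in callees:
            # Request counts propagate from the caller to the service it calls.
            edges.append((f"{caller}.requests", f"{callee}.requests"))
            # Latency propagates the other way: a slow callee slows its caller.
            edges.append((f"{callee}.latency", f"{caller}.latency"))
    return edges


# Hypothetical three-tier architecture: Website calls API, API calls DB.
calls = {"Website": ["API"], "API": ["DB"]}
edges = unroll_metric_graph(calls)
```

Each node in the resulting graph is a single scalar metric, so an edge list like this could be passed to `networkx.DiGraph(edges)` and wrapped in `gcm.StructuralCausalModel`, with one data column per `<service>.<metric>` node. Whether these particular edges are the right ones depends on the system being modeled.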

ronikobrosly commented 4 days ago

Thanks so much @bloebp , this makes a lot of sense! If you don't mind I have a follow-up question: Do you know of any implementation somewhere of structural models that handle multivariate/vector nodes?

bloebp commented 4 days ago

I am not aware of implementations that support that (i.e., a functional causal model per node that supports multivariate outputs). I think the general challenge for this is that the underlying model (e.g., a regression model in the case of additive noise models) needs to support this, and this means you need to restrict the types of models that are supported.

Multivariate support has been on my to-do list for a long time, but it is not straightforward to adjust the algorithms to the multivariate cases.
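As a rough illustration of what a functional causal model with multivariate outputs would involve, the sketch below fits a vector-valued additive noise model `y = f(x) + n` with one scalar parent and a linear `f`, then reconstructs the noise as a vector per observation. This is not DoWhy code: the function names and the toy data are hypothetical, and a linear model is only one of the restricted model classes that can produce vector outputs.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def fit_linear_anm(
    x: Sequence[float], y: Sequence[Vector]
) -> Tuple[List[float], List[float]]:
    """Least-squares fit of y_j = a_j + b_j * x + n_j for every output j."""
    n = len(x)
    mean_x = sum(x) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    intercepts: List[float] = []
    slopes: List[float] = []
    for j in range(len(y[0])):
        yj = [row[j] for row in y]
        mean_y = sum(yj) / n
        cov = sum((xi - mean_x) * (v - mean_y) for xi, v in zip(x, yj))
        slope = cov / var_x
        slopes.append(slope)
        intercepts.append(mean_y - slope * mean_x)
    return intercepts, slopes


def reconstruct_noise(
    x: Sequence[float],
    y: Sequence[Vector],
    intercepts: Sequence[float],
    slopes: Sequence[float],
) -> List[Vector]:
    """Noise is the residual vector y - f(x), one vector per observation."""
    return [
        tuple(v - (a + b * xi) for v, a, b in zip(row, intercepts, slopes))
        for xi, row in zip(x, y)
    ]


# Toy data (made up): the parent is a callee's latency in seconds; the child
# node carries two metrics per observation, (latency, CPU usage).
parent_latency = [0.1, 0.2, 0.3, 0.4]
child_metrics = [(0.5, 0.80), (0.7, 0.82), (0.8, 0.90), (1.1, 0.95)]

intercepts, slopes = fit_linear_anm(parent_latency, child_metrics)
noise = reconstruct_noise(parent_latency, child_metrics, intercepts, slopes)
```

The noise term here is a vector per observation rather than a single number, so any algorithm that relies on reconstructed noise has to handle the components jointly.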

ronikobrosly commented 4 days ago

I appreciate your thoughts. Thank you @bloebp

ronikobrosly commented 3 days ago

I'd love to take a stab at incorporating this sort of approach into dowhy with you, if you're open to it @bloebp . Naturally, I'd have to spend a bit of time learning your API more closely first.

bloebp commented 1 day ago

Yea, sure, that would be awesome! Let me know if you have any questions/get stuck, I believe there will be some subtle issues here and there. Also, feel free to message me on the PyWhy discord directly.

ronikobrosly commented 18 hours ago

Okay, fantastic @bloebp . I'll find you on Discord and reach out.