Closed ronikobrosly closed 18 hours ago
To clarify, I mean that at each time step or unique observation, and for each node, there could be multiple values (e.g., for node #1, in the first observation, the latency was 0.5 seconds, the CPU usage was 0.80, and some other metric was 1.25).
It seems to me this could be more holistic, but I'm sure it makes the math more challenging.
Hi, there are different ways to think about this. One could simply see a node as multivariate (i.e., it has vectors of observations instead of single values), but DoWhy does not support this yet. Another perspective is to further 'unroll' the graph by incorporating the relationships between all these metrics. For instance, a metric like 'latency' in a called node might cause the latency of the calling node to change, but another metric like 'request count' works in exactly the opposite direction: the number of requests in a calling node causes the number of requests in the child node to increase, etc. So, optimally, we have one big graph where all these metrics are connected with each other. The 'latency' of a Website node might be caused by the 'request count' or 'CPU usage' of that website, etc.
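To make the 'unrolling' idea concrete, here is a minimal sketch using only the Python standard library. The service names, the three metrics, and the within-service edges (requests → CPU → latency) are assumptions for illustration, not something DoWhy prescribes. Every (service, metric) pair becomes its own scalar node, latency edges point from the called service to the caller, and request-count edges point from the caller to the called service.

```python
from graphlib import TopologicalSorter

# Hypothetical call topology: caller -> services it calls.
CALLS = {
    "Website": ["API"],
    "API": ["Auth Service", "Product DB"],
}


def unroll(calls):
    """Return directed (cause, effect) edges between 'service.metric' nodes."""
    services = set(calls) | {c for callees in calls.values() for c in callees}
    edges = set()
    for s in services:
        # Assumed within-service structure: load drives CPU, both drive latency.
        edges.add((f"{s}.requests", f"{s}.cpu"))
        edges.add((f"{s}.requests", f"{s}.latency"))
        edges.add((f"{s}.cpu", f"{s}.latency"))
    for caller, callees in calls.items():
        for callee in callees:
            # Requests propagate downstream: caller -> callee.
            edges.add((f"{caller}.requests", f"{callee}.requests"))
            # Latency propagates upstream: callee -> caller.
            edges.add((f"{callee}.latency", f"{caller}.latency"))
    return edges


edges = unroll(CALLS)

# Map each node to its direct causes; static_order() raises CycleError
# if the unrolled graph is not a DAG.
predecessors = {}
for cause, effect in edges:
    predecessors.setdefault(cause, set())
    predecessors.setdefault(effect, set()).add(cause)
order = list(TopologicalSorter(predecessors).static_order())
```

The resulting edge set can be handed to `networkx.DiGraph(list(edges))` to build the causal graph, with one data column per `service.metric` node. Whether each assumed edge is justified depends on the actual system.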
Thanks so much @bloebp, this makes a lot of sense! If you don't mind, I have a follow-up question: do you know of any implementation of structural models that handle multivariate/vector nodes?
I am not aware of implementations that support that (i.e., a functional causal model per node that supports multivariate outputs). I think the general challenge for this is that the underlying model (e.g., a regression model in the case of additive noise models) needs to support this, and this means you need to restrict the types of models that are supported.
Multivariate support has been on my to-do list for a long time, but it is not straightforward to adjust the algorithms to the multivariate cases.
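As a rough illustration of why the model class gets restricted (the notation here is assumed and not taken from DoWhy), a scalar additive noise model for node $j$ with parents $\mathrm{PA}_j$ and its vector-valued analogue could be written as:

$$
X_j = f_j(\mathrm{PA}_j) + N_j, \qquad N_j \perp \mathrm{PA}_j
$$

$$
\mathbf{X}_j = \mathbf{f}_j(\mathrm{PA}_j) + \mathbf{N}_j, \qquad \mathbf{X}_j,\ \mathbf{N}_j \in \mathbb{R}^d, \qquad \mathbf{N}_j \perp \mathrm{PA}_j
$$

In the second form, $\mathbf{f}_j$ has to be a multi-output regressor, and $\mathbf{N}_j$ is a $d$-dimensional noise vector whose components may depend on each other, so both the regression model and the noise distribution need multivariate support.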
I appreciate your thoughts. Thank you @bloebp
I'd love to take a stab at incorporating this sort of approach into DoWhy with you, if you're open to it @bloebp. I'd have to spend a bit of time learning your API more closely at first, naturally.
Yeah, sure, that would be awesome! Let me know if you have any questions or get stuck; I believe there will be some subtle issues here and there. Also, feel free to message me on the PyWhy Discord directly.
Okay, fantastic @bloebp. I'll find you on Discord and reach out.
Take, for example, the microservice latency RCA demonstration in the DoWhy documentation (https://www.pywhy.org/dowhy/v0.8/example_notebooks/rca_microservice_architecture.html).
It's a fantastic example, but latency is just one of the "golden signals" used to determine service health. There are also status code counts (2xx, 4xx, 5xx) and traffic counts (how many users are visiting each microservice). Oftentimes in the observability space, we have all of this data at our disposal. Is there a way to incorporate these multiple data points for each node to do a more holistic root cause analysis? And if that's not available through the DoWhy API itself, do you recommend a hacky approach to handle this? Thank you!