py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Question about Support for Survival/Time-to-Event Data #1285

Open lict99 opened 4 days ago

lict99 commented 4 days ago

I am writing to express my appreciation for the excellent work on the package, which has greatly facilitated causal inference in Python. As a user of the package, I have been able to successfully apply it to various datasets and problems.

However, I was wondering if it would be possible to extend DoWhy's capabilities to support survival or time-to-event data? Currently, the package appears to focus on traditional outcomes such as binary, continuous, or count responses. Time-to-event data is a common outcome type in many fields (e.g., medicine, economics, sociology), and I believe that supporting this would greatly enhance the utility of DoWhy.

I understand that adding new features can be a significant undertaking, but I was hoping to get some insight into whether there are any plans to support survival analysis or if you could recommend alternative packages or methods for causal inference with time-to-event data. Any advice or resources you could share would be greatly appreciated.

Thank you again for your hard work on the package.

amit-sharma commented 4 days ago

Can you provide a motivating example or dataset on which you'd like to run DoWhy?

Supporting new kinds of data is significant work, so we can try to do this step by step: first, let's identify a popular, high-impact scenario where we can extend DoWhy, and then later we can support survival analysis fully.

lict99 commented 2 days ago

Survival data typically comprises two key components: time (the duration from the start of an observation period to an event occurrence, study end, loss of contact, or withdrawal) and status (indicating whether the event occurred or the observation was censored). I've found several popular datasets on Kaggle. Specifically:

  1. The Breast Cancer Survival Dataset contains a clear distinction between the patient's status (Patient_Status column) and time (interval between Date_of_Surgery and Date_of_Last_Visit). Other variables within this dataset can be used as potential predictors.
  2. The Cirrhosis Patient Survival Prediction dataset features status (Status column) and time (N_Days column), with other variables available for use in predictive modeling.
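
To make the time/status structure concrete, here is a minimal sketch of how the two columns could be derived for the first dataset. It uses only the Python standard library. The rows are made-up illustrative values, not records from the Kaggle data, and the helper name `to_survival_record` as well as the assumption that Patient_Status takes the values "Alive" and "Dead" are mine.

```python
from datetime import date

# Hypothetical rows that mimic the column names mentioned above.
# The values are invented for illustration only.
rows = [
    {"Date_of_Surgery": date(2020, 1, 1), "Date_of_Last_Visit": date(2020, 1, 31), "Patient_Status": "Alive"},
    {"Date_of_Surgery": date(2020, 3, 1), "Date_of_Last_Visit": date(2020, 6, 9), "Patient_Status": "Dead"},
    {"Date_of_Surgery": date(2020, 5, 1), "Date_of_Last_Visit": None, "Patient_Status": None},
]


def to_survival_record(row):
    """Return (time_in_days, event) for one row, or None if a field is missing."""
    start = row["Date_of_Surgery"]
    end = row["Date_of_Last_Visit"]
    status = row["Patient_Status"]
    if start is None or end is None or status is None:
        return None  # cannot compute a follow-up time without both dates and a status
    time_in_days = (end - start).days
    # Assumption: "Dead" marks the event; any other status is treated as censored.
    event = 1 if status == "Dead" else 0
    return time_in_days, event


# Keep only the rows for which a (time, event) pair could be computed.
survival = [rec for rec in map(to_survival_record, rows) if rec is not None]

for time_in_days, event in survival:
    label = "event" if event else "censored"
    print(f"{time_in_days} days, {label}")
```

The resulting (time, event) pairs are the usual input for survival methods. For the cirrhosis dataset the time is already given (N_Days), so only the status column would need to be mapped to an event indicator.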

Additionally, I've found a helpful introduction to survival analysis on the wiki, which provides a solid starting point for understanding this topic.

Thank you for your attention to this matter. 😊