py-why / causal-learn

Causal Discovery in Python. It also includes (conditional) independence tests and score functions.
https://causal-learn.readthedocs.io/en/latest/
MIT License

Handling Data with Interventions #181

Open chrisquatjr opened 5 months ago

chrisquatjr commented 5 months ago

Hi,

Thank you for the excellent repository! Exploring the discovery tools here has been exciting. I was using the data from Sachs et al. 2005 before realizing it is already bundled as an internal dataset here. I am running into some confusion, however, and thought I would ask here.

For clarity's sake, here is how I am loading in the internal dataset:

import pandas as pd
from causallearn.utils.Dataset import load_dataset

data, labels = load_dataset(dataset_name="sachs")
df_internal = pd.DataFrame(data=data, columns=labels)

From what I can tell, the internal dataset is some subset of the 14 Excel tables one retrieves by downloading the data directly from the paper. First, I noticed the internal dataset has exactly the same columns as all 14 of the Excel tables I have from the paper. The rows, by contrast, differ substantially: there are on the order of 11,000 rows across all 14 tables, but only around 7,000 rows in the internal dataset. (I also confirmed that the first 5 rows of the internal dataset match those of the 1. cd3cd28.xls file exactly, so it does not look like any normalization/processing has altered the values themselves.)

Taken together, it seems the internal dataset is a row-wise concatenation of a subset of the original Sachs tables. Is this a correct assessment? If so, which tables are included, and why aren't all conditions included?

Please let me know if I have simply missed some tutorial or documentation somewhere. Any assistance would be greatly appreciated.

Overall, my goal is to reproduce the graph seen in Figure 3A. I know the authors used a simulated annealing approach, but I want to try more current approaches.

jdramsey commented 5 months ago

Hi, Yujia asked if I would give you an answer. This is the data we used for this tech report:

https://arxiv.org/abs/1805.03108

There is another data file here with the experimental variables given for each row:

https://github.com/cmu-phil/example-causal-datasets/blob/main/real/sachs/data/sachs.2005.with.jittered.experimental.continuous.txt

The experimental variables are actually all 0/1; they have simply been jittered with a small amount of Gaussian noise so that matrix inversions involving them do not raise singularity exceptions. You can recover the 0/1 values by thresholding at 0.5.
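A minimal sketch of that recovery step, assuming the jittered columns are already loaded as a NumPy array (the values below are made up for illustration):

```python
import numpy as np

# Jittered intervention indicators: values near 0 or near 1
jittered = np.array([0.013, 0.992, -0.007, 1.004, 0.498])

# Recover the underlying 0/1 values by thresholding at 0.5
recovered = (jittered > 0.5).astype(int)
print(recovered)  # [0 1 0 1 0]
```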

I think the issue with the number of rows is that one of the datasets in the Sachs paper was not used for their analysis, so we omitted it as well. If you can't figure out which one that was, let me know; I'll go through my notes. I think it was...10?

Let me know if that helps.

Best,

Joe


chrisquatjr commented 5 months ago

Thank you for the great explanation! I have been following the paper you suggested and have been able to follow everything using Tetrad's GUI, which I switched to since I do not see an implementation of FASK in this library (let me know if I simply missed it). I followed the paper up to this point:

"After running FASK, we deleted the intervention variables from the resulting graph keeping only the graph over the measured variables."

I am not sure how to do this in Tetrad. I do not see anything in the manual about deleting or removing variables in this way. This format also does not appear to conform to Tetrad's "status and value" convention. If I adjust the data to conform to this format, would Tetrad immediately know to not include these variables in the graph output?
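For what it's worth, when working programmatically rather than in the GUI, the post-processing step the paper describes amounts to simple node removal on the learned graph. A hypothetical illustration using networkx (not Tetrad's API; the edges and variable names are made up):

```python
import networkx as nx

# Hypothetical learned graph over measured + intervention variables
g = nx.DiGraph()
g.add_edges_from([
    ("cd3_cd28", "raf"),   # intervention -> measured
    ("raf", "mek"),        # measured -> measured
    ("mek", "erk"),
])

# Delete the intervention variables, keeping only the graph over
# the measured variables (edges touching them are dropped too)
intervention_vars = ["cd3_cd28"]
g.remove_nodes_from(intervention_vars)

print(sorted(g.nodes()))  # ['erk', 'mek', 'raf']
print(list(g.edges()))    # [('raf', 'mek'), ('mek', 'erk')]
```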

jdramsey commented 4 months ago

Oh my gosh, I missed your message! Let me think how to respond.

jdramsey commented 4 months ago

Ah I see. Here's the data:

https://github.com/cmu-phil/example-causal-datasets/blob/main/real/sachs/data/sachs.2005.logxplus10.jittered.eperimental.continuous.txt

The intervention variables are all the variables after 'jnk'; these are experimental variables that have been jittered with a small amount of Gaussian noise.
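A hedged sketch of splitting the columns under that convention (the column names here are hypothetical placeholders; the actual names are in the linked file):

```python
import pandas as pd

# Toy frame standing in for the linked data file: measured
# variables up through 'jnk', intervention indicators after it
df = pd.DataFrame(columns=["raf", "mek", "jnk", "cd3_cd28", "icam2"])

# Everything after 'jnk' is an intervention variable
cut = df.columns.get_loc("jnk") + 1
measured = df.columns[:cut]
interventions = df.columns[cut:]

print(list(measured))       # ['raf', 'mek', 'jnk']
print(list(interventions))  # ['cd3_cd28', 'icam2']
```

Dropping `interventions` with `df.drop(columns=interventions)` would then leave only the measured variables.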