huawei-noah / trustworthyAI

Trustworthy AI related projects
Apache License 2.0

Issue with generating synthetic dataset #95

Closed MHassaanButt closed 1 year ago

MHassaanButt commented 1 year ago

I have used the synthetic data generation script provided in this repository. While generating synthetic data, I have a few queries and concerns.

  1. Firstly, I would like to know whether the synthetic data generation script first draws a causal graph and then generates data from it, or whether it generates data randomly and then connects the variables according to a ground truth?

  2. Secondly, when I generated data with the default parameters and the LiNGAM noise type, the evaluation metrics (F1-score, precision, and recall) for the PC, ICALiNGAM, and DirectLiNGAM models were zero or very poor. What could be the reason for this?

  3. Finally, is there any way to generate data without noise or to set noise by ourselves?

I would appreciate it if someone from the community could help me resolve these queries or provide suggestions on generating synthetic data effectively using this script.

shaido987 commented 1 year ago

Hello,

What code are you using exactly to generate the synthetic data?

  1. Yes, a DAG is generated first, and the data generation then uses it. For example:
    from castle.datasets import DAG, IIDSimulation
    # Draw a weighted random DAG, then simulate data from it.
    weighted_random_dag = DAG.erdos_renyi(n_nodes=10, n_edges=20, weight_range=(0.5, 2.0), seed=1)
    dataset = IIDSimulation(W=weighted_random_dag, n=2000, method='linear', sem_type='gauss')
    true_dag, X = dataset.B, dataset.X
  2. Could you add the code you are using here for reproduction?
  3. It is possible to set noise_scale for IIDSimulation, which lets you effectively remove the noise, but there is no way to specify a noise type other than the currently provided ones, i.e.:
    • gauss, exp, gumbel, uniform, logistic (linear);
    • mlp, mim, gp, gp-add, quadratic (nonlinear).
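
To illustrate the "graph first, data second" order from point 1, here is a minimal standard-library sketch. It is not gcastle's implementation: the helper names random_weighted_dag and simulate_linear_sem, the edge_prob parameter, and the use of random.gauss are assumptions made only for this example. It builds a weighted DAG, then samples each variable from its parents plus Gaussian noise scaled by noise_scale.

```python
import random


def random_weighted_dag(n_nodes, edge_prob, weight_range=(0.5, 2.0), seed=0):
    """Hypothetical helper: weight matrix W where W[i][j] != 0 means i -> j.

    Edges only run from a lower to a higher index, so the graph is
    acyclic by construction.
    """
    rng = random.Random(seed)
    low, high = weight_range
    W = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < edge_prob:
                sign = rng.choice((-1.0, 1.0))
                W[i][j] = sign * rng.uniform(low, high)
    return W


def simulate_linear_sem(W, n_samples, noise_scale=1.0, seed=0):
    """Hypothetical helper: sample x_j = sum_i W[i][j] * x_i + noise_j."""
    rng = random.Random(seed)
    d = len(W)
    X = []
    for _ in range(n_samples):
        x = [0.0] * d
        for j in range(d):  # index order is a topological order here
            from_parents = sum(W[i][j] * x[i] for i in range(j))
            x[j] = from_parents + rng.gauss(0.0, noise_scale)
        X.append(x)
    return X


# Step 1: draw the graph. Step 2: generate data that follows it.
W = random_weighted_dag(n_nodes=5, edge_prob=0.5, seed=1)
X = simulate_linear_sem(W, n_samples=200, noise_scale=1.0, seed=1)

# Binary ground-truth adjacency matrix derived from the weights.
true_dag = [[int(w != 0.0) for w in row] for row in W]
```

Note that in this sketch a noise scale of 0 makes every sample identically zero, because a linear model without noise has no other source of variation.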
MHassaanButt commented 1 year ago

Thank you so much,

Yes, I'm using the same code and I found the noise_scale variable. Thanks!

MHassaanButt commented 1 year ago

@shaido987 could you please confirm the minimum and maximum values of the noise scale that we can pass to IIDSimulation?

adj_matrix = DAG.scale_free(n_nodes=n_nodes, n_edges=n_edges, seed=SEED)

dataset = IIDSimulation(W=adj_matrix, n=n, method=method, sem_type=sem_type, noise_scale=noise_scale)

This is the code I'm using to generate adj_matrix (GT) and dataset.

shaido987 commented 1 year ago

@MHassaanButt You can use any non-negative float as input to noise_scale. It is used as the standard deviation for np.random.normal inside the code:

https://github.com/huawei-noah/trustworthyAI/blob/717fbfe01c1709eeb141dbb4914c44d5873274ba/gcastle/castle/datasets/simulator.py#L368-L370
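
As a rough illustration of what "used as the standard deviation" means, the sketch below uses Python's random.gauss as a stand-in for np.random.normal. The sample_noise helper and its non-negativity check are assumptions for this example and are not part of gcastle. The spread of the samples tracks the scale, and a scale of 0 gives noise that is exactly zero.

```python
import random
import statistics


def sample_noise(noise_scale, n, seed=0):
    """Hypothetical helper: draw n zero-mean Gaussian noise values."""
    if noise_scale < 0:
        # A standard deviation cannot be negative.
        raise ValueError("noise_scale must be non-negative")
    rng = random.Random(seed)
    return [rng.gauss(0.0, noise_scale) for _ in range(n)]


# Print the empirical standard deviation for a few scales.
for scale in (0.0, 0.5, 2.0):
    z = sample_noise(scale, n=10_000)
    print(scale, round(statistics.pstdev(z), 2))
```

There is no upper bound in this sketch. A larger scale simply produces noisier samples, which would make any structure in the data harder to recover.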

MHassaanButt commented 1 year ago

Thank you for the clarification.