nasaharvest / dora

Domain-agnostic Outlier Ranking Algorithms (DORA) - SMD cross-divisional use case demonstration of AI/ML
MIT License

Add ability to generate causal graphs #68

Open stevenlujpl opened 2 years ago

stevenlujpl commented 2 years ago

Hi @hannah-rae, @urebbapr , @wkiri , @emhuff ,

I've checked the initial implementation of causal graphs in the causal-graph branch.

Example outputs of causal graphs

The example causal graphs below were generated using sample_data/earth_fieldsamples/points_to_fit.csv (data_to_fit) and sample_data/earth_fieldsamples/kenya_points_to_predict.csv (data_to_score). Please note that I filtered out 981 data points with missing values from the sample_data/earth_fieldsamples/points_to_fit.csv file.

  1. Cluster 0 causal graph causal_graph_cluster_0

  2. Cluster 1 causal graph causal_graph_cluster_1

  3. Cluster 2 causal graph causal_graph_cluster_2

  4. Cluster 3 causal graph causal_graph_cluster_3

  5. Cluster 4 causal graph causal_graph_cluster_4

  6. SOM clustering results SOM-demud.csv
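
For reference, the missing-value filtering mentioned above could look roughly like the sketch below. This is only an illustration using the standard library, not the code in the causal-graph branch. The thread does not say how missing values are encoded in points_to_fit.csv, so treating empty cells and NaN-like tokens as missing is an assumption, and the column names and values in the sample are made up.

```python
import csv
import io
import math


def is_missing(value):
    """Treat empty strings and NaN-like tokens as missing (an assumption)."""
    text = value.strip().lower()
    if text in ("", "nan", "na", "null", "none"):
        return True
    try:
        return math.isnan(float(text))
    except ValueError:
        return False  # non-numeric text such as a label is not missing


def drop_rows_with_missing(csv_file):
    """Return (header, kept_rows, n_dropped) for an open CSV file object."""
    reader = csv.reader(csv_file)
    header = next(reader)
    kept, dropped = [], 0
    for row in reader:
        # A row shorter than the header also counts as having missing values.
        if len(row) < len(header) or any(is_missing(v) for v in row):
            dropped += 1
        else:
            kept.append(row)
    return header, kept, dropped


# Tiny in-memory stand-in for points_to_fit.csv (hypothetical columns and values).
sample = io.StringIO("id,b1,b2\n1,0.2,0.4\n2,,0.5\n3,0.1,nan\n4,0.3,0.6\n")
header, rows, n_dropped = drop_rows_with_missing(sample)
print(f"kept {len(rows)} rows, dropped {n_dropped}")
```

Reporting the number of dropped rows makes it easy to check the count against the 981 filtered points mentioned above.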

Implementation summary

Causal graphs are currently implemented together with the kmeans or SOM clustering algorithm in the Results Organization module. This is how causal graphs are generated in the DES codebase, so I did the same for the initial implementation in DORA. However, I don't think clustering algorithms are necessary to generate causal graphs. It seems to me that we could generate causal graphs for individual data points instead of for a group of data points. If generating causal graphs for individual data points is desired, I can add this ability to DORA. Please let me know what you think.

There is one issue that I don't know how to resolve yet. Causal graphs are generated using classes/functions in the fges-py GitHub repository, but this repository isn't installable (the authors don't provide a setup.py script). This isn't a big problem for using causal graphs on UMD/JPL machines: we can manually git clone the repository and do something like sys.path.append("/PATH/TO/fges-py/") to import the classes/functions we need. However, it will become a problem when we publish the DORA codebase to PyPI as a pip-installable package. I will need to think more about how to resolve this. Please let me know if you have any suggestions.
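
One possible interim approach, sketched below, is to read the location of the fges-py clone from an environment variable rather than hard-coding it, and to disable causal graphs when the clone is not found. This is only a sketch: the FGES_PY_PATH variable, the add_fges_to_path helper, and the CAUSAL_GRAPHS_AVAILABLE flag are hypothetical names, not part of DORA or fges-py.

```python
import os
import sys


def add_fges_to_path(fges_dir=None, env_var="FGES_PY_PATH"):
    """Put a local fges-py clone on sys.path; return its path, or None if not found.

    FGES_PY_PATH is a hypothetical environment variable name, not part of DORA.
    """
    candidate = fges_dir or os.environ.get(env_var)
    if not candidate:
        return None
    candidate = os.path.abspath(os.path.expanduser(candidate))
    if not os.path.isdir(candidate):
        return None
    if candidate not in sys.path:
        sys.path.append(candidate)
    return candidate


# Causal graphs stay optional: the rest of the pipeline can check this flag.
fges_path = add_fges_to_path()
CAUSAL_GRAPHS_AVAILABLE = fges_path is not None
if not CAUSAL_GRAPHS_AVAILABLE:
    print("fges-py not found; causal graphs are disabled")
```

This does not solve the PyPI packaging problem, but it keeps machine-specific paths out of the source file.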

Use causal graphs

For now, causal graphs must be generated with the kmeans or SOM clustering algorithm. Please see the following example configs for the Results Organization module:

  1. Use causal graphs with the kmeans clustering algorithm
results: {
    kmeans: {
        n_clusters: 5,
        causal_graph: True
    }
}
  2. Use causal graphs with the SOM clustering algorithm
results: {
    som: {
        n_clusters: 5,
        causal_graph: True
    }
}

One causal graph is generated per cluster, and the causal graphs are saved in the directory defined by the out_dir option in the config file.
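
To illustrate the "one graph per cluster" behavior, here is a minimal sketch of how cluster labels could be mapped to per-cluster output files under out_dir. The helper names and the .png extension are assumptions for illustration; only the causal_graph_cluster_N naming comes from the example outputs above.

```python
import os
from collections import defaultdict


def group_by_cluster(labels):
    """Map each cluster label to the indices of the data points assigned to it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return dict(groups)


def causal_graph_paths(labels, out_dir, extension=".png"):
    """Return one output path per cluster, e.g. <out_dir>/causal_graph_cluster_0.png."""
    return {
        label: os.path.join(out_dir, f"causal_graph_cluster_{label}{extension}")
        for label in sorted(group_by_cluster(labels))
    }


# Hypothetical labels for eight data points assigned to three clusters.
labels = [0, 2, 1, 0, 2, 2, 1, 0]
groups = group_by_cluster(labels)
for label, path in causal_graph_paths(labels, "out").items():
    print(label, groups[label], path)
```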

stevenlujpl commented 2 years ago

Please note that I am aware of the build failures (code formatting issues, please see the screenshot below) caused by the implementation of the causal graph. I can't fix these code formatting issues because I have to use sys.path.append('/PATH/TO/fges-py') so that I can import the classes/functions needed for causal graphs. I will come up with something to replace sys.path.append('/PATH/TO/fges-py') and fix the code formatting issues.

Screen Shot 2021-10-08 at 3 33 05 PM

stevenlujpl commented 2 years ago

Below is a temporary solution for installing DORA with causal graphs (for @hannah-rae to install it on the UMD machine).

  1. Clone the fges-py GitHub repository (https://github.com/eberharf/fges-py)
git clone https://github.com/eberharf/fges-py.git
  2. Pull the latest updates from the causal-graph branch of the DORA repository
git pull origin causal-graph
  3. Replace the path in sys.path.append() with the path to the fges-py repository on the UMD machine.

https://github.com/nasaharvest/dora/blob/0b8b67546a6ae4a1fba335de5851752c6f788a3f/dora_exp_pipeline/dora_results_organization.py#L8-L11

  4. Go to the root directory of the DORA repository, and run pip install . (please note the . at the end).

stevenlujpl commented 2 years ago

I changed the graph layout to be circular. With the circular layout, at least we can see which nodes are connected. Please take a look at the following examples, and let me know what you think. Thanks.
causal_graph_cluster_0 causal_graph_cluster_1 causal_graph_cluster_2 causal_graph_cluster_3 causal_graph_cluster_4
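
For anyone curious what a circular layout amounts to, the sketch below places nodes evenly on a circle so that edges cross the interior and stay visible. It is a generic illustration with the standard library, not the plotting code in the causal-graph branch, and the node names are made up.

```python
import math


def circular_layout(nodes, radius=1.0):
    """Place nodes evenly on a circle; returns {node: (x, y)}."""
    n = len(nodes)
    positions = {}
    for i, node in enumerate(nodes):
        # Node i sits at angle 2*pi*i/n, starting from the positive x-axis.
        angle = 2.0 * math.pi * i / n
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


# Hypothetical node names: the "cluster" node plus three feature nodes.
nodes = ["cluster", "feature_a", "feature_b", "feature_c"]
for node, (x, y) in circular_layout(nodes).items():
    print(f"{node}: ({x:.2f}, {y:.2f})")
```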

wkiri commented 2 years ago

@stevenlujpl I think these look great.

If you have time for tiny updates, I suggest (1) highlighting (e.g. in red) any lines that connect to the "cluster" node (since they are of most immediate interest and I think the others are constant for all clusters), (2) labeling "cluster" as "cluster X" to show the cluster index, and (3) using a light color to fill the nodes (instead of dark blue) so that the black text on top is easier to read.

stevenlujpl commented 2 years ago

@wkiri Thanks for the comments. I've incorporated them into the code. I also added the sparsity parameter to the config file and seeded the SOM clustering algorithm. Please see the new graphs below (please note that the causal relations are different from the examples shown in the post above because the seed parameter used is different).
causal_graph_cluster_4
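
The three visualization suggestions (red edges touching the "cluster" node, a "cluster X" label, and a light node fill) can be sketched as a small styling helper like the one below. This is an illustration only: the style_graph function, the "lightblue" fill, and the edge names are assumptions, and the thread does not say which plotting library the branch uses.

```python
def style_graph(edges, cluster_index, cluster_node="cluster"):
    """Compute display attributes for a causal graph drawn for one cluster."""
    # Collect nodes in first-seen order from the edge list.
    nodes = []
    for edge in edges:
        for node in edge:
            if node not in nodes:
                nodes.append(node)
    return {
        # (1) Highlight edges that touch the cluster node.
        "edge_colors": ["red" if cluster_node in edge else "black" for edge in edges],
        # (2) Show the cluster index in the cluster node's label.
        "labels": {
            node: f"cluster {cluster_index}" if node == cluster_node else node
            for node in nodes
        },
        # (3) Light fill so black text stays readable (color choice is an assumption).
        "node_fill": "lightblue",
    }


# Hypothetical edges for the graph of cluster 4.
edges = [("cluster", "feature_a"), ("feature_b", "feature_c"), ("feature_c", "cluster")]
style = style_graph(edges, cluster_index=4)
print(style["edge_colors"])
print(style["labels"]["cluster"])
```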

wkiri commented 2 years ago

@stevenlujpl The updated visualization looks fantastic!

hannah-rae commented 2 years ago

@stevenlujpl Is this ready to be closed now?

stevenlujpl commented 2 years ago

@hannah-rae, not yet. Currently, all the updates for causal graphs are in the causal-graph branch. I am waiting to hear from Eric about whether the Caltech professor who developed fges-py will create a setup.py script to package the repository. Below are the items I need to complete before we can close this issue: