Hello,
Thank you for your interest! Directly taking continuous data as input is not supported, but now that you mention it I'll work on updating the interface to accept them. In the meantime, continuous variables can be handled by first binning them and constructing a bin map as shown in the example notebook. You can choose those bins however you want, but two options would be to use quantiles to bin or to use threshold guessing as described in GOSDT. Here's a quick example code chunk for how to do that:
```python
import numpy as np
import pandas as pd
from gosdt.model.threshold_guess import compute_thresholds

# Settings for the gradient-boosted trees used to guess thresholds
n_est = 40
max_depth = 1

# Split features and label (label assumed to be the last column)
X_all = df_unbinned.iloc[:, :-1]
y = df_unbinned.iloc[:, -1]

# Binarize each continuous feature at the guessed thresholds
X_binned_full, thresholds, header, threshold_guess_time = compute_thresholds(
    X_all.copy(), y.copy(), n_est, max_depth
)
df = pd.concat((X_binned_full, y), axis=1)

# Map each original column name to its index
col_map = {}
for i, c in enumerate(df_unbinned.columns):
    col_map[c] = i

# For each original feature, collect the indices of the binned columns derived from it
bins = {}
for b in range(len(df_unbinned.columns) - 1):
    bins[b] = []
counter = 0
for h in header:
    cur_var = col_map[h.split('<=')[0]]
    bins[cur_var] = bins[cur_var] + [counter]
    counter += 1
bin_map = bins
```
From there, you can pass `bin_map` and `df` into the RID interface as shown in example.ipynb.
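If you would rather bin by quantiles, here is a minimal sketch along the same lines (the number of bins and the column-name format are just placeholders; the important part is that `bin_map` maps each original feature index to the indices of the binned columns derived from it):
```python
import pandas as pd

n_bins = 5  # number of quantile bins per feature; adjust as needed
X_all = df_unbinned.iloc[:, :-1]
y = df_unbinned.iloc[:, -1]

binned_cols, bin_map = {}, {}
for i, c in enumerate(X_all.columns):
    # Quantile-based thresholds for this feature (duplicate edges dropped for heavily tied data)
    edges = X_all[c].quantile([q / n_bins for q in range(1, n_bins)]).unique()
    start = len(binned_cols)
    for e in edges:
        binned_cols[f"{c}<={e}"] = (X_all[c] <= e).astype(int)
    bin_map[i] = list(range(start, len(binned_cols)))

df = pd.concat((pd.DataFrame(binned_cols), y), axis=1)
```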
Yes, those are the correct definitions for div and sub MR!
Hope this helps!
Hi Jon, thanks for that. I will: 1) give up trying to feed continuous variables into the code; 2) keep an eye on this repository for any updates.
I had a go at binning the data as you suggested. GOSDT proved very difficult to install, so for the time being I have resorted to creating the binned data set in R and then importing it. With 8 continuous variables binned into 5 bins each, I was able to get everything running.
I would like to try some larger examples. I see that I have 10 Python processes running at 100% CPU, along with some other Python processes that are running but essentially idle. I was running this with 20 cores. Is there a flag that will get it to use more cores?
```
%CPU TIME+ COMMAND
100.3 0:40.41 python
100.3 0:40.36 python
100.3 0:40.38 python
100.3 0:40.41 python
100.3 0:40.42 python
100.0 0:40.38 python
100.0 0:40.40 python
100.0 0:40.40 python
100.0 0:40.37 python
100.0 0:40.39 python
0.000 0:02.35 python
0.000 0:00.12 python
0.000 0:28.60 python
0.000 0:23.43 python
0.000 0:20.48 python
0.000 0:04.44 python
0.000 0:04.49 python
0.000 0:02.36 python
0.000 0:14.47 python
```
Thanks also for the clarification on div and sub MR. Bye, R
TL;DR: That's pretty strange -- can you run `os.cpu_count()` and see how many CPUs it observes?
Long answer: the code parallelizes in two primary steps. During development, we found that over-parallelizing the first step reduced performance, so there is a parameter called `max_par_for_gosdt` that throttles the number of jobs spawned for step 1. Its default value is 5, so I would have expected at most 5 cores to be in use if that were the constraint.
The number of jobs for the second step is based on the number of CPUs Python thinks are available, evaluated via `os.cpu_count()`. The only reason I can think of for it using 10 of the 20 CPUs is that this value is incorrect. Can you run `os.cpu_count()` in the same environment in which you're executing this code and see what it says?
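For reference, here is a rough sketch of how the job counts behave under the scheme described above -- this is an illustration, not the exact implementation:
```python
import os

available = os.cpu_count()        # what Python believes is available
step1_jobs = min(5, available)    # step 1 is throttled by max_par_for_gosdt (default 5)
step2_jobs = available            # step 2 scales with the CPUs os.cpu_count() reports
print(available, step1_jobs, step2_jobs)
```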
Hi Jon, I am using the Slurm workload manager, which may be the problem. I request resources like this:
> salloc --mem=100GB --nodes=1 --ntasks-per-node 20 -J interactive -t 5:00:00 srun --pty /bin/bash -l

`os.cpu_count()` always reports 64, which is the number of cores on the node, not the number I requested.
> ps -ef | grep dun280 | grep python | wc
68 546 3758
So I have lots of running Python processes, but only 10 of them are using much CPU. I tried changing `self.num_cpus = os.cpu_count()` to `self.num_cpus = int(os.environ['SLURM_NTASKS_PER_NODE'])` in rashomon_importance_distribution.py, but that only changed the total number of Python processes running -- there are still only 10 doing work.
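In case it is useful, here is a sketch of what I may try instead (this assumes Slurm actually constrains the CPU affinity of the job on this cluster, which I have not verified):
```python
import os

def allocated_cpus():
    """Best guess at the number of CPUs this job is actually allowed to use."""
    try:
        # The affinity mask reflects any core pinning done by Slurm/cgroups
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity is Linux-only; fall back to Slurm's count, then cpu_count()
        return int(os.environ.get('SLURM_CPUS_ON_NODE', os.cpu_count()))
```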
Sorry for the slow response! Can you share a complete minimum working example of what you're running? I also use Slurm, so I might be able to debug a bit on my side. I am currently at a bit of a loss on this issue.
Hi Jon,
I have been away for a few days, so just back on this now. I have a small example of 160 observations and 8 variables binned into a total of 41 columns. It is not too large -- could I email it to you rather than putting it all here?
Also, you said that you could modify the code to take continuous data directly as input. Would that allow the GOSDT trees to use continuous variables (i.e. all possible splits of the continuous variables), or would it bin the data internally? In the latter case, I might prefer to do my own binning so I have more control over it.
I am hoping to use this on data like your transcriptomic example, so it would be count data with say 100 variables (selected from a much larger number). If I bin each variable into 10 bins, I suspect that I am going to need all of the processors working on this.
Bye
Hello,
Yes, emailing the data works for me -- you can send it to jon.donnelly@duke.edu.
Re the binning, the code modification would simply be binning the data internally. I agree that, in general, it's better that you bin it yourself to better control how bins are found. I still plan to add the option for convenience, but it need not be a blocker here.
It makes sense that you'll need all processors working for your complete data to keep things tractable. So that it's easy to replicate the issue on my side, could you send me the code snippet in which you're calling RID on your data?
Thanks, Jon
Quick followup -- the parallelization used in RID is over the number of resampled datasets considered, as set by `num_resamples`. In the example code, the number of resamples is set to 10, meaning at most 10 compute-intensive processes will be active at a time. I believe this is the behavior you're observing -- if you change `num_resamples` to 20, do you see 20 processes at 100% compute?
In general, you want a larger number of resamples than given in the example. Theorem 2 of the paper provides a principled way to select the minimum number of resamples for your application, given a desired confidence that your estimate of RID is within a given distance of the true value for each quantile. Alternatively, as a rule of thumb, several hundred bootstrap resamples is usually sufficient -- in our case study, we used 738.
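As a rough sanity check (a sketch of the relationship, not the exact internals), the number of simultaneously busy workers should behave like:
```python
import os

def expected_busy_workers(num_resamples):
    # One compute-intensive process per bootstrap resample, capped by the CPUs Python can see
    return min(num_resamples, os.cpu_count())

print(expected_busy_workers(10))  # -> 10, matching the 10 busy processes you observed
print(expected_busy_workers(20))  # -> 20, provided at least 20 CPUs are visible
```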
Hi Jon,
Yes, I think I see that.
I was confused by the very large number of idle Python processes that fill up the top table. For example, with 10 cores, i.e.
salloc --mem=100GB --nodes=1 --ntasks-per-node 10 -J interactive -t 5:00:00 srun --pty /bin/bash -l
and
n_resamples=10,
max_par_for_gosdt=5
I see things like this.
```
6744 dun280 20 0 1198192 926120 5776 S 0.000 0.175 0:06.65 python
16742 dun280 20 0 1154892 883744 5788 S 0.000 0.167 0:07.68 python
16775 dun280 20 0 1241000 841920 6540 R 100.0 0.159 3:13.96 python
16745 dun280 20 0 1077256 804172 5748 S 0.000 0.152 0:08.54 python
16774 dun280 20 0 1169352 770412 6540 R 100.3 0.146 3:13.67 python
16743 dun280 20 0 967104 692168 5748 S 0.000 0.131 0:09.37 python
16772 dun280 20 0 932252 533440 6772 R 98.34 0.101 3:13.45 python
16771 dun280 20 0 884116 485188 6724 R 100.0 0.092 3:13.24 python
16779 dun280 20 0 703800 304844 6640 R 97.35 0.058 3:13.66 python
16289 dun280 20 0 736048 138036 11712 S 0.000 0.026 0:01.82 python
16784 dun280 20 0 533028 128928 2896 S 0.000 0.024 0:00.00 python
16832 dun280 20 0 533284 128916 2824 S 0.000 0.024 0:00.00 python
16833 dun280 20 0 533284 128916 2824 S 0.000 0.024 0:00.00 python
16834 dun280 20 0 533284 128916 2824 S 0.000 0.024 0:00.00 python
16831 dun280 20 0 533284 128912 2824 S 0.000 0.024 0:00.00 python
16829 dun280 20 0 533284 128908 2824 S 0.000 0.024 0:00.00 python
16830 dun280 20 0 533284 128908 2824 S 0.000 0.024 0:00.00 python
16827 dun280 20 0 533284 128904 2824 S 0.000 0.024 0:00.00 python
16828 dun280 20 0 533284 128904 2824 S 0.000 0.024 0:00.00 python
16820 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16821 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16822 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16823 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16824 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16825 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16826 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16819 dun280 20 0 533284 128896 2824 S 0.000 0.024 0:00.00 python
16817 dun280 20 0 533284 128892 2824 S 0.000 0.024 0:00.00 python
16818 dun280 20 0 533284 128892 2824 S 0.000 0.024 0:00.00 python
16807 dun280 20 0 533284 128888 2824 S 0.000 0.024 0:00.00 python
```
However, apart from this anomaly, I think I am seeing the correct behavior, so I think this issue is fixed. Bye, R
Ok, thank you for working through this! I'll mark this as closed, in that case.
Hi Jon,
Very interesting paper! A couple of (probably simple) questions:
1) I see from the simulations (Chen and Friedman) that you give results for problems with continuous covariates. However, the code calls for a one-hot encoding of a factor variable. I have not managed to get it working with continuous data (I tried). Should it work? Is this feature still coming? Am I doing it wrong?
2) I can't find an explanation of 'sub_mr' and 'div_mr' in the paper. Are these just the subtract and divide versions of model reliance from the "All Models are Wrong ...." paper?
bye and thanks,