Hello,
Thank you for your interest! Directly taking continuous data as input is not supported, but now that you mention it I'll work on updating the interface to accept them. In the meantime, continuous variables can be handled by first binning them and constructing a bin map as shown in the example notebook. You can choose those bins however you want, but two options would be to use quantiles to bin or to use threshold guessing as described in GOSDT. Here's a quick example code chunk for how to do that:
```python
import numpy as np
import pandas as pd
from gosdt.model.threshold_guess import compute_thresholds

# Settings for the gradient-boosted trees used to guess thresholds
n_est = 40
max_depth = 1

# Split features and label (label assumed to be the last column)
X_all = df_unbinned.iloc[:, :-1]
y = df_unbinned.iloc[:, -1]

# Binarize each continuous feature at the guessed thresholds
X_binned_full, thresholds, header, threshold_guess_time = compute_thresholds(
    X_all.copy(), y.copy(), n_est, max_depth
)
df = pd.concat((X_binned_full, y), axis=1)

# Map each original column name to its index
col_map = {}
for i, c in enumerate(df_unbinned.columns):
    col_map[c] = i

# For each original feature, collect the indices of the binned columns derived from it
bins = {}
for b in range(len(df_unbinned.columns) - 1):
    bins[b] = []
counter = 0
for h in header:
    cur_var = col_map[h.split('<=')[0]]
    bins[cur_var] = bins[cur_var] + [counter]
    counter += 1
bin_map = bins
```
From there, you can pass `bin_map` and `df` into the RID interface as shown in example.ipynb.
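If you would rather bin by quantiles, here is a minimal sketch along the same lines (the number of bins and the column-name format are just placeholders; the important part is that `bin_map` maps each original feature index to the indices of the binned columns derived from it):
```python
import pandas as pd

n_bins = 5  # number of quantile bins per feature; adjust as needed
X_all = df_unbinned.iloc[:, :-1]
y = df_unbinned.iloc[:, -1]

binned_cols, bin_map = {}, {}
for i, c in enumerate(X_all.columns):
    # Quantile-based thresholds for this feature (duplicate edges dropped for heavily tied data)
    edges = X_all[c].quantile([q / n_bins for q in range(1, n_bins)]).unique()
    start = len(binned_cols)
    for e in edges:
        binned_cols[f"{c}<={e}"] = (X_all[c] <= e).astype(int)
    bin_map[i] = list(range(start, len(binned_cols)))

df = pd.concat((pd.DataFrame(binned_cols), y), axis=1)
```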
Yes, those are the correct definitions for div and sub MR!
Hope this helps!
Hi Jon, thanks for that. I will: 1) give up trying to feed continuous variables into the code; 2) keep an eye on this repository for any updates.
I had a go at binning the data as you suggested. GOSDT proved very difficult to install, so for the time being I have resorted to creating the binned data set in R and then importing it. With 8 continuous variables binned into 5 bins each, I was able to get everything running.
I would like to try some larger examples. I see that I have 10 Python processes running at 100% CPU, along with some other Python processes that are running but essentially idle. I was running this with 20 cores. Is there a flag that will get it to use more cores?
```
%CPU TIME+ COMMAND
100.3 0:40.41 python
100.3 0:40.36 python
100.3 0:40.38 python
100.3 0:40.41 python
100.3 0:40.42 python
100.0 0:40.38 python
100.0 0:40.40 python
100.0 0:40.40 python
100.0 0:40.37 python
100.0 0:40.39 python
0.000 0:02.35 python
0.000 0:00.12 python
0.000 0:28.60 python
0.000 0:23.43 python
0.000 0:20.48 python
0.000 0:04.44 python
0.000 0:04.49 python
0.000 0:02.36 python
0.000 0:14.47 python
```
Thanks also for the clarification on div and sub MR. Bye, R
TL;DR: That's pretty strange -- can you run `os.cpu_count()` and see how many CPUs it observes?
Long answer: the code parallelizes in two primary steps. During development, we found that over-parallelizing the first step reduced performance, so there is a parameter called `max_par_for_gosdt` that throttles the number of jobs spawned for step 1. Its default value is 5, so I would have expected at most 5 cores to be in use if that were the constraint.
The number of jobs for the second step is based on the number of CPUs Python thinks are available, evaluated via `os.cpu_count()`. The only reason I can think of for it using 10 of the 20 CPUs is that this value is incorrect. Can you run `os.cpu_count()` in the same environment in which you're executing this code and see what it says?
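For reference, here is a rough sketch of how the job counts behave under the scheme described above -- this is an illustration, not the exact implementation:
```python
import os

available = os.cpu_count()        # what Python believes is available
step1_jobs = min(5, available)    # step 1 is throttled by max_par_for_gosdt (default 5)
step2_jobs = available            # step 2 scales with the CPUs os.cpu_count() reports
print(available, step1_jobs, step2_jobs)
```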
Hi Jon, I am using the Slurm workload manager, which may be the problem. I request resources like this:
> salloc --mem=100GB --nodes=1 --ntasks-per-node 20 -J interactive -t 5:00:00 srun --pty /bin/bash -l

`os.cpu_count()` always reports 64, which is the number of cores on the node, not the number I requested.
> ps -ef | grep dun280 | grep python | wc
68 546 3758
So I have lots of running Python processes, but only 10 of them are using much CPU. I tried changing `self.num_cpus = os.cpu_count()` to `self.num_cpus = int(os.environ['SLURM_NTASKS_PER_NODE'])` in rashomon_importance_distribution.py, but that only changed the total number of Python processes running -- there are still only 10 doing work.
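In case it is useful, here is a sketch of what I may try instead (this assumes Slurm actually constrains the CPU affinity of the job on this cluster, which I have not verified):
```python
import os

def allocated_cpus():
    """Best guess at the number of CPUs this job is actually allowed to use."""
    try:
        # The affinity mask reflects any core pinning done by Slurm/cgroups
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity is Linux-only; fall back to Slurm's count, then cpu_count()
        return int(os.environ.get('SLURM_CPUS_ON_NODE', os.cpu_count()))
```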
Sorry for the slow response! Can you share a complete minimum working example of what you're running? I also use Slurm, so I might be able to debug a bit on my side. I am currently at a bit of a loss on this issue.
Hi Jon,
I have been away for a few days, so just back on this now. I have a small example of 160 observations and 8 variables binned into a total of 41 columns. It is not too large -- could I email it to you rather than putting it all here?
Also, you said that you could modify the code to take continuous data directly as input. Would that allow the GOSDT trees to use continuous variables (i.e. all possible splits of the continuous variables), or would it bin the data internally? In the latter case, I might prefer to do my own binning so I have more control over it.
I am hoping to use this on data like your transcriptomic example, so it would be count data with say 100 variables (selected from a much larger number). If I bin each variable into 10 bins, I suspect that I am going to need all of the processors working on this.
Bye
Hello,
Yes, emailing the data works for me -- you can send it to jon.donnelly@duke.edu.
Re the binning, the code modification would simply be binning the data internally. I agree that, in general, it's better that you bin it yourself to better control how bins are found. I still plan to add the option for convenience, but it need not be a blocker here.
It makes sense that you'll need all processors working for your complete data to keep things tractable. So that it's easy to replicate the issue on my side, could you send me the code snippet in which you're calling RID on your data?
Thanks, Jon
Quick followup -- the parallelization used in RID is over the number of resampled datasets considered, as set by `num_resamples`. In the example code, the number of resamples is set to 10, meaning at most 10 compute-intensive processes will be active at a time. I believe this is the behavior you're observing -- if you change `num_resamples` to 20, do you see 20 processes at 100% compute?
In general, you want a larger number of resamples than given in the example. Theorem 2 of the paper provides a principled way to select the minimum number of resamples for your application, given a desired confidence that your estimate of RID is within a given distance of the true value for each quantile. Alternatively, as a rule of thumb, several hundred bootstrap resamples is usually sufficient -- in our case study, we used 738.
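As a rough sanity check (a sketch of the relationship, not the exact internals), the number of simultaneously busy workers should behave like:
```python
import os

def expected_busy_workers(num_resamples):
    # One compute-intensive process per bootstrap resample, capped by the CPUs Python can see
    return min(num_resamples, os.cpu_count())

print(expected_busy_workers(10))  # -> 10, matching the 10 busy processes you observed
print(expected_busy_workers(20))  # -> 20, provided at least 20 CPUs are visible
```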
Hi Jon,
Yes, I think I see that.
I was confused by the very large number of idle Python processes that fill up the top table. For example, with 10 cores, i.e.
salloc --mem=100GB --nodes=1 --ntasks-per-node 10 -J interactive -t 5:00:00 srun --pty /bin/bash -l
and
n_resamples=10,
max_par_for_gosdt=5
I see things like this.
```
6744 dun280 20 0 1198192 926120 5776 S 0.000 0.175 0:06.65 python
16742 dun280 20 0 1154892 883744 5788 S 0.000 0.167 0:07.68 python
16775 dun280 20 0 1241000 841920 6540 R 100.0 0.159 3:13.96 python
16745 dun280 20 0 1077256 804172 5748 S 0.000 0.152 0:08.54 python
16774 dun280 20 0 1169352 770412 6540 R 100.3 0.146 3:13.67 python
16743 dun280 20 0 967104 692168 5748 S 0.000 0.131 0:09.37 python
16772 dun280 20 0 932252 533440 6772 R 98.34 0.101 3:13.45 python
16771 dun280 20 0 884116 485188 6724 R 100.0 0.092 3:13.24 python
16779 dun280 20 0 703800 304844 6640 R 97.35 0.058 3:13.66 python
16289 dun280 20 0 736048 138036 11712 S 0.000 0.026 0:01.82 python
16784 dun280 20 0 533028 128928 2896 S 0.000 0.024 0:00.00 python
16832 dun280 20 0 533284 128916 2824 S 0.000 0.024 0:00.00 python
16833 dun280 20 0 533284 128916 2824 S 0.000 0.024 0:00.00 python
16834 dun280 20 0 533284 128916 2824 S 0.000 0.024 0:00.00 python
16831 dun280 20 0 533284 128912 2824 S 0.000 0.024 0:00.00 python
16829 dun280 20 0 533284 128908 2824 S 0.000 0.024 0:00.00 python
16830 dun280 20 0 533284 128908 2824 S 0.000 0.024 0:00.00 python
16827 dun280 20 0 533284 128904 2824 S 0.000 0.024 0:00.00 python
16828 dun280 20 0 533284 128904 2824 S 0.000 0.024 0:00.00 python
16820 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16821 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16822 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16823 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16824 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16825 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16826 dun280 20 0 533284 128900 2824 S 0.000 0.024 0:00.00 python
16819 dun280 20 0 533284 128896 2824 S 0.000 0.024 0:00.00 python
16817 dun280 20 0 533284 128892 2824 S 0.000 0.024 0:00.00 python
16818 dun280 20 0 533284 128892 2824 S 0.000 0.024 0:00.00 python
16807 dun280 20 0 533284 128888 2824 S 0.000 0.024 0:00.00 python
```
However, apart from this anomaly, I think I am seeing the correct behavior, so I think this issue is fixed. Bye, R
Ok, thank you for working through this! I'll mark this as closed, in that case.
Hi Jon,
Very interesting paper! A couple of (probably simple) questions:
1) I see from the simulations (Chen and Friedman) that you give results for problems with continuous covariates. However, the code calls for a one-hot encoding of a factor variable. I have not managed to get it working with continuous data (I tried). Should it work? Is this feature still coming? Am I doing it wrong?
2) I can't find an explanation of 'sub_mr' and 'div_mr' in the paper. Are these just the subtract and divide versions of model reliance from the "All Models are Wrong ...." paper?
bye and thanks,