pcarbo / varbvs

Large-scale Bayesian variable selection for R and MATLAB.
http://pcarbo.github.io/varbvs
GNU General Public License v3.0

error when running the mcmc method (adapted from example 1) #6

Open jcrdubois opened 9 years ago

jcrdubois commented 9 years ago

I have been trying out your software for feature selection on a neuroimaging dataset. The input data has dimensions 42 x 300, i.e., 42 samples and 300 predictors (eventually I want to work with more predictors, but I am currently testing the pipeline).

I adapted the code you provided in example1.m to my problem. The code I use is encapsulated in a function, as follows:

function POST = runbayesianfeatsel(y,X,method)
% These two parameters specify the Beta prior on the proportion of variables
% (SNPs) that are included in the linear model of Y.
a = 0.02;
b = 1; 
% This parameter specifies the prior on the variance of the regression
% coefficients (sa). For more information on this prior, see VARSIMBVS
% and BVSMCMC.
c = 0.02;
% Candidate values of the variance of the residual (sigma), the prior
% variance of the regression coefficients (sa), and logarithm of the prior
% inclusion probability (log10q).
sigma  = (8:13)';
sa     = (0.025:0.025:0.4)';
log10q = (-2.5:0.25:-1)';

% Set the random number generator seed.
rng('shuffle')

switch method

    case 'variational'
        % COMPUTE VARIATIONAL ESTIMATES.
        fprintf('Computing variational estimates.\n');
        [sigma sa log10q] = ndgrid(sigma,sa,log10q);
        [w alpha mu] = varsimbvs(X,y,sigma,sa,log10q,a,b,c);
        POST = alpha;
    case 'mcmc'
        % COMPUTE MCMC ESTIMATES.
        ns = 2e4;  % Length of Markov chain.
        m0 = 100;  % Upper limit on number of selected variables in Markov chain.
        fprintf('Computing MCMC estimates: \n');
        sx = sum(var1(X));
        [ss sas qs PIP] = bvsmcmc(X,y,a,b,@(x) logpve(c*sx,x),m0,ns);
        POST = PIP;
end
return
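For completeness, I call this function as, e.g.,

POST = runbayesianfeatsel(y, X, 'mcmc');

where y is the 42 x 1 response vector and X is the 42 x 300 predictor matrix described above.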

I have been experiencing the following error when running the MCMC method on my data; it happens intermittently and is not reproducible:

Attempted to access ng(0); index must be a positive integer or logical.

Error in bvsmcmc>samplegamma (line 206)
    i    = ng(k);

Error in bvsmcmc (line 75)
    [gamma g S] = samplegamma(X,S,xy,gamma,g,sigma,sa,a,b,m0);

Error in networkanalysis>runbayesianfeatsel (line 173)
    [ss sas qs PIP] = bvsmcmc(X,y,a,b,@(x) logpve(c*sx,x),m0,ns);

Error in networkanalysis (line 114)
    POST{iSub} = runbayesianfeatsel(y,X,method);

206       i    = ng(k);

It looks like, in samplegamma's workspace, softmax(logr) is a vector of NaN + NaN*i values, which seems wrong.
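For reference, a numerically stable softmax over a vector of log-ratios is usually computed by shifting by the maximum before exponentiating; here is a minimal illustrative sketch (not the actual helper inside bvsmcmc). No amount of shifting rescues entries that are already NaN or complex, so the complex NaNs suggest logr is being corrupted further upstream, for example by taking the log of a non-positive quantity (which MATLAB returns as a complex number).

% Minimal sketch of a numerically stable softmax over log-ratios logr:
% shift so the largest entry is zero (prevents overflow in exp), then
% normalize. This cannot repair entries that are already NaN or complex.
logr = logr - max(logr);
p    = exp(logr) / sum(exp(logr));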

Please advise on what may be going wrong, and let me know if you need more detail.

PS: two unrelated questions:

1) I have not yet tried changing the priors, which may not be well suited to my problem. Do the priors matter much in your experience? I am basically working on a problem very similar to Smith et al., "Functional Connectomics from Resting-State fMRI," TiCS (2013), Fig. 7 [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4004765/].

2) So far, variational inference seems slower than MCMC posterior estimation. Of course, I expect this to change as the input data grows. Do you have a sense of when variational inference becomes faster than MCMC?

Thanks a lot!

pcarbo commented 9 years ago

The issue with NaNs being produced has not yet been resolved. I will be on the lookout for this bug.

Questions #1 and #2 were discussed over email with Julien Dubois.

(1) The priors can be very important, but how important they are is highly problem-dependent. My usual strategy is to try a range of different values for the priors and see which ones yield the largest (normalized) weights. Note that if you download the most recent version of varbvs.m from GitHub, I have added two new options that compute maximum-likelihood estimates of the residual variance and the prior variance of the regression coefficients. This should dramatically speed up the variational algorithm and give you an initial sense of what some of the priors should be.
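As an illustration of that grid-scan strategy (a sketch only, reusing the variables and the varsimbvs call from runbayesianfeatsel above; the candidate values are arbitrary):

% Scan a grid of candidate prior settings and inspect the normalized
% weights w returned by varsimbvs.
sigma  = (8:13)';              % candidate residual variances
sa     = (0.025:0.025:0.4)';   % candidate prior variances of the coefficients
log10q = (-2.5:0.25:-1)';      % candidate log10 prior inclusion probabilities
[sigma, sa, log10q] = ndgrid(sigma, sa, log10q);
[w, alpha, mu] = varsimbvs(X, y, sigma, sa, log10q, a, b, c);
% Report the grid point with the largest normalized weight; if it sits on
% a grid boundary, the candidate range should probably be extended.
[~, imax] = max(w(:));
fprintf('Largest weight at sigma = %g, sa = %g, log10q = %g\n', ...
    sigma(imax), sa(imax), log10q(imax));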

(2) Julien found that the coordinate ascent updates for computing the variational approximation converged much faster once y and the columns of X were centered to have a mean of zero (which is equivalent to including an intercept in the linear regression).
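For other readers, the centering itself is standard preprocessing rather than a varbvs routine; a minimal sketch is:

% Center the response and each column of the predictor matrix to have
% mean zero; this absorbs the intercept of the linear regression.
y = y - mean(y);
X = bsxfun(@minus, X, mean(X,1));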