somakd / fad

Factor Analysis for Data in R
3 stars 1 forks source link

Warning message while processing a simulated data from MixSim #2

Open ashishjain1988 opened 4 years ago

ashishjain1988 commented 4 years ago

Hi,

I am receiving the following warning message when I am trying to run fad on a simulated from MixSim Package.

Warning message: In fad(scale(d), 4) : Algorithm may not have converged. Try another starting value.

Can you please comment on what steps should I take to rectify this?

Steps to simulate dataset I am first simulating the factor matrix using the MixGOM function. The details of the simulated data matrix are:

factors: 5

samples: 100

Clusters:4

GeneralizedOverlap: 0.01

After that, we simulated the weight matrix (loadings matrix) using Gaussian Distribution (N(mean=2,sd=2)) with a total number of features as 1125. We further multiplied the simulated factor matrix and loading matrix to generate our dataset (100 X 1125). We then added noise to that dataset using Gamma distribution (rate = 1).

Note: fad is not giving warning if I simulate weight matrix using Gaussian Distribution (N(mean=2,sd=1)) with a total number of features as 1000.

Thanks, Ashish Jain

ashishjain1988 commented 4 years ago

You can download the simulated data @ https://github.com/ashishjain1988/machine-learning-examples/blob/master/simulated_data2_1.txt

somakd commented 4 years ago

Some of the estimates of the uniqueness occur at the boundary. I tried running from multiple starting values and all of the final estimates seemed to be very close (within 3-4 decimal places).

Also I tried estimating using EM algorithm using the cate package. The final solution from the EM had lower log-likelihood. In fact, the EM algorithm gives suboptimal answer because the first order conditions are not met:

BiocManager::install("cate")
h = cate::factor.analysis(Y=scale(d),r=4)
Sigma = tcrossprod(h$Gamma) + diag(h$Sigma)

Under the first order optimality, all the diagonals of Sigma must be 1 because scaled(d) has been used, but you can check all entries of diag(Sigma) are less than 1.

Conclusion: I'd think it's safe to ignore the warning message.

Do you find similar problem if you generate the noise from normal distribution instead of gamma?

ashishjain1988 commented 4 years ago

Thank you for the help! I am getting the same warning if I use noise from a normal distribution for K=1 to 4.

somakd commented 4 years ago

Can you provide a little more insight on how the data was generated? Is it like below?

library(MixSim)
n = 100
p = 1125 
q = 5 # number of factors
nclust = 4
overlap = 0.01

cluster.object <- MixGOM(goMega = overlap, K = nclust, p = q, sph = T)
z <- simdataset(n = n,Pi = cluster.object$Pi,  
                Mu = cluster.object$Mu, S = cluster.object$S)$X
L <- matrix(rnorm(p*q,2,2),p, q)
Y <- tcrossprod(z,L) + matrix(rexp(n*p,rate = 1) - 1, n, p)
ashishjain1988 commented 4 years ago

In our case, we are simulating the data following some assumptions that we made in our tool. Below is the code to generate the dataset.

library(MixSim)
##MixSim Simulations
#Q <- MixSim(MaxOmega = 0.20, BarOmega = 0.05, K = 4, p = 10,hom = TRUE,sph = TRUE)
L<-4 #Clusters
K<-5 #Factors
N<-100 #Samples
Q <- MixGOM(goMega = 0.01, K = L-1, p = K, sph = TRUE,hom = TRUE)

#Rescale Mu by dividing with Sigma values
rescaledNu<-Q$Mu
for(i in c(1:(L-1)))
{
  rescaledNu[i,] <- Q$Mu[i,]/diag(sqrt(Q$S[,,i]))
}

#Rescaling Pi
ProbablityOfLastCluster<-Q$Pi[1]
rescalePi<-c(Q$Pi,ProbablityOfLastCluster)/(sum(Q$Pi)+ProbablityOfLastCluster)

##Getting Mu of final Cluster
finalCluster<-c()
for(i in c(1:K))
{
  finalCluster<-c(finalCluster, -sum(rescaledNu[,i]*rescalePi[1:(L-1)]/rescalePi[L]))
}
rescaledNu<-rbind(rescaledNu,finalCluster)

#Making Covariance matrix equal to Identity
newS<-array(0, c(K, K, L))
for(i in c(1:L))
{
  m<-matrix(0,ncol = 5,nrow = 5)
  diag(m)<-rep(1,5)
  newS[,,i]<-m
}

FactorClusters <- simdataset(n = N, Pi = rescalePi, Mu = rescaledNu, S = newS)

simulate_weights <- function(k,## factors
                             j, ## biomark
                             params=list(c(mean=1,sd=1)), ## parameter distribution
                             J, ## all biomarkers
                             labelName="gene"
){
  #print(params)
  m <- params["mean"]
  sd <- params["sd"]
  c= matrix(rnorm(k*j,mean=m,sd=sd),ncol=j,nrow=k)
  c= cbind(c, matrix(0,nrow=k, ncol=J-j))
  idx <- sample(1:ncol(c))
  positive <- sapply(1:j, function (ll) which(ll== idx))
  c<-c[,idx]
  colnames(c) <- sprintf("gene%s", 1:ncol(c))
  #print(head(positive))
  positive <- paste0("gene",positive)
  return(list(C=c,positive=positive))
}

GammaNoise<-function(noOfSamples,noOfFeatures,rate)
{
  sdForFeatures<-rgamma(noOfFeatures,1)
  lNoise<-lapply(1:noOfFeatures,function(x){
    rnorm(noOfSamples, mean=0, sd=sdForFeatures[x])
  })
  return(t(do.call("rbind", lNoise)))
}

expWeights<-simulate_weights(K,200,(c(mean=2,sd=1)),1000,"gene")
probesWeights<-simulate_weights(K,250,(c(mean=2,sd=2)),1125,"probe")

#Add Noise for gene expression data
dataExpression<- FactorClusters$X %*% expWeights$C
dataExpression <- dataExpression + GammaNoise(nrow(dataExpression),ncol(dataExpression),1)
#dataExpression <- dataExpression + matrix(rnorm(N*ncol(dataExpression), mean=0, sd=0.1),nrow=N, ncol=ncol(dataExpression))

#Add Noise for DNAMethylation data
dataMethylation<- FactorClusters$X %*% probesWeights$C
dataMethylation <- dataMethylation + GammaNoise(nrow(dataMethylation),ncol(dataMethylation),1)