thibautjombart / adegenet

adegenet: an R package for the multivariate analysis of genetic markers

Parallel option for find.clusters and dapc? #339

Open ac-harris opened 1 year ago

ac-harris commented 1 year ago

Hi there,

I have a SNPbin object with ~500 individuals genotyped at ~670,000 SNPs. I selected the optimal number of PCs to retain for DAPC using the xval function run in parallel, which took about 3 days on our server. However, as far as I can tell, there is no parallel option for the find.clusters or dapc functions. We have been running find.clusters on this dataset, using the optimal number of PCs from xval, for ~2 weeks with no end in sight. Is there a way to parallelize find.clusters and dapc? Are there plans to add this functionality to the functions themselves?

I understand that we could randomly subset markers and run DAPC on the subset, but in an ideal world, I'd like to be able to compare patterns and inferences between the full dataset and a subset. The code we ran for find.clusters is below.

# find clusters
B <- xval_iter[[6]] # number PCs achieving lowest RMSE from cross-validation (51)
print("starting find.clusters")
set.seed(1500)
grp <- find.clusters(up, n.pca = B, max.n.clust = 32)
save.image("clust.Rda")
print("find.clusters complete.")
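
For reference, the marker-subsetting route mentioned above might look roughly like the sketch below. It runs on a small simulated genlight object, not the real data, and all names in it (`toy`, `n_keep`, `n_pca`, `pca`, `clusters`, `fit`) are hypothetical. It also assumes the installed adegenet version accepts a precomputed `glPca` object in both `find.clusters` and `dapc`, so the PCA is computed once and reused; please check `?glPca`, `?find.clusters` and `?dapc` before relying on this.

```r
library(adegenet)

set.seed(1500)

# Toy stand-in for the real data: 40 diploid individuals, 2000 SNPs,
# two groups of 20 drawn from different allele frequencies (made-up values).
n_snp  <- 2000
freq_a <- runif(n_snp, 0.1, 0.9)
freq_b <- runif(n_snp, 0.1, 0.9)
geno <- rbind(
  matrix(rbinom(20 * n_snp, 2, rep(freq_a, each = 20)), nrow = 20),
  matrix(rbinom(20 * n_snp, 2, rep(freq_b, each = 20)), nrow = 20)
)
rownames(geno) <- paste0("ind", seq_len(nrow(geno)))
toy <- new("genlight", geno, parallel = FALSE)

# Randomly keep a subset of markers (columns of a genlight object are loci).
n_keep <- 500
keep   <- sort(sample(nLoc(toy), n_keep))
sub    <- toy[, keep]

# Compute the PCA once. Assumption: on a multi-core Unix-alike machine,
# glPca(..., useC = FALSE, parallel = TRUE, n.cores = <cores>) could be
# used here instead; parallel = FALSE keeps this toy example portable.
n_pca <- 10
pca   <- glPca(sub, nf = n_pca, parallel = FALSE)

# Reuse the PCA for clustering; choose.n.clust = FALSE avoids the
# interactive prompt and picks K automatically using `criterion`.
clusters <- find.clusters(sub, glPca = pca, n.pca = n_pca,
                          max.n.clust = 6, choose.n.clust = FALSE,
                          criterion = "diffNgroup")

# Reuse the same PCA for DAPC on the inferred groups.
fit <- dapc(sub, pop = clusters$grp, glPca = pca,
            n.pca = n_pca, n.da = nlevels(clusters$grp) - 1)

print(table(clusters$grp))
```

If the `glPca` argument behaves as assumed, the same pattern could in principle be applied to the full dataset, which would leave only the k-means step on the retained PCs to run serially.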

Thank you! Audrey

gvp681 commented 1 year ago

Hi,

Was this issue resolved? I am having a similar problem with the program and would like to figure out how to optimize the processing time.

Thanks!