talegari / bigdist

Store Distance Matrices on Disk:
https://talegari.github.io/bigdist/

Is it possible to use a bigdist FBM as an input for fastcluster::hclust? #4

Open JimTD opened 4 years ago

JimTD commented 4 years ago

Dear Prof Shrikanth,

I am faced with a problem that requires a very large distance matrix (6.9 GB), and I wish to build a hierarchical clustering using the Ward method.

So far I have been able to use your R package (bigdist) to create an FBM of the distance matrix and store it on a local drive. However, I have yet to find a hierarchical clustering solution that works on such a large distance matrix. I started with the obvious choice of fastcluster::hclust() as follows:

d3bigdist <- bigdist(mat = d3fordist, file = file.path("Output/distYFTbig")) ## note: d3fordist is a large matrix with 257370 elements, size = 2 MB

# can't get bigdist FBM to work with hclust

# Connect to that bigdist object on file

temp2 <- bigdist(file = file.path("Output/distYFTbig_42895_float"))

hcYFT <- fastcluster::hclust(temp2$fbm, method = "ward.D2")
print(Sys.time())

Error in fastcluster::hclust(temp2$fbm, method = "ward.D2") : 'N' must be a single integer.
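
This error is consistent with hclust expecting a "dist" object rather than a matrix-like object: it appears that fastcluster::hclust reads the number of observations from attr(d, "Size"), which an FBM (like any plain matrix) does not carry. A minimal sketch in base R, using stats::hclust and a small random matrix `x` as stand-ins chosen purely for illustration:

```r
# Toy data: 5 observations of 6 variables (illustrative only).
set.seed(1)
x <- matrix(rnorm(30), nrow = 5, ncol = 6)

d <- dist(x)        # a "dist" object: lower triangle plus attributes
m <- as.matrix(d)   # a full square matrix, comparable to what an FBM stores

attr(d, "Size")     # number of observations recorded on the dist object
attr(m, "Size")     # NULL: a plain matrix has no such attribute

# Converting the square matrix back to a "dist" restores the attribute,
# but this requires the whole lower triangle to fit in RAM.
d2 <- as.dist(m)
hc <- hclust(d2, method = "ward.D2")
```

For a matrix of this issue's size, the as.dist() step is exactly what does not fit in memory, so this explains the error message rather than offering a workaround.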

Do you know of an approach that allows a function like hclust to access the data within the FBM and build a hierarchical tree piece by piece? Or must I appropriately sample the FBM, build a tree, and then append the remaining data from the FBM?

I have been working on a Windows machine with 8 GB of RAM. Would working on a Linux platform make any difference?
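
A back-of-the-envelope check of the memory involved, assuming 8-byte doubles for an in-memory "dist" object and 4-byte floats for a full square matrix on disk (the latter is an assumption suggested by the `_float` suffix in the file name, not something confirmed in this thread):

```r
n <- 42895  # observations, taken from the file name above

# In-memory "dist" object: lower triangle only, stored as 8-byte doubles.
dist_bytes <- n * (n - 1) / 2 * 8

# Full square matrix stored as 4-byte floats (assumed on-disk layout).
fbm_bytes <- n^2 * 4

# Helper to express a byte count in GiB.
gib <- function(bytes) bytes / 1024^3

round(gib(dist_bytes), 2)
round(gib(fbm_bytes), 2)
```

Both figures come out a little under 7 GiB, in line with the 6.9 GB quoted above. On a machine with 8 GB of RAM that leaves almost nothing for the clustering itself, and changing the operating system would not by itself change this arithmetic.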

P.S. Why am I trying to do this? I have been handed some legacy SAS code that successfully uses the Ward method to hierarchically cluster 42895 observations of 6 variables. I have not been able to find or construct a solution that mirrors this process in R. I have had success with partition-based clustering (using kmeans and CLARA); however, it would be great if I could also compare these approaches with the hierarchical approach used in the original SAS code.
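
Because the underlying data are only 42895 observations of 6 variables, one option that may sidestep the distance matrix altogether is fastcluster::hclust.vector(), which clusters directly from the data matrix. Its documentation describes method = "ward" as equivalent to hclust() with "ward.D2" on Euclidean distances. The sketch below runs on simulated data and uses a hypothetical helper `ward_tree()` that falls back to stats::hclust() when fastcluster is not installed; whether the result matches the SAS output has not been verified here.

```r
# Hypothetical helper: Ward clustering computed from the raw data matrix.
ward_tree <- function(X) {
  X <- as.matrix(X)
  if (requireNamespace("fastcluster", quietly = TRUE)) {
    # Memory-saving routine: no n x n distance matrix is built.
    fastcluster::hclust.vector(X, method = "ward", metric = "euclidean")
  } else {
    # Fallback for small inputs only: this builds the full dist object.
    stats::hclust(stats::dist(X), method = "ward.D2")
  }
}

set.seed(42)
X <- matrix(rnorm(200 * 6), ncol = 6)  # simulated stand-in for d3fordist
hc <- ward_tree(X)

# Cut the tree into 4 groups (the value of k is arbitrary here).
groups <- cutree(hc, k = 4)
table(groups)
```

This only applies when the distances are Euclidean distances on the original variables; for any other dissimilarity the on-disk distance matrix would still be needed.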

Warm regards

Jim Dell

talegari commented 4 months ago

@JimTD I have found a way to use https://github.com/dipterix/filearray/ and https://github.com/bwlewis/hclust_in_R/blob/master/hc.R to implement hclust on bigdist. I will implement it shortly and update you.