Dear Prof Shrikanth,
I am faced with a problem that requires producing a very large distance matrix (6.9 GB), from which I wish to build a hierarchical clustering using the Ward method.
So far I have successfully used your R package (bigdist) to create an FBM (file-backed matrix) of the distance matrix and store it on a local drive. However, I have been searching for a hierarchical clustering solution for such a large distance matrix and have yet to find one. I started with the obvious choice, fastcluster::hclust(), as follows:
d3bigdist <- bigdist(mat = d3fordist, file = file.path("Output/distYFTbig")) ## d3fordist is a matrix with 257370 elements, size = 2 MB
## Can't get the bigdist FBM to work with hclust.
## Connect to the big dist object on file:
temp2 <- bigdist(file = file.path("Output/distYFTbig_42895_float"))
hcYFT <- fastcluster::hclust(temp2$fbm, method = "ward.D2")
print(Sys.time())

This fails with:

Error in fastcluster::hclust(temp2$fbm, method = "ward.D2") :
  'N' must be a single integer.
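As far as I can tell, the error arises because fastcluster::hclust() expects a stats::dist object rather than an FBM, so it cannot recover the number of observations 'N' from what I passed it. For reference, here is a minimal sketch of the input shape that does work for me on a small in-memory subset (base R only, with made-up data, no bigdist involved):

```r
## Minimal check: hclust (stats:: or fastcluster::) wants a "dist" object,
## i.e. the lower triangle of the distance matrix, not a file-backed matrix.
set.seed(1)
x  <- matrix(rnorm(100 * 6), nrow = 100)   # 100 observations, 6 variables
d  <- dist(x)                              # a "dist" object (lower triangle)
hc <- stats::hclust(d, method = "ward.D2") # Ward clustering succeeds
stopifnot(inherits(hc, "hclust"))
```

Of course, converting the full 6.9 GB FBM back into an in-memory dist object is exactly what I cannot do on this machine.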
Do you know of an approach that allows a function such as hclust to access the data within the FBM and build a hierarchical tree piece by piece? Or must I sample the FBM appropriately, build a tree, and then append the remaining data from the FBM?
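To clarify what I mean by "sample, build, then append": my fallback idea is to cluster a manageable random subsample hierarchically, cut the resulting tree into groups, and then assign each remaining observation to the nearest group centroid. A rough sketch in base R (with toy data standing in for my real matrix, and sizes shrunk for illustration):

```r
## Fallback sketch: hierarchical clustering on a subsample, then
## nearest-centroid assignment of the remaining observations.
## (Toy data; the real rows would be drawn from the FBM.)
set.seed(42)
full <- matrix(rnorm(2000 * 6), nrow = 2000)

## 1. Ward clustering on a random subsample that fits in memory
idx <- sample(nrow(full), 200)
sub <- full[idx, ]
hc  <- hclust(dist(sub), method = "ward.D2")
grp <- cutree(hc, k = 5)

## 2. Centroid of each cluster, then label every row by nearest centroid
cent <- do.call(rbind, lapply(split(as.data.frame(sub), grp), colMeans))
near <- function(v) which.min(colSums((t(cent) - v)^2))
lab  <- apply(full, 1, near)
stopifnot(length(lab) == nrow(full))
```

This obviously gives only an approximation of the tree the SAS code produced, which is why I would prefer a way to run Ward clustering against the full FBM directly.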
I have been working on a Windows machine with 8 GB of RAM; would working on a Linux platform make any difference?
P.S. Why am I trying to do this? I have been handed some legacy SAS code that successfully used the Ward method to hierarchically cluster 42895 observations of 6 variables. I have not been able to find or construct a solution that mirrors this process in R. I have had success with partition-based clustering (k-means, and CLARA from the cluster package); however, it would be great to compare these approaches with the hierarchical approach used in the original SAS code.
Warm regards,
Jim Dell