nsh87 opened 8 years ago
@nsh87 based on what you have already done here, I was thinking that I might branch off and start working on some of the fixes.
Also, we could make `multi_clust` run faster if we make some of the `multiClust` slots optional. For instance, `clusGap` is a pretty slow function; we could give the user the option to turn that off. There are also things like silhouette score that we do not have to return, though I do not know that that would save much time.
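One way the optional computation could look; a minimal sketch, assuming a `run_gap` switch (the function name, argument, and return structure here are hypothetical illustrations, not the package's actual API):

```r
library(cluster)  # for clusGap()

# Sketch: let callers skip the expensive gap-statistic step.
multi_clust_sketch <- function(d, krange = 2:10, run_gap = TRUE) {
  fits <- lapply(krange, function(k) kmeans(d, centers = k, nstart = 10))
  gap <- NULL
  if (run_gap) {
    # clusGap is the slow step (it bootstraps B reference data sets),
    # so only run it when the caller asks for it
    gap <- clusGap(d, FUN = kmeans, K.max = max(krange), B = 50)
  }
  list(fits = fits, gap = gap)
}
```

With `run_gap = FALSE` the gap slot would simply come back `NULL`, which is cheap to check downstream.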
@nsh87 also, tcr (equivalent to single-cell-boolean.csv) ran in about 12 minutes on my computer locally using alllong for the index.
@catterbu: what about cmv-fluidym.csv? that one has, I think, 6x the number of rows
@nsh87 Yeah. When I start to get ready for the gym I will start it running. Also, I branched off and added the `stop()` call with an informative message for the distinct data points issue. Next, I will work on adding that histogram of how many clusters are selected by different algorithms.
Also, if the speed continues to be an issue, I can go back through NbClust and try to optimize it more. I tried not to change anything unless it was necessary, but there were plenty of inefficiencies.
@catterbu: at the moment, I'd say we are going to have to have two options: determine optimal k with NbClust, or use WSS. these are maxing out the CPU, which doesn't seem like a great use of CPU when the user is going to interpret the result. cmv-fluidym is exceeding the 1hr. task time limit and is still running in the background after 1.5hrs. i'm going to try on tcr-boolean again.
@catterbu: obviously let's look at performance improvement first - that needs to create a pretty drastic cut in execution time, though. i'm pretty naive about what is taking a long time, though, so maybe the optimizations you mentioned above would work. even cutting run time in half though might not be enough.
@catterbu: it would be awesome if it can be optimized enough to cut the execution time. there could be two options on the site: fast and exhaustive (I think you know which is which lol).
@nsh87 we were using average silhouette score previously for the optimal number of clusters. We could use that instead of WSS. WSS will just always pick the top number in the `krange` by the nature of how it is calculated.
@catterbu: ya, wasn't thinking! silhouette
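To spell out the WSS point: total within-cluster sum of squares decreases monotonically as k grows, so minimizing it always selects the largest k in the range, whereas average silhouette width peaks at some interior k and can be maximized directly. A small R sketch of the comparison (using base `kmeans` and `cluster::silhouette` on a stand-in data set, not the package's actual code):

```r
library(cluster)  # for silhouette()

d <- scale(iris[, 1:4])  # stand-in data set for illustration
krange <- 2:8

scores <- sapply(krange, function(k) {
  km <- kmeans(d, centers = k, nstart = 10)
  c(wss = km$tot.withinss,
    sil = mean(silhouette(km$cluster, dist(d))[, "sil_width"]))
})

# WSS shrinks monotonically, so which.min(scores["wss", ]) would always
# land on max(krange); average silhouette peaks at an interior k.
krange[which.max(scores["sil", ])]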
The purpose of this Issue is to track tests of the current `dev` version of the package against the `master` version.

Why? The current `dev` version contains two significant changes:

- the `multi_clust()` function now uses NbClust to determine the optimal k
- the `multiClust` object is now a `multiClust` S4 class

The testing will involve running the same data sets on the staging site (which has the `master` branch of receptormarker installed) and a local dev version (which has the `dev` branch installed) and comparing the outcomes. Checked boxes below indicate that the same result is obtained with NbClust as before, and that none of the other potential issues described above have been observed.
Estimate k (select "Replace empty cells with: `0`" on the site):

- [ ] index NbClust should use ('all' or 'alllong'): try 'all' to see if it decreases memory requirements and increases performance - 84778c7

`Error in multiclust[["k_best"]] : this S4 class is not subsettable`. This was due to the clustering task on the frontend server using the old notation for the `multiClust` structure. I've updated it to use `multiclust@k_best` for the new S4 class. Test again. FYI, @catterbu.
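For reference, that error arises because `[[` has no method for a plain S4 object; slots are accessed with `@` (or programmatically with `slot()`). A minimal sketch with a stand-in class (the class and slot names are illustrations, not the real `multiClust` definition):

```r
# Stand-in S4 class mirroring the k_best slot discussed above
setClass("multiClustDemo", slots = c(k_best = "numeric"))
mc <- new("multiClustDemo", k_best = 4)

try(mc[["k_best"]])   # Error: this S4 class is not subsettable
mc@k_best             # slot access with @ works
slot(mc, "k_best")    # equivalent, programmatic accessor
```

This is why every call site that used the old list-style `multiclust[["k_best"]]` notation has to be migrated to `multiclust@k_best`.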