nsh87 opened 8 years ago
@nsh87 based on what you have already done here, I was thinking that I might branch off and start working on some of the fixes.
Also, we could make `multi_clust` run faster if we make some of the `multiClust` slots optional. For instance, `clusGap` is a pretty slow function; we could give the user the option to turn that off. There are also things like silhouette score that we do not have to return, though I do not know that that would save much time.
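One way the optional computation could look; a minimal sketch, assuming a `run_gap` switch (the function name, argument, and return structure here are hypothetical illustrations, not the package's actual API):

```r
library(cluster)  # for clusGap()

# Sketch: let callers skip the expensive gap-statistic step.
multi_clust_sketch <- function(d, krange = 2:10, run_gap = TRUE) {
  fits <- lapply(krange, function(k) kmeans(d, centers = k, nstart = 10))
  gap <- NULL
  if (run_gap) {
    # clusGap is the slow step (it bootstraps B reference data sets),
    # so only run it when the caller asks for it
    gap <- clusGap(d, FUN = kmeans, K.max = max(krange), B = 50)
  }
  list(fits = fits, gap = gap)
}
```

With `run_gap = FALSE` the gap slot would simply come back `NULL`, which is cheap to check downstream.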
@nsh87 also, tcr (equivalent to single-cell-boolean.csv) ran in about 12 minutes on my computer locally using alllong for the index.
@catterbu: what about cmv-fluidym.csv? that one has, I think, 6x the number of rows
@nsh87 Yeah. When I start to get ready for the gym I will start it running. Also, I branched off and added the `stop()` call with an informative message for the distinct data points issue. Next, I will work on adding that histogram of how many clusters are selected by different algorithms.
Also, if the speed continues to be an issue, I can go back through NbClust and try to optimize it more. I tried not to change anything unless it was necessary, but there were plenty of inefficiencies.
@catterbu: at the moment, I'd say we are going to have to have two options: determine optimal k with NbClust, or use WSS. these are maxing out the CPU, which doesn't seem like a great use of CPU when the user is going to interpret the result. cmv-fluidym is exceeding the 1hr. task time limit and is still running in the background after 1.5hrs. i'm going to try on tcr-boolean again.
@catterbu: obviously let's look at performance improvement first - that needs to create a pretty drastic cut in execution time, though. i'm pretty naive about what is taking a long time, though, so maybe the optimizations you mentioned above would work. even cutting run time in half though might not be enough.
@catterbu: it would be awesome if it can be optimized enough to cut the execution time. there could be two options on the site: fast and exhaustive (I think you know which is which lol).
@nsh87 we were using average silhouette score previously for the optimal number of clusters. We could use that instead of WSS. WSS will just always pick the top number in the `krange` by the nature of how it is calculated.
@catterbu: ya, wasn't thinking! silhouette
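To spell out the WSS point: total within-cluster sum of squares decreases monotonically as k grows, so minimizing it always selects the largest k in the range, whereas average silhouette width peaks at some interior k and can be maximized directly. A small R sketch of the comparison (using base `kmeans` and `cluster::silhouette` on a stand-in data set, not the package's actual code):

```r
library(cluster)  # for silhouette()

d <- scale(iris[, 1:4])  # stand-in data set for illustration
krange <- 2:8

scores <- sapply(krange, function(k) {
  km <- kmeans(d, centers = k, nstart = 10)
  c(wss = km$tot.withinss,
    sil = mean(silhouette(km$cluster, dist(d))[, "sil_width"]))
})

# WSS shrinks monotonically, so which.min(scores["wss", ]) would always
# land on max(krange); average silhouette peaks at an interior k.
krange[which.max(scores["sil", ])]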
The purpose of this Issue is to track tests of the current `dev` version of the package against the `master` version.

Why? The current `dev` version contains two significant changes:

- the `multi_clust()` function now uses NbClust to determine the optimal k
- the `multiClust` object is now a `multiClust` S4 class

The testing will involve running the same data sets on the staging site (which has the `master` branch of receptormarker installed) and a local dev version (which has the `dev` branch installed) and comparing the outcomes. Checked boxes below indicate that the same result is obtained with NbClust as before, and that none of the other potential issues described above have been observed.
Estimate k (select "Replace empty cells with: `0`" on the site):

- [ ] index NbClust should use ('all' or 'alllong'): try 'all' to see if it decreases memory requirements and increases performance - 84778c7

`Error in multiclust[["k_best"]] : this S4 class is not subsettable`. This was due to the clustering task on the frontend server using the old notation for the `multiClust` structure. I've updated it to use `multiclust@k_best` for the new S4 class. Test again. FYI, @catterbu.
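For reference, that error arises because `[[` has no method for a plain S4 object; slots are accessed with `@` (or programmatically with `slot()`). A minimal sketch with a stand-in class (the class and slot names are illustrations, not the real `multiClust` definition):

```r
# Stand-in S4 class mirroring the k_best slot discussed above
setClass("multiClustDemo", slots = c(k_best = "numeric"))
mc <- new("multiClustDemo", k_best = 4)

try(mc[["k_best"]])   # Error: this S4 class is not subsettable
mc@k_best             # slot access with @ works
slot(mc, "k_best")    # equivalent, programmatic accessor
```

This is why every call site that used the old list-style `multiclust[["k_best"]]` notation has to be migrated to `multiclust@k_best`.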