Trouble reproducing results with same tolerance level for different versions

cbmporter commented 5 years ago

Hello,

I am trying to reproduce my colleague's results. She ran FitGoM using CountClust version 1.4.1, with the tolerance parameter "tol" set to 10. She had many iterations until convergence with this tolerance value. I ran her exact script using CountClust version 1.6.1 with tolerance set to 10, and had zero iterations until convergence (it immediately returned "done"), and the topics generated when I ran the script do not look similar to hers. When I run her script using a tolerance of 0.1, the topics look much more similar, and I have a more reasonable number of iterations until convergence. Are there any updates between version 1.4.1 and 1.6.1 that you know of that would lead to this behavior? If not, do you have any other ideas as to what might be going on? We are both running with R 3.4.

Thanks so much, Caroline

kkdey commented 5 years ago

@cbmporter The fact that it immediately returns "done" when tol=10 means that a tolerance of 10 is very high for this problem. Actually, the smaller the tolerance, the better, because the tolerance gives a sense of the convergence of successive iterations. A tolerance of 0.01 is more desirable overall, but it is set to 10 by default because, for most high-dimensional problems, the results with tol of 10 and 0.01 look very similar, and a tol of 10 is faster than 0.01. In your case, the problem may have a smaller number of samples or features, which would demand a lower tolerance. If speed is not an issue, I would definitely suggest keeping the tolerance at 0.01 or even smaller.
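
To illustrate why a large tolerance can end a fit before any iteration runs, here is a minimal toy sketch in R. It is not the algorithm used by CountClust or maptpx: the function `count_iterations`, the starting gain of 8, and the halving of the gain on each pass are all hypothetical choices made only to show the generic "stop when the improvement drops below tol" pattern.

```r
# Toy illustration only: NOT the maptpx/CountClust fitting algorithm.
# `first_gain` is a hypothetical improvement in the objective on the first pass.
count_iterations <- function(tol, first_gain = 8, max_iter = 1000) {
  gain <- first_gain
  iter <- 0
  # Keep iterating while the last improvement is still larger than the tolerance.
  while (gain > tol && iter < max_iter) {
    iter <- iter + 1
    gain <- gain / 2  # assume each pass improves the fit half as much as the last
  }
  iter
}

count_iterations(tol = 10)    # 0: the first gain (8) is already below the tolerance
count_iterations(tol = 0.1)   # 7
count_iterations(tol = 0.01)  # 10
```

In this toy setting, the same data with a different convergence criterion or a different scale for the objective would give a different iteration count for the same tol, which is one way two installations could disagree.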

cbmporter commented 5 years ago

@kkdey Thanks for your response. I didn't realize that a tolerance of 0.01 or even smaller is typically preferred, though I have found a tolerance of 0.01 to produce more interesting results in my project. This is helpful to know, but it doesn't actually answer the question that I posted. My colleague and I are running the same script, with the exact same inputs to FitGoM (same data set). She is running at a tolerance of 10, and the number of iterations to convergence is reasonable - and looks more like the number of iterations I get when I input this dataset into FitGoM with a tolerance of 0.1. When I use her tolerance value of 10, I don't see any iterations until convergence. Our concern and confusion is that we are running the identical script and data set on two different versions of CountClust and seeing very different results for the same tolerance value. We aren't sure if this is connected to the CountClust version, or another package CountClust is dependent on, or something else. Thank you for your further thought on this issue.

kkdey commented 5 years ago

Are you using the same version of maptpx? It may be because one of you is using an older version of it. The newest version can be installed from GitHub:

library(devtools)
install_github('TaddyLab/maptpx')

as mentioned in the README. The CRAN version of maptpx is older and has a higher default tolerance, so if one of you is using the CRAN version and the other the GitHub version, there is a chance you may see this. I would suggest using the GitHub version, as the CRAN version is buggy.
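
One way to check whether the two setups differ is to print the installed versions on both machines and compare them. The sketch below uses only base R utilities (`requireNamespace`, `utils::packageVersion`, `sessionInfo`); the helper name `installed_version` is hypothetical, introduced here for illustration.

```r
# Return the installed version of a package as a string, or NA if it is missing.
# `installed_version` is a hypothetical helper name, not part of any package.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Run this on both machines and compare the output line by line.
for (pkg in c("CountClust", "maptpx")) {
  cat(pkg, ":", installed_version(pkg), "\n")
}

# sessionInfo() also reports the R version and the loaded packages.
print(sessionInfo())
```

If the maptpx versions differ, reinstalling the same version on both machines and rerunning the script with the same tol is a reasonable next step.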

cbmporter commented 5 years ago

@kkdey Thanks, I will check this out and report back.

pcarbo commented 3 years ago

@cbmporter Thank you for your interest in CountClust, and for posting this issue.

We are developing a new R package, fastTopics, that has most of CountClust's features, plus several important improvements, most notably model-fitting algorithms that are much faster and more accurate.

As this package is in active development, we welcome questions and feedback (on GitHub or by email).
