getting rid of R - Githubissues

sweverett / CluStR

Calculates various scaling relations from cluster catalogs.

5 stars 4 forks source link

getting rid of R #32

Closed paigemkelly closed 4 years ago

paigemkelly commented 4 years ago

where is R being used in the code? is it only in reglib.py lines 9-16?

sweverett commented 4 years ago

I believe it's only in reglib.py, but it includes all the rpy2 imports, lines 9-16, as well as what's done in the run_lrgs() function. So we actually don't need anything in lines 1-75 except for the numpy and linmix imports.

paigemkelly commented 4 years ago

okay great, we just put that straight into the clustr.py code. We are having trouble figuring out the run_mcmc part of that function and the .chain part. I haven't seen that before and I'm having trouble relating what I'm finding on the internet to what's actually going on in the code. From what I'm understanding run_mcmc runs linmix through Nmin = 5000 to Nmax = 10000. What do Nmin and Nmax represent and how does that apply to linmix? or should we not worry about it?

sweverett commented 4 years ago

I explained this a bit in #33. In principle we would try out every possible combination of values for (slope, intercept, intrinsic_scatter) and use the combination that gives the 'best fit' to the data. But this is quickly a computational nightmare as the time it takes to check all combination grows as the number of grid points ^ # of parameters. So instead there are smarter ways of sampling the parameter space of (slope, intercept, intrinsic_scatter), one of which is a Monte-Carlo Markov Chain (MCMC). It does a random walk in parameter space in a weighted way that will lead to it sampling better fit values than bad fit values. We can talk about this more in a zoom meeting if you'd like.

sweverett commented 4 years ago

But it doesn't give a very reliable estimate if you only sample a few points. So you want to run until it 'converges', which is actually a bit of a complicated subject. Setting a minimum number of samples makes sure it has found the "correct" part of parameter space to sample from and the max number of samples will force it to quit if it's struggling to do the fit after a long time.

paigemkelly commented 4 years ago

I think that makes sense, but just to clarify it will run until at least 5000 samples (Nmin) to make sure its reliable, and it will quit after 10000 (Nmax) if it's struggling? A zoom meeting would be helpful. I'll send an email, thanks!

sweverett commented 4 years ago

Yes, that's correct. Those values aren't necessarily reliable however - it's always best to inspect the resulting chains by eye. We will do that eventually.