tomr-stargazer / dendrogal

Analysis tools for my dendrogram decomposition of the Milky Way.

Improved mass function fitting #8

Open tomr-stargazer opened 9 years ago

tomr-stargazer commented 9 years ago

Chris

Hi Tom,

The general way to fit distributions without binning is to

1) Turn your cumulative mass function into a differential one (i.e. P(M = M'))
2) Make sure it's normalized like a probability function (i.e. it integrates to 1)
3) Determine the best differential mass function fit by maximizing the log-likelihood (i.e. the sum of the log of the differential PDF evaluated at each data point)
4) If desired, convert back into a cumulative mass function by integrating
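
A minimal sketch of these four steps for a bounded power law p(M) ∝ M^-alpha — the `masses.txt` file, the bounds, and the slope range are placeholders, not anything from this thread:

```python
import numpy as np
from scipy.optimize import minimize_scalar

masses = np.loadtxt("masses.txt")        # hypothetical: one mass per dendrogram structure
M_LO, M_HI = masses.min(), masses.max()  # assumed support of the power law

def neg_log_like(alpha):
    # Steps 1-2: the normalized differential PDF on [M_LO, M_HI] is
    # p(M) = norm * M**-alpha, with norm chosen so the integral is 1.
    norm = (1.0 - alpha) / (M_HI**(1.0 - alpha) - M_LO**(1.0 - alpha))
    # Step 3: minus the sum of the log-PDF over the unbinned data points.
    return -(np.log(norm) - alpha * np.log(masses)).sum()

res = minimize_scalar(neg_log_like, bounds=(1.01, 4.0), method="bounded")
alpha_ml = res.x

# Step 4: integrate the best-fit PDF back into a complementary CDF, P(>M).
def ccdf(m, alpha=alpha_ml):
    return ((M_HI**(1.0 - alpha) - m**(1.0 - alpha))
            / (M_HI**(1.0 - alpha) - M_LO**(1.0 - alpha)))
```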

My guess is that Alyssa is unsatisfied with the fit :) I do wonder whether that's because the model itself is inconsistent with the data (as opposed to a suboptimal choice of parameters). One very quick way to check is to make IPython widget sliders for each of the three parameters, so you can interactively adjust them and see if you can find any setting that follows the data (see the sketch below, and e.g. http://stackoverflow.com/questions/27489742/ipython-notebook-interactive-function-how-to-set-the-slider-range).
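
Something like this rough sketch — the data arrays and the three-parameter `model_cmf` here are made-up placeholders, not the actual dendrogal model:

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider

# Hypothetical stand-ins: replace with the real masses and cumulative counts.
m_data = np.logspace(3, 6, 100)
cmf_data = (m_data / 1e3) ** -1.5

def model_cmf(m, a, b, c):
    # Placeholder three-parameter model, purely for illustration.
    return c * (m / 10**b) ** -a

def plot_fit(a=1.5, b=3.0, c=1.0):
    plt.loglog(m_data, cmf_data, "k.", label="data")
    plt.loglog(m_data, model_cmf(m_data, a, b, c), "r-", label="model")
    plt.xlabel("M"); plt.ylabel("N(>M)"); plt.legend()
    plt.show()

interact(plot_fit,
         a=FloatSlider(value=1.5, min=0.5, max=3.0, step=0.05),
         b=FloatSlider(value=3.0, min=2.0, max=5.0, step=0.1),
         c=FloatSlider(value=1.0, min=0.1, max=10.0, step=0.1))
```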

cheers, chris

Erik

Hi all --

I disagree a bit with Chris in that the data are already binned in your code (i.e., by the np.histogram() call). To generate an empirical complementary cumulative distribution function, you'd just use the sorted masses (x.sort()) as the x-data and np.linspace(1, 0, len(x)) as the y-data, as sketched below.
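
In code (assuming `masses` holds the structure masses):

```python
import numpy as np
import matplotlib.pyplot as plt

masses = np.loadtxt("masses.txt")  # hypothetical input
x = np.sort(masses)                # sorted masses as the x-data
y = np.linspace(1, 0, len(x))      # descending fractions as the y-data

plt.loglog(x, y, drawstyle="steps-post")
plt.xlabel("M")
plt.ylabel("P(>M)")
plt.show()
```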

I soapboxed in detail about this in http://adsabs.harvard.edu/abs/2005PASP..117.1403R and argued that binning was evil. Chris' method, I believe, will implicitly require some kind of binning, in that you have to generate a PDF estimate. The naive way is to normalize a histogram (binning!) or to generate a Gaussian kernel density estimate (soft bins). The latter, however, provides a good way to account for the significant and variable uncertainties on the x-axis; see the sketch below.
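
A rough sketch of that last idea: a KDE where each point's kernel width is its own measurement error. Note this is hand-rolled — scipy.stats.gaussian_kde only supports a single global bandwidth — and the error array is an assumption:

```python
import numpy as np
from scipy.stats import norm

masses = np.loadtxt("masses.txt")  # hypothetical masses
mass_errs = 0.1 * masses           # hypothetical per-point 1-sigma uncertainties

grid = np.linspace(masses.min(), masses.max(), 500)

# One Gaussian per data point, each with its own width; the mean of N
# unit-normalized kernels is itself a unit-normalized density estimate.
pdf = norm.pdf(grid[:, None], loc=masses[None, :], scale=mass_errs[None, :]).mean(axis=1)
```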

I haven't tried to deal with this in Python but there's always https://github.com/low-sky/idl-low-sky/blob/master/eroslib/tpl_errinvar2.pro

Apart from the differentiating step, Chris' point is basically correct, though: you can take a theoretical CDF and ask whether a set of parameters maximizes the probability that the data are drawn from it. Formally, you should MCMC this with errors on each of the masses (I have some protocode that does this not-well if you want to go this route); a rough sketch is below.
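
This sketch assumes Gaussian errors on each mass, a bounded power law, and emcee for the sampling; the data file, errors, bounds, and prior range are all placeholders:

```python
import numpy as np
import emcee

rng = np.random.default_rng(42)
masses = np.loadtxt("masses.txt")                # hypothetical masses
mass_errs = 0.1 * masses                         # hypothetical 1-sigma errors
M_LO, M_HI = masses.min() / 2, masses.max() * 2  # assumed power-law bounds

# Pre-drawn perturbations: a Monte Carlo marginalization over the mass errors.
K = 64
perturbed = masses[:, None] + mass_errs[:, None] * rng.standard_normal((masses.size, K))

def powerlaw_pdf(m, alpha):
    """Bounded power law p(m) = norm * m**-alpha on [M_LO, M_HI], 0 outside."""
    norm = (1.0 - alpha) / (M_HI**(1.0 - alpha) - M_LO**(1.0 - alpha))
    m_safe = np.clip(m, M_LO, M_HI)  # avoid NaNs from negative draws
    return np.where((m >= M_LO) & (m <= M_HI), norm * m_safe**(-alpha), 0.0)

def log_prob(theta):
    (alpha,) = theta
    if not 1.0 < alpha < 4.0:        # flat prior on the slope
        return -np.inf
    # Per-point likelihood, averaged over each mass's error distribution.
    like = powerlaw_pdf(perturbed, alpha).mean(axis=1)
    if np.any(like <= 0.0):
        return -np.inf
    return np.log(like).sum()

nwalkers, ndim = 16, 1
p0 = 2.0 + 0.1 * rng.standard_normal((nwalkers, ndim))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2000)
```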

Finally, the model of the functional form is interesting, and I'd like to point out that I was scooped on a great idea not once but twice... in the same week, on a better model -- the power-law log-normal:

http://adsabs.harvard.edu/abs/2015arXiv150302420B
http://adsabs.harvard.edu/abs/2015arXiv150301306B

I'm working with my student on a Python implementation of the Pareto log-normal as part of scipy.stats, but it's slipped well down our priority queue given these publications.

best wishes, e.

Chris

> I soapboxed in detail about this in http://adsabs.harvard.edu/abs/2005PASP..117.1403R and argued that binning was evil. Chris' method, I believe, will implicitly require some kind of binning in that you have to generate a PDF estimate

I don't think so (unless I misunderstand what the data look like). My assumption is that you're starting out with a collection of mass measurements, feeding them into a histogram function, and then fitting the output from that. My suggestion is to skip the histogram step, take the raw unbinned mass numbers, and then find the maximum-likelihood PDF.

Erik

Chris points out that we're basically in agreement on the approach, w.r.t. maximizing the probability that the data are drawn from the PDF. It's important to stress that, for a power law, the PDF is poorly normalized unless it's bounded, and that the analytic PDF (i.e., what I confused for his data in step 1) is written as a function of the PDF's parameters. For concreteness, the bounded form is below.
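
A sketch of the bounded power-law PDF, with assumed cutoffs M_min and M_max (and alpha != 1):

```latex
% Bounded power law, normalized so it integrates to 1 on [M_min, M_max]:
p(M \mid \alpha) = \frac{1 - \alpha}{M_{\max}^{\,1-\alpha} - M_{\min}^{\,1-\alpha}}\, M^{-\alpha},
\qquad M_{\min} \le M \le M_{\max}.
```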

best wishes, e.