lsst-sims / legacy_sims_maf

LSST Simulations package for the metrics analysis framework (MAF)
13 stars 19 forks source link

U/lynnej/fix summary hist #206

Closed rhiannonlynne closed 3 years ago

rhiannonlynne commented 3 years ago

Fixes a small bug in the summary histogram that showed up when bin widths were uneven.

rhiannonlynne commented 3 years ago

This made me go back to look again at what matplotlib does when plotting histograms, which made me then think I should go back to how I used to plot these histograms where the edges of the bins aligned with the bin values (instead of forcing the steps to go up/down so that the bin center was in the middle of these steps).

The number of the bins is one more than the number of the histogram values, which somehow was making this all more complicated too (using your code snippet, I did include the last bin but the overall length of x was greater than the length of y, so it wouldn't have plotted --- using my version, the lengths were right but the last bin boundary wasn't included.)

I think I need to drop the adjustment for the bin width entirely - this should align more closely with what matplotlib does when plotting and should match what the histogram is calculating better (i.e. the intervals should match). I think I just got confused by the plots before. The reason it was confusing was because typically, this is used when doing something like counting the number of nights with X visits -- X being the bins. So if the bins are set to bins=np.array([0, 1, 3, 10]) you would have the counts of number of nights with 0 visits, 1 visits, or 3-10 visits .. and it gets confusing because then when you plot this, the histogram lines go up at 0/1/3 .. (which is why I was trying to force the bin alignments to make the integer numbers be in the middle, so it was clearer). This seems like the wrong way to go - changing the bins in the first place would work,or just remembering how the numpy histogram function works would also make sense.

rhiannonlynne commented 3 years ago

np.vstack is much faster it looks like!

rhiannonlynne commented 3 years ago

Here's an example of the source of the original confusion image but if I remember how histograms are generated, then the y values always correspond to the x values on the left.

ehneilsen commented 3 years ago

The whole matplotlib approach is more intuitive if you think of the "bins" sent to matplotlib as bin boundries to begin with. To get intuitive histograms of integers, it helps to offset "bins" from integers by a half, eg: bins=np.arange(10)+0.5.

ehneilsen commented 3 years ago

regarding np.vstack: in general, if you ever find yourself converting from numpy arrays to python tuples or lists and back, there is very likely to be a much faster (and often clearer) way to do it entirely within numpy. Conversion between the two data structures has significant overhead, and except for operations that change the length of the list or array, the numpy operations are usually much faster than corresponding list operations anyway. Often, as in the case with the vstack use here, the overall operation is faster even if it does involve creation of arrays with new sizes, because the overhead of converting between lists and arrays is higher than the overhead of changing the array size a few times within numpy.

rhiannonlynne commented 3 years ago

The whole matplotlib approach is more intuitive if you think of the "bins" sent to matplotlib as bin boundries to begin with. To get intuitive histograms of integers, it helps to offset "bins" from integers by a half, eg: bins=np.arange(10)+0.5.

Yes, I think this is what to do.

rhiannonlynne commented 3 years ago

Merged and deleted branch u/lynnej/fix_summaryHist.