Open oldoc63 opened 1 year ago
If we want to compare two different distributions, we can put multiple histograms on the same plot. This could be useful, for example, in comparing the heights of a group of men and the heights of a group of women. However, it can be hard to read two histograms on top of each other. We have two ways we can solve a problem like this:
1. Use the keyword alpha, which can be a value between 0 and 1. This sets the transparency of the histogram. A value of 0 would make the bars entirely transparent. A value of 1 would make the bars completely opaque.
2. Use the keyword histtype with the argument 'step' to draw just the outline of a histogram.

Another problem we face is that our histograms might have different numbers of samples, making one much bigger than the other. We can see how this makes it difficult to compare qualitatively by adding a dataset b with a much bigger size value:
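from numpy.random import normal
import matplotlib.pyplot as plt

# b has ten times as many samples as a, so its bars are much taller
a = normal(loc=64, scale=2, size=10000)
b = normal(loc=70, scale=2, size=100000)
plt.hist(a, range=(55, 75), bins=20)
plt.hist(b, range=(55, 75), bins=20)
plt.show()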
These histograms are very difficult to compare. To solve this, we can normalize our histograms using the keyword density=True (the older normed keyword is deprecated and no longer available in recent Matplotlib releases). This keyword divides the height of each column by a constant such that the total shaded area of the histogram sums to 1.
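As a rough sketch of the ideas above, the snippet below checks numerically that density=True makes the shaded area equal 1 for both datasets, and then overlays them using alpha for one and histtype='step' for the other. The seed, the labels, and the helper name plot_normalized are assumptions made for illustration. np.histogram is used for the numeric check because it accepts the same range, bins, and density keywords.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seed is arbitrary, only for reproducibility
a = rng.normal(loc=64, scale=2, size=10_000)
b = rng.normal(loc=70, scale=2, size=100_000)

# Normalized bar heights for each dataset, using the same bins as the plot.
dens_a, edges = np.histogram(a, range=(55, 75), bins=20, density=True)
dens_b, _ = np.histogram(b, range=(55, 75), bins=20, density=True)

# Area of a histogram = sum over bins of (bar height * bin width).
widths = np.diff(edges)
area_a = float(np.sum(dens_a * widths))
area_b = float(np.sum(dens_b * widths))
print(area_a, area_b)


def plot_normalized(a, b):
    """Hypothetical helper: overlay two normalized histograms."""
    # Semi-transparent filled bars for a.
    plt.hist(a, range=(55, 75), bins=20, density=True, alpha=0.5, label="a")
    # Outline only for b, so it does not hide a.
    plt.hist(b, range=(55, 75), bins=20, density=True, histtype="step", label="b")
    plt.legend()
    plt.show()


if __name__ == "__main__":
    plot_normalized(a, b)
```

With normalization, the two histograms have comparable heights even though b has ten times as many samples as a.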
Sometimes we want to get a feel for a large dataset with many samples beyond knowing just the basic metrics of mean, median, or standard deviation. To get a more intuitive sense for a dataset, we can use a histogram to display all the values.
A histogram tells us how many values in a dataset fall between different sets of numbers (i.e., how many numbers fall between 0 and 10? Between 10 and 20? Between 20 and 30?). Each of these questions represents a bin; for instance, our first bin might be between 0 and 10.
In the histograms described here, all bins are the same size. The width of each bin is the distance between the minimum and the maximum values of that bin. In our example, the width of each bin would be 10.
Each bin is represented by a different rectangle whose height is the number of elements from the dataset that fall within that bin.
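The counting behind those rectangle heights can be sketched in plain Python. The function name count_bins and the sample values are made up for illustration, and each bin is treated as half-open (it includes its lower edge but not its upper edge), which is one common convention rather than the only one.

```python
def count_bins(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins.

    Bin i covers the half-open interval
    [start + i * width, start + (i + 1) * width).
    Values outside all bins are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


data = [3, 7, 12, 15, 18, 21, 29, 9]  # made-up sample values
counts = count_bins(data, start=0, width=10, n_bins=3)

# Each count is the height of one rectangle in the histogram.
for i, c in enumerate(counts):
    print(f"{i * 10} to {(i + 1) * 10}: {c}")
```

Here the bins 0 to 10, 10 to 20, and 20 to 30 receive 3, 3, and 2 values respectively.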
To make a histogram in Matplotlib, we use the command plt.hist. plt.hist finds the minimum and the maximum values in your dataset and creates 10 equally-spaced bins between those values.
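To see the default binning without drawing anything, one option is np.histogram, which also defaults to 10 equal-width bins spanning the minimum and maximum of the data. The dataset below is randomly generated for illustration and is an assumption, not data from these notes.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
dataset = rng.normal(loc=67.5, scale=1.0, size=1_000)  # hypothetical data

# No bins or range given, so the defaults apply.
counts, edges = np.histogram(dataset)

print(len(counts))           # number of bins
print(edges[0], edges[-1])   # first and last bin edges
print(np.diff(edges))        # bin widths, equal up to floating-point rounding
```

Ten bins need eleven edges, and the first and last edges coincide with the smallest and largest values in the dataset.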
If we want more than 10 bins, we can use the keyword bins to set how many bins we want to divide the data into. The keyword range selects the minimum and maximum values to plot. For example, this call plots only the values from 66 to 69, divided into 40 bins:
plt.hist(dataset, range=(66,69), bins=40)

Histograms are best for showing the shape of a dataset. For example, you might see that values are close together, or skewed to one side. With this added intuition, we often discover other types of analysis we want to perform.
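To make the effect of bins and range more concrete, the sketch below applies the same keyword values through np.histogram and inspects the resulting bin edges. The dataset is randomly generated and hypothetical. Values outside the chosen range are simply left out of the counts.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed for reproducibility
dataset = rng.normal(loc=67.5, scale=1.0, size=5_000)  # hypothetical data

# Same keyword values as the plt.hist call: 40 bins between 66 and 69.
counts, edges = np.histogram(dataset, range=(66, 69), bins=40)

bin_width = edges[1] - edges[0]  # (69 - 66) / 40
n_outside = int(np.sum((dataset < 66) | (dataset > 69)))

print(bin_width)
print(int(counts.sum()), n_outside)  # counted values vs. values left out
```

Narrowing range while raising bins gives much finer bins (width 0.075 here), which shows more detail of the shape in the region of interest.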