oldoc63 / learningDS

Learning DS with Codecademy and Books

Histogram #489

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

Sometimes we want to get a feel for a large dataset with many samples beyond knowing just the basic metrics of mean, median, or standard deviation. To get a more intuitive sense of a dataset, we can use a histogram to display all the values.

A histogram tells us how many values in a dataset fall between different sets of numbers (for example, how many numbers fall between 0 and 10? Between 10 and 20? Between 20 and 30?). Each of these questions represents a bin; for instance, our first bin might be between 0 and 10.

In the histograms discussed here, all bins are the same size. The width of each bin is the distance between the minimum and the maximum values of that bin. In our example, the width of each bin would be 10.

Each bin is represented by a different rectangle whose height is the number of elements from the dataset that fall within that bin.
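To make the idea of bins concrete, here is a minimal sketch in plain Python that counts how many values fall into the bins 0 to 10, 10 to 20, and 20 to 30. The values list is made up for illustration and is not part of the exercise data. Each bin is treated as including its left edge and excluding its right edge, which is an assumption of this sketch.

```python
# Hypothetical sample values, invented for this illustration.
values = [3, 7, 12, 15, 18, 21, 29, 8, 14, 25]

bin_width = 10
bin_edges = [0, 10, 20, 30]  # three bins: 0-10, 10-20, 20-30

# counts[i] is the number of values v with bin_edges[i] <= v < bin_edges[i + 1].
counts = [0] * (len(bin_edges) - 1)
for v in values:
    index = int((v - bin_edges[0]) // bin_width)
    if 0 <= index < len(counts):  # ignore values outside the covered range
        counts[index] += 1

# Each count is the height of the rectangle drawn for that bin.
for left, count in zip(bin_edges, counts):
    print(f"{left}-{left + bin_width}: {count}")
```

The three counts are the heights of the three rectangles a histogram of these values would show.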

To make a histogram in Matplotlib, we use the command plt.hist. By default, plt.hist finds the minimum and the maximum values in your dataset and creates 10 equally spaced bins between those values.

If we want more than 10 bins, we can use the keyword bins to set how many bins we want to divide the data into. For example: plt.hist(dataset, range=(66,69), bins=40). The keyword range selects the minimum and maximum values to plot.
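As a rough sketch of the two calls described above, the following compares the default binning with an explicit bins and range setting. The dataset here is synthetic (drawn from a normal distribution with a fixed seed) and is only an assumption for the sake of a runnable example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only; not the dataset from the lesson.
rng = np.random.default_rng(seed=0)
dataset = rng.normal(loc=67.5, scale=1.0, size=1000)

# Default call: 10 equal-width bins from the smallest to the largest value.
counts_default, edges_default, _ = plt.hist(dataset)
plt.close()

# Explicit call: 40 bins covering only the interval from 66 to 69.
# Values outside that interval are left out of the plot.
plt.figure()
counts, edges, _ = plt.hist(dataset, range=(66, 69), bins=40)
plt.xlabel("value")
plt.ylabel("number of samples")
# plt.show()  # uncomment to display the figure
```

plt.hist returns the bin counts and the bin edges, so len(edges) is always one more than len(counts).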

Histograms are best for showing the shape of a dataset. For example, you might see that values are close together, or skewed to one side. With this added intuition, we often discover other types of analysis we want to perform.

oldoc63 commented 1 year ago
  1. We've provided data in the file sales_times.csv and loaded it into a list called sales_times. This set represents the 270 sales at MatplotSip's first location from 8am to 10pm on a certain day.
oldoc63 commented 1 year ago
  1. Make a histogram out of this data in histogram.py using the plt.hist function.
oldoc63 commented 1 year ago
  1. Use the bins keyword to create 20 bins instead of the default 10.
oldoc63 commented 1 year ago

Multiple Histograms

If we want to compare two different distributions, we can put multiple histograms on the same plot. This could be useful, for example, in comparing the heights of a group of men and the heights of a group of women. However, it can be hard to read two histograms on top of each other. We have two ways we can solve a problem like this:

  1. Use the keyword alpha, which can be a value between 0 and 1. This sets the transparency of the histogram. A value of 0 would make the bars entirely transparent. A value of 1 would make the bars completely opaque.
  2. Use the keyword histtype with the argument 'step' to draw just the outline of a histogram.
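The two options above can be sketched side by side. The samples a and b are synthetic stand-ins for two height distributions, and the choice of two subplots, the shared range, and the bin count are assumptions made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic samples standing in for two groups of heights (illustration only).
rng = np.random.default_rng(seed=1)
a = rng.normal(loc=64, scale=2, size=10000)
b = rng.normal(loc=70, scale=2, size=10000)

fig, (ax_alpha, ax_step) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: semi-transparent bars, so the overlap stays visible.
counts_a, _, bars_a = ax_alpha.hist(a, range=(55, 80), bins=25, alpha=0.5, label="a")
counts_b, _, bars_b = ax_alpha.hist(b, range=(55, 80), bins=25, alpha=0.5, label="b")
ax_alpha.set_title("alpha=0.5")
ax_alpha.legend()

# Option 2: draw only the outline of each histogram.
ax_step.hist(a, range=(55, 80), bins=25, histtype="step", label="a")
ax_step.hist(b, range=(55, 80), bins=25, histtype="step", label="b")
ax_step.set_title("histtype='step'")
ax_step.legend()

# plt.show()  # uncomment to display the figure
```

Giving both histograms the same range and bins keeps their bars aligned, which makes the overlap easier to read.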

oldoc63 commented 1 year ago

Another problem we face is that our histograms might have different numbers of samples, making one much bigger than the other. We can see how this makes it difficult to compare qualitatively, by adding a dataset b with a much bigger size value:

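from numpy.random import normal
import matplotlib.pyplot as plt

# Two samples of very different sizes: b has ten times as many values as a
a = normal(loc=64, scale=2, size=10000)
b = normal(loc=70, scale=2, size=100000)

# Both histograms share the same range and number of bins
plt.hist(a, range=(55, 75), bins=20)
plt.hist(b, range=(55, 75), bins=20)
plt.show()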

oldoc63 commented 1 year ago

These histograms are very difficult to compare. To solve this, we can normalize them using density=True (the older normed keyword was deprecated and has been removed from newer Matplotlib versions). This divides the height of each bar by a constant such that the total shaded area of the histogram sums to 1.
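As a quick check of that claim, the sketch below plots the two samples with density=True and then adds up bar height times bar width for each histogram. The seed and the alpha value are assumptions added so the example is reproducible and readable; the sample parameters mirror the ones used above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)
a = rng.normal(loc=64, scale=2, size=10000)
b = rng.normal(loc=70, scale=2, size=100000)

# With density=True each bar height becomes
# count / (number of values inside the range * bin width).
heights_a, edges_a, _ = plt.hist(a, range=(55, 75), bins=20, alpha=0.5, density=True)
heights_b, edges_b, _ = plt.hist(b, range=(55, 75), bins=20, alpha=0.5, density=True)

# Total area = sum of (height * width) over all bars.
area_a = np.sum(heights_a * np.diff(edges_a))
area_b = np.sum(heights_b * np.diff(edges_b))
print(area_a, area_b)

# plt.show()  # uncomment to display the figure
```

Both areas come out as 1 (up to floating-point rounding) even though b has ten times as many samples, which is what lets the two shapes be compared directly.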

oldoc63 commented 1 year ago
  1. We provided another dataset in the file sales_times_s2.csv that represents the 371 sales at MatplotSip's second location from 8am to 10pm on the same day. This data has the same structure as the sales times data from store 1, with an id, a card_no, and a time. Take a look at the data in the csv and familiarize yourself with it. Using script.py, we've imported the times into a list called sales_times2.
  2. Plot the histogram of times from the second location on top of the one from the last exercise.
oldoc63 commented 1 year ago
  1. Notice that the histogram we plotted second completely obscures the first histogram we plotted. Modify the transparency value of both histograms to be 0.4 so that we can see the separate histograms better.
oldoc63 commented 1 year ago
  1. Normalize both histograms so that we can compare the patterns between them despite the differences in sample size.