swirldev / swirl_courses

:mortar_board: A collection of interactive courses for the swirl R package.
http://swirlstats.com

Error/Bug in the Looking at Data section of the R Programming swirl course #531

Open KurayiChawatama opened 4 months ago

KurayiChawatama commented 4 months ago

In the output snippet below, from the Looking at Data section, the lesson text says that summary() truncated Active_Growth_Period by adding a catch-all category called 'Other'. It also makes other factor-specific claims, such as the counts of Perennial and Annual plants for Duration. None of this appears in the actual output: the categorical columns are all stored as "character", so summary() reports only Length, Class, and Mode for them.

| You are doing so well!

  |========================================================                                |  64%
| After previewing the top and bottom of the data, you probably noticed lots of NAs, which are
| R's placeholders for missing values. Use summary(plants) to get a better feel for how each
| variable is distributed and how much of the dataset is missing.

> summary(plants)
 Scientific_Name      Duration         Active_Growth_Period Foliage_Color          pH_Min     
 Length:5166        Length:5166        Length:5166          Length:5166        Min.   :3.000  
 Class :character   Class :character   Class :character     Class :character   1st Qu.:4.500  
 Mode  :character   Mode  :character   Mode  :character     Mode  :character   Median :5.000  
                                                                               Mean   :4.997  
                                                                               3rd Qu.:5.500  
                                                                               Max.   :7.000  
                                                                               NA's   :4327   
     pH_Max         Precip_Min      Precip_Max     Shade_Tolerance      Temp_Min_F    
 Min.   : 5.100   Min.   : 4.00   Min.   : 16.00   Length:5166        Min.   :-79.00  
 1st Qu.: 7.000   1st Qu.:16.75   1st Qu.: 55.00   Class :character   1st Qu.:-38.00  
 Median : 7.300   Median :28.00   Median : 60.00   Mode  :character   Median :-33.00  
 Mean   : 7.344   Mean   :25.57   Mean   : 58.73                      Mean   :-22.53  
 3rd Qu.: 7.800   3rd Qu.:32.00   3rd Qu.: 60.00                      3rd Qu.:-18.00  
 Max.   :10.000   Max.   :60.00   Max.   :200.00                      Max.   : 52.00  
 NA's   :4327     NA's   :4338    NA's   :4338                        NA's   :4328    

| Keep up the great work!

  |============================================================                            |  68%
| summary() provides different output for each variable, depending on its class. For numeric data
| such as Precip_Min, summary() displays the minimum, 1st quartile, median, mean, 3rd quartile,
| and maximum. These values help us understand how the data are distributed.

...

  |===============================================================                         |  72%
| For categorical variables (called 'factor' variables in R), summary() displays the number of
| times each value (or 'level') occurs in the data. For example, each value of Scientific_Name
| only appears once, since it is unique to a specific plant. In contrast, the summary for
| Duration (also a factor variable) tells us that our dataset contains 3031 Perennial plants, 682
| Annual plants, etc.

...

  |===================================================================                     |  76%
| You can see that R truncated the summary for Active_Growth_Period by including a catch-all
| category called 'Other'. Since it is a categorical/factor variable, we can see how many times
| each value actually occurs in the data with table(plants$Active_Growth_Period).
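
The mismatch can be reproduced outside swirl. The sketch below uses a small made-up data frame (`toy`, with hypothetical values, not the real plants dataset) to show how summary() treats a character column versus a factor column. It assumes the lesson text was written when these columns were loaded as factors. Since R 4.0.0, data.frame() and read.csv() default to stringsAsFactors = FALSE, which may be why the columns now arrive as character, though I have not confirmed how the lesson loads its data.

```r
# Toy stand-in for the plants data (hypothetical values, not the real dataset).
toy <- data.frame(
  Duration = c("Perennial", "Annual", "Perennial", NA),
  pH_Min   = c(4.5, 5.0, NA, 6.0),
  stringsAsFactors = FALSE  # explicit here; also the default since R 4.0.0
)

# Character column: summary() reports only Length, Class, and Mode.
summary(toy$Duration)

# Converting to a factor restores the per-level counts the lesson describes.
toy$Duration <- as.factor(toy$Duration)
summary(toy$Duration)

# table() counts values for either class; useNA = "ifany" also lists NAs.
table(toy$Duration, useNA = "ifany")
```

Note that table(plants$Active_Growth_Period), which the lesson asks for next, still works on a character column, so only the explanatory text at 72% and 76% is out of step with the summary() output.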