swirldev / swirl_courses

:mortar_board: A collection of interactive courses for the swirl R package.
http://swirlstats.com

R Programming/Looking At Data: summary(plants) output does not match description #526

Open yesezra opened 11 months ago

yesezra commented 11 months ago

Hello! I'm currently going through the R Programming>Looking at Data lesson. I'm in R 4.3.1/RStudio 2023.06.1+524 on macOS 13.5.

After being instructed to try summary(plants), I get the following output:

 Scientific_Name      Duration         Active_Growth_Period Foliage_Color          pH_Min     
 Length:5166        Length:5166        Length:5166          Length:5166        Min.   :3.000  
 Class :character   Class :character   Class :character     Class :character   1st Qu.:4.500  
 Mode  :character   Mode  :character   Mode  :character     Mode  :character   Median :5.000  
                                                                               Mean   :4.997  
                                                                               3rd Qu.:5.500  
                                                                               Max.   :7.000  
                                                                               NA's   :4327   
     pH_Max         Precip_Min      Precip_Max     Shade_Tolerance      Temp_Min_F    
 Min.   : 5.100   Min.   : 4.00   Min.   : 16.00   Length:5166        Min.   :-79.00  
 1st Qu.: 7.000   1st Qu.:16.75   1st Qu.: 55.00   Class :character   1st Qu.:-38.00  
 Median : 7.300   Median :28.00   Median : 60.00   Mode  :character   Median :-33.00  
 Mean   : 7.344   Mean   :25.57   Mean   : 58.73                      Mean   :-22.53  
 3rd Qu.: 7.800   3rd Qu.:32.00   3rd Qu.: 60.00                      3rd Qu.:-18.00  
 Max.   :10.000   Max.   :60.00   Max.   :200.00                      Max.   : 52.00  
 NA's   :4327     NA's   :4338    NA's   :4338                        NA's   :4328  

However, the output is described by the lesson as follows:

Duration (also a factor variable) tells us that our dataset contains 3031 Perennial plants, 682 Annual plants, etc.

This does not match the output, which shows Duration as a character variable, not a factor. The same mismatch occurs with Active_Growth_Period, which the lesson describes as:

| You can see that R truncated the summary for Active_Growth_Period by including a catch-all
| category called 'Other'. Since it is a categorical/factor variable, we can see how many times
| each value actually occurs in the data with table(plants$Active_Growth_Period).

Perhaps something changed in the dataset or in the default output of summary(), but this is confusing and I'm not sure how to get output that matches the description. Many thanks for maintaining this valuable project!
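
One plausible cause, offered as an assumption rather than anything confirmed by the maintainers: R 4.0.0 changed the default of stringsAsFactors from TRUE to FALSE in data.frame() and read.csv(). If the lesson loads its data that way, text columns now arrive as character vectors, and summary() reports Length/Class/Mode instead of per-level counts. A minimal sketch with a made-up stand-in for plants (the names toy_chr and toy_fct and all values are hypothetical):

```r
# Hypothetical stand-in for the lesson's `plants` data; values are made up.
durations <- c("Perennial", "Annual", "Perennial", "Biennial")

# Default since R 4.0.0: strings stay character vectors.
toy_chr <- data.frame(Duration = durations, stringsAsFactors = FALSE)

# Default before R 4.0.0: strings are converted to factors.
toy_fct <- data.frame(Duration = durations, stringsAsFactors = TRUE)

class(toy_chr$Duration)  # character column
class(toy_fct$Duration)  # factor column

summary(toy_chr)  # Length / Class / Mode, like the output in this issue
summary(toy_fct)  # per-level counts, like the lesson text describes
```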

yesezra commented 11 months ago

In case this is useful for any other beginners finding this issue, I worked around it by coercing the appropriate columns from character vectors into factors:

plants$Active_Growth_Period <- as.factor(plants$Active_Growth_Period)
plants$Duration <- as.factor(plants$Duration)
plants$Foliage_Color <- as.factor(plants$Foliage_Color)
plants$Shade_Tolerance <- as.factor(plants$Shade_Tolerance)
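
The same coercion can be written once over a vector of column names, which may be easier to extend. This is only a sketch on a made-up stand-in for plants: plants_toy, factor_cols, and every value below are hypothetical, and only the column names come from this thread.

```r
# Hypothetical stand-in for `plants`; only the column names come from the thread.
plants_toy <- data.frame(
  Scientific_Name = c("Species A", "Species B", "Species C"),
  Duration = c("Perennial", "Perennial", "Annual"),
  Active_Growth_Period = c("Spring", "Spring and Summer", "Spring"),
  pH_Min = c(4.5, NA, 5.0),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical (a subset, for illustration).
factor_cols <- c("Duration", "Active_Growth_Period")

# Convert every listed column to a factor in one assignment.
plants_toy[factor_cols] <- lapply(plants_toy[factor_cols], as.factor)

summary(plants_toy)                      # factor columns now show counts
table(plants_toy$Active_Growth_Period)   # counts per value
```

With the real data, factor_cols would list Duration, Active_Growth_Period, Foliage_Color, and Shade_Tolerance. Note that table() also counts values in a plain character vector, so that step of the lesson works even without the conversion.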

gdickens commented 10 months ago

Just adding my voice here: faced the same issue.

The Scientific_Name variable is a character, not a factor.