topepo / FES

Code and Resources for "Feature Engineering and Selection: A Practical Approach for Predictive Models" by Kuhn and Johnson
https://bookdown.org/max/FES
GNU General Public License v2.0

Minor typos in chapters 2 - 4 #15

Closed by kviip 6 years ago

kviip commented 6 years ago

In version dated "2018-05-12":

Physicians have an strong preference towards logistic regression due to its inherent interpretability.

Should be "a"

It is also interesting to note that the model of the risk set requires all 8 predictors while recursive feature elimination for the risk, imaging predictors and imaging predictor interactions set only requires only 4 predictors to achieve a better cross-validated area under the ROC curve.

One "only" is unnecessary.

These three pairs are highlighted in red boxes along the diagonal of the coorleation matrix in Figure 2.3.

Should be "correlation".

These topics are fairly general with regards to empirical modeling and include: metric for measuring performance for regression and classification problems, approaches for optimal data usage which includes data splitting and resampling, best practices for model tuning, and recommendations for comparing model performance.

Should be "include" not "includes"

While the imbalance hasa significant impact on the analysis, the illustration presented here will mostly side-step this issue by down-sampling the instances such that the number of profiles in each class are equal.

Should be "has a".

The question that one really wants to know is “if my value was predicted to be an event, what is are the chances that it is truly is an event?” or Pr[Y = STEM|P = STEM].

"is" is unnecessary.

Sensitivity (or specificity, depending on one’s point of view) are the “likelihood” parts of this equation.

Should be "is ... part" instead.

Table 3.1 can also be visualized using a mosaic plot such as the one shown in Figure 3.3(b) where the size of the blocks are proportional to the amount of data in each cell.

Should be either "where the sizes ... are" or "where the size ... is".

The mosaic plot for this confusion matrix is shown in Figure 3.3(a) where the blue block in the upper left becomes larger but there is also an increase in the red block in the lower right.

There is no red block in the lower right (probably meant to be upper right).

The test set is used only at the conclusion of these activities for estimating a final, unbiased assessment of the model’s performance. It is critical that the test set not be used prior to this point. Looking at its results will bias the outcomes since the testing data will have become part of the model development process.

"Its" in the last sentence above refers to the test set, which has no "results", so that sentence probably needs rephrasing.

Also, Section 4.4 has a more extensive description of how the assessment datasets can be used to drive improvements to models.

Should be "data sets" instead of "datasets".

This is somewhat of a simplification.

"of" is unnecessary.

For example, in Section 1.1, a transformation procedure was used to modify the predictors variables and this resulted in an improvement in performance.

"predictors variables" is incorrect, I think; it should be "predictor variables", or just "predictors" or "variables".

While the test set data often have the outcome data blinded, it is possible to “train to the test” by only using the training set samples that are most similar to the test set data.

Should be "has" instead of "have".

This is the rate at which coefficients are randomly set to zero during and is most likely to attenuate overfitting (Srivastava et al. 2014).

Probably an unfinished sentence: "during ..."?

The learning rate parameter controls the rate of decent during the parameter estimation iterations and these values were contrasted to be between zero and one.

Should be "descent".

Depending on the problem, this bias might over-estimate the model’s true performance.

Should be "overestimate" instead.

kviip commented 6 years ago

The focus of this chapter will be to present approaches for visually exploring data and to demonstrate how this approach can be used to help guide feature engineering.

Both should be "approaches"

KnightAdz commented 6 years ago

Part 4.2.1:

However, if the outcome were transformed prior to modeling, it would ensure than negative ridership could not be predicted.

Should be: "However, if the outcome were transformed prior to modeling, it would ensure that negative ridership could not be predicted."

(as see in Figure 1.7).

Should be: "(as seen in Figure 1.7)."

(Side note: does Figure 1.7 show this? I'm guessing that you're referring to the y-axis being in natural units and not log units?)

On station particularly stands out,

Should be: "One station particularly stands out,"

KnightAdz commented 6 years ago

Part 4.2.3

to uncover relationships between pairs of predictors, an to understand if

Should be: "to uncover relationships between pairs of predictors, and to understand if"

KnightAdz commented 6 years ago

4.3 Visualizations for Categorical Data: Explorating the OkCupid Data

Should be: "4.3 Visualizations for Categorical Data: Exploring the OkCupid Data"

topepo commented 6 years ago

I think of data as plural so "data are" is appropriate.