@liuzhen529 @gcohenfr @estellaqi

Hi Zhen,

I've uploaded a copy of your report with my comments. You can find it here:

Let me know if you have any issues viewing the comments.

In addition, the following are my comments for your R code:

R Code feedback:

Note that the headers of the code chunks ({r ...}) contain extra text. knitr treats that text as the chunk label, and the .Rmd file will not knit when two chunks share the same label. Please remove the extra text; Markdown headers are sufficient.
I recommend loading all R packages in a single code chunk at the top of the script, to keep things organized. Feel free to add comments to highlight which functions are being used from each package, but it’s not necessary.
Also, now that you’ve identified your final models, proceed to conduct model diagnostics (check the residuals, identify any outliers, etc.).
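
To illustrate the kind of diagnostics I mean, here is a minimal sketch. It uses R's built-in `mtcars` data as a stand-in, and the formula is purely illustrative (it is not one of your final models). The cutoffs of 2 and 4/n are common rules of thumb, not strict tests.

```r
# Stand-in model: mtcars replaces the survey data, and the formula is illustrative only.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals: values beyond roughly +/-2 deserve a closer look.
std_res <- rstandard(fit)
large_res <- names(std_res)[abs(std_res) > 2]

# Cook's distance: 4/n is a common rule-of-thumb cutoff for influential points.
cooks <- cooks.distance(fit)
influential <- names(cooks)[cooks > 4 / nrow(mtcars)]

# Shapiro-Wilk test on the residuals (complements a residual Q-Q plot).
shapiro_p <- shapiro.test(residuals(fit))$p.value

print(large_res)
print(influential)
print(shapiro_p)

# In the .Rmd, plot(fit) draws the standard diagnostic plots
# (residuals vs fitted, Q-Q, scale-location, residuals vs leverage).
```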

Data Pre-Processing:

As we previously discussed in the issues, I recommended ignoring observations in which the response is missing. However, I followed up and said that the observation with a 7 instead of a 5 for num_use is fine to keep because it doesn’t affect the responses your group was examining (pvnumM and pvlitM). Even though the observation is missing values for many other variables, I don’t agree with outright eliminating the observation in this case.
The code that produces the Q-Q plots and the summary statistics is slightly inefficient. I recommend using a “for loop” and saving the summary results to a list.
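
Something along these lines would work. This is only a sketch: the data frame is simulated toy data standing in for your file, and I've assumed the loop runs over the two responses (pvnumM and pvlitM).

```r
# Toy data standing in for the survey file; the values are simulated.
set.seed(1)
dat <- data.frame(pvnumM = rnorm(50, mean = 270, sd = 40),
                  pvlitM = rnorm(50, mean = 275, sd = 45))

vars <- c("pvnumM", "pvlitM")

# Pre-allocate a named list to hold one summary per variable.
summaries <- vector("list", length(vars))
names(summaries) <- vars

# Standalone script only: send the plots to a temporary PDF.
# In the .Rmd, drop the pdf() and dev.off() lines so the plots appear in the report.
pdf(file.path(tempdir(), "qq_plots.pdf"))
for (v in vars) {
  qqnorm(dat[[v]], main = paste("Normal Q-Q plot:", v))
  qqline(dat[[v]])
  summaries[[v]] <- summary(dat[[v]])
}
invisible(dev.off())

print(summaries)
```

Adding another variable then only means adding its name to `vars`.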

Good way to enumerate the missing values present in each variable, but this section only checks for missing values and doesn’t display summary statistics.

It’s fine to change the observation’s ED_Level category from 8 to 4. But I’m not sure it’s acceptable to combine the observations from categories 3 and 4. Although there are only 6 observations in category 4 (after making the above change), consider checking what kind of estimates you obtain in the models if the categories are not combined.
Recommend using a “for loop” to convert the categorical variables into factors (efficient coding).
Useful commenting and well done identifying that FNFE12JR and FNAET12JR are identical columns.
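
A minimal sketch of the factor conversion, assuming the categorical columns are listed by name. The toy data frame and its values are made up for illustration; only the column names come from your data.

```r
# Toy data frame; the values are made up for illustration.
dat <- data.frame(ED_Level = c(1, 2, 3, 4, 3),
                  pub_priv = c(1, 2, 1, NA, 2),
                  pvnumM   = c(250.1, 301.4, 275.0, 260.2, 289.9))

# List the categorical columns once, by name.
cat_vars <- c("ED_Level", "pub_priv")

# Convert each listed column to a factor; numeric columns are left alone.
for (v in cat_vars) {
  dat[[v]] <- factor(dat[[v]])
}

str(dat)

# Equivalent without an explicit loop:
# dat[cat_vars] <- lapply(dat[cat_vars], factor)
```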

As a general remark, it might be better to use the column names instead of numbers to filter data. It improves reproducibility of your code. Maybe even select which columns NOT to include.
Since the group decided to look at all of the data together (both public and private sector), I suggest only performing “na.omit” in the linear regression stage. This is because some models may not include “pub_priv” as a variable, meaning the 39 observations with missing data for “pub_priv” can still be used.
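
Here is a rough sketch of both points, on simulated toy data. The columns `age` and `id_code` are hypothetical names I made up for the example, and I've set 5 values of pub_priv to missing rather than 39 to keep it small.

```r
# Simulated toy data; 'age' and 'id_code' are hypothetical columns.
set.seed(2)
n <- 60
dat <- data.frame(
  pvnumM   = rnorm(n, mean = 270, sd = 40),
  age      = sample(20:60, n, replace = TRUE),
  pub_priv = factor(sample(c("public", "private"), n, replace = TRUE)),
  id_code  = seq_len(n)
)
dat$pub_priv[1:5] <- NA

# Exclude columns by name instead of by position.
drop_cols <- c("id_code")
dat <- dat[, !(names(dat) %in% drop_cols)]

# na.omit is applied per model, so only rows missing a variable
# that appears in the formula are dropped.
fit_without <- lm(pvnumM ~ age, data = dat, na.action = na.omit)
fit_with    <- lm(pvnumM ~ age + pub_priv, data = dat, na.action = na.omit)

print(nobs(fit_without))  # uses every row
print(nobs(fit_with))     # drops the rows with missing pub_priv
```

One caveat: models fitted to different numbers of rows cannot be compared directly with `anova()`, so refit on a common subset before comparing a model with “pub_priv” against one without it.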

Currently, the code installs the “corrplot” package but never loads it. You only need the code that loads the packages (no need to keep the install code).
Creative way to plot the correlation matrix so it’s easier to identify low and high correlations. However, I suggest looking into creating a heatmap so each cell is a solid colour instead of a circle. But this is pretty much on the right track.
Good idea to use a function to generate Cramer’s V values (and very creative, since your group was interested in quantifying associations between categorical variables). But it’s confusing to understand what’s going on within the function, so I recommend commenting some details.
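
As an example of the level of commenting I have in mind, here is a generic Cramer’s V function written from the standard formula. It is not a rewrite of your function, and the name `cramers_v` and the toy vectors are my own.

```r
# Cramer's V for two categorical vectors: V = sqrt(chi2 / (n * (k - 1))),
# where k is the smaller of the number of rows and columns in the table.
cramers_v <- function(x, y) {
  # Contingency table of the two variables (NAs are dropped by default).
  tab <- table(x, y)

  # Pearson chi-squared statistic without continuity correction.
  # Warnings about small expected counts are suppressed here on purpose.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)

  n <- sum(tab)                    # number of complete pairs
  k <- min(nrow(tab), ncol(tab))   # smaller table dimension

  unname(sqrt(chi2 / (n * (k - 1))))
}

# Toy vectors: perfectly associated vs. exactly independent.
a <- rep(c("a", "b"), each = 10)
u <- rep(c("u", "v"), times = 10)

print(cramers_v(a, a))  # perfect association
print(cramers_v(a, u))  # no association
```

V ranges from 0 (no association) to 1 (perfect association), which the two toy calls are meant to show.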

Well done labelling the axes and titles for the plots (often wrongly omitted during exploratory analysis). Maybe use the “mfrow” parameter of “par” to plot the boxplots together.
It seems that some plots were omitted for pvnumM.
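
A sketch of the side-by-side layout, looping over both responses so that neither is left out. The data are simulated, and grouping by ED_Level is just an assumption for the example.

```r
# Simulated toy data; grouping by ED_Level is an assumption for this example.
set.seed(3)
dat <- data.frame(
  pvnumM   = rnorm(80, mean = 270, sd = 40),
  pvlitM   = rnorm(80, mean = 275, sd = 45),
  ED_Level = factor(sample(1:4, 80, replace = TRUE))
)

# Standalone script only: write to a temporary PDF.
# In the .Rmd, drop the pdf() and dev.off() lines.
out_file <- file.path(tempdir(), "boxplots.pdf")
pdf(out_file, width = 8, height = 4)

# One row, two columns; keep the old settings so they can be restored.
old_par <- par(mfrow = c(1, 2))
for (resp in c("pvnumM", "pvlitM")) {
  boxplot(dat[[resp]] ~ dat$ED_Level,
          xlab = "ED_Level", ylab = resp,
          main = paste(resp, "by ED_Level"))
}
par(old_par)
invisible(dev.off())
```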

Everything seems to be in order here, just be aware that you conducted “forward” selection in all three code chunks. Consider using “exhaustive” to compare the results, if it is not too computationally intensive. I also recommend keeping “trace=T” to understand how variable selection was conducted.

Very efficient coding to make the model comparisons, but please provide some more commenting in these code chunks to make the code clear.
Please use markdown headers; some of the current headers make the .Rmd look disorganized in this section.
The section “2) ANOVA: With vs without one variable” compares the full model against the model without one variable. Is this intentional?
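
For reference, this is how I understand that kind of comparison, sketched on the built-in `mtcars` data with an illustrative formula (not your model): each predictor is dropped in turn and the reduced model is tested against the full one.

```r
# Stand-in full model on mtcars; the formula is illustrative only.
fit_full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Names of the terms in the full model.
terms_full <- attr(terms(fit_full), "term.labels")

# For each term, refit without it and run a partial F-test against the full model.
p_values <- sapply(terms_full, function(term) {
  fit_reduced <- update(fit_full, as.formula(paste(". ~ . -", term)))
  anova(fit_reduced, fit_full)[["Pr(>F)"]][2]
})

print(p_values)

# drop1(fit_full, test = "F") reports the same partial F-tests in one call.
```

A small p-value for a term suggests the full model fits significantly better with that term than without it.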