introdsci / DataScience-mnmenator


Final Review #7

Open · CodeMastr3 opened 4 years ago

CodeMastr3 commented 4 years ago

Summary

Here, place a 1-paragraph summary that outlines (1) what the project investigated, (2) what insights/conclusions they found, and (3) what is the next planned step.

This project investigated whether tournament seeding is actually accurate, which variables make a player likely to win a match, and how lifetime data on those variables could be used to seed players. The author may not have explicitly intended that last point, but I believe the work lends itself to it. A possible next step is for players to work on improving the specific variables that predict wins with low p-values.

Data Preparation

Here, describe (1) what tables have been developed and what kind of information they hold; (2) answer: does the portfolio demonstrate tidy organization? (3) answer: does the portfolio demonstrate cleaned data? If any of these answers are NO or could be improved to make it easier for the general public to understand, provide specific guidance on how it could be improved.

He developed a table containing all the matches for a given year (2017, I believe). The tables initially held just player and match information, and were later expanded with point totals and court surface types. The portfolio does demonstrate tidy organization, and the data was cleaned well: he removed the columns he didn't need.
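
For illustration, the cleaning described might look something like the sketch below. The `matches_2017` data frame and the column names are assumptions in the style of common ATP match datasets, not the author's actual code.

```r
# A minimal cleaning sketch, assuming a hypothetical matches_2017 data
# frame with ATP-style column names (not the author's actual variables).
library(dplyr)
library(tidyr)

matches <- matches_2017 %>%
  # keep only the columns relevant to the analysis
  select(match_id, tourney_slug, surface,
         winner_total_points_won, loser_total_points_won) %>%
  # drop incomplete rows so the model sees clean data
  drop_na()
```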

Modeling

Here, describe (1) what predictive models have been built, and what are their dependent variables and predictors? (2) answer: does the portfolio accurately describe the purpose of the models? (3) answer: does the portfolio accurately interpret the model's summary?

His predictive model used winner_total_points_won as the dependent variable, but excluded match_id and tourney_slug as predictors because they are just identifiers. I believe the portfolio accurately describes the purpose of the model, and it looks like it accurately interprets the model's summary.
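
As a rough sketch of what such a model might look like in R (the exact formula is an assumption; only the dependent variable and the excluded identifier columns come from the portfolio):

```r
# Hypothetical model sketch: predict winner_total_points_won from all
# remaining columns, excluding the identifier columns, which carry no signal.
model <- lm(winner_total_points_won ~ . - match_id - tourney_slug,
            data = matches)
summary(model)  # coefficient estimates and p-values for each predictor
```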

Validation

Here, answer: (1) has a model been cross-validated using testing and training sets? (2) has the accuracy of the cross-validation been explained clearly and appropriately?

He did split the data into testing and training sets. Yes, he clearly explains the accuracy of the cross-validation and the reasons why the results are statistically significant.
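
A seeded train/test split of the kind he used might look like this minimal sketch (the seed value and the 80/20 ratio are assumptions, not taken from the portfolio):

```r
# Seeded split: set.seed makes the random sample, and hence the whole
# analysis, reproducible. The 80/20 ratio is an assumed choice.
set.seed(42)
train_idx <- sample(nrow(matches), size = floor(0.8 * nrow(matches)))
train <- matches[train_idx, ]
test  <- matches[-train_idx, ]

model <- lm(winner_total_points_won ~ . - match_id - tourney_slug,
            data = train)
preds <- predict(model, newdata = test)  # evaluate on held-out data
```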

R Proficiency

Here, describe the strengths and weaknesses of how the R code has been developed; is it easy to read and understand? Have appropriate R techniques been used to make the code easy to maintain and reuse? Have appropriate functional programming techniques been used?

He described what he was doing well between the R code chunks, and the R code itself looked fine. He seeded the testing/training split, so everything seems easily reproducible if you gather the data from the links he provided. I don't believe he really needed to write functions for anything, because little of the code was repeated often enough to warrant it.

Communication

Has the portfolio been described in enough detail, but in wording that is easy for anyone to understand? Are visualizations used effectively to help communicate the data? What are its strengths and weaknesses?

While the portfolio was definitely a wall of text, it was split into easy paragraph chunks that allow it to be put down and picked back up later. The visualizations were interesting; my only gripe was the neon green color they used. The strength of the graphs was in his explanations of them.

Critical Thinking

Does the operationalization and social impact demonstrate careful, critical thought about the future of the project? What are possible unintended consequences or variables that the author has not discussed?

There wasn't much operationalization to be done on this data set, because it was mostly a question of "Is this good?", and it appears the answer is yes.

mnmenator commented 4 years ago

Data Preparation and Modeling (19 out of 20%)

My data was tidied and cleaned well and my model contains all proper predictors. However, some more visualizations to go alongside the model could have been beneficial.

Validation and Operationalization (19 out of 20%)

Cross-validation was performed and interpreted properly on multiple iterations of the model, though the operationalization leaves a bit to be desired. I genuinely believe that there was not much room for operationalization from the data I gathered, but that's possibly an indication that I should have considered a more useful model.

R Proficiency (19 out of 20%)

Aside from the for loops for adding surfaces to my model, my R code is very efficient.
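
For what it's worth, one loop-free alternative for attaching surfaces would be to let `model.matrix` build the indicator columns in a single vectorized call; the `surface` column name here is an assumption about the data, not taken from the portfolio.

```r
# Vectorized alternative to a surface-adding for loop: model.matrix
# expands a factor into one-hot indicator columns in one call.
matches$surface <- factor(matches$surface)
surface_dummies <- model.matrix(~ surface - 1, data = matches)
matches <- cbind(matches, surface_dummies)
```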

Communication (19 out of 20%)

I'd like to think I explain my reasoning and train of thought well enough that it is easy to follow even for someone who is not as familiar with statistical analysis. However, it's possible that someone who knows nothing about tennis might get a bit lost in my analysis.

Critical Thinking (20 out of 20%)

I think that the questions I ask and the methods by which I investigate them show a solid understanding of both my subject matter and statistical principles in general. I'm particularly proud of realizing that using point data directly would lead to improper analysis, and of handling it by using point ratios instead.
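
For instance, a point ratio of the kind described could be computed like this (the column names are assumptions, not taken from the portfolio):

```r
# Sketch of the point-ratio idea: dividing by total points played removes
# the dependence on match length that raw point counts would carry.
matches$winner_point_ratio <-
  matches$winner_total_points_won /
  (matches$winner_total_points_won + matches$loser_total_points_won)
```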