markjacksonfishing / sf_giants_stats

Join me in celebrating my love for the San Francisco Giants! I've crafted a Go program to combine my love for data science and the SF Giants. With this program, you can enter any Major League Baseball team's abbreviation and receive their 2022 season batting stats from Baseball Reference.
MIT License

Include additional factors to improve the prediction accuracy #5

Closed markjacksonfishing closed 1 year ago

markjacksonfishing commented 1 year ago

Description

The current program uses linear regression analysis to predict the number of wins for a given baseball team based on its performance statistics in the previous season. However, there are additional factors that can influence a team's performance, such as the quality of its pitching or defense. Therefore, including these factors can improve the accuracy of the model.

Proposed Solution

To improve the accuracy of the program, we can include additional factors such as pitching statistics and defensive metrics in the regression analysis. These factors can be obtained by scraping relevant data from the team's page on Baseball Reference or other sources. We can also move from simple to multiple linear regression to account for the relationships between several independent variables and the dependent variable. Logistic regression is another option, but it models a binary outcome (for example, whether a team wins a given game) rather than a season win total, so using it would mean reframing the prediction target.

Expected Outcome

By including additional factors and using more sophisticated regression techniques, we can improve the accuracy of the program in predicting the number of wins for a given baseball team. This can help analysts make better predictions and inform decision-making in various fields, including sports analytics.

Additional Information

It is important to consider the trade-off between model complexity and prediction accuracy. Adding too many variables can lead to overfitting and reduced predictive power, while too few variables can result in an oversimplified model that does not capture all the relevant factors. Therefore, it is important to carefully select the variables to include in the model based on their statistical significance and practical relevance.

markjacksonfishing commented 1 year ago

Using my notes here: The error "mat: negative dimension" indicates that a matrix with a negative dimension is being created, which is not allowed. In this case, it seems to be happening in the line where the matrix X is being created from the data slice:

X := mat.NewDense(len(data), len(headers)-1, nil)

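One is subtracted from len(headers) because the first column of data is used for the dependent variable Y. If headers is empty, the column count becomes -1, which triggers this error. If headers contains only one element, the count is 0 rather than negative: no predictor columns are left, and depending on the version of the mat package, NewDense may reject that too, with a different message.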

To fix this error, I should make sure that headers contains at least two elements, or handle the case where it does not separately. I am going to try modifying the code to check the length of headers before creating the matrix, like:

if len(headers) < 2 {
    fmt.Fprintln(output, "Not enough headers found in HTML")
    return
}

X := mat.NewDense(len(data), len(headers)-1, nil)

I think this will prevent the creation of a matrix with a negative dimension and handle the case where there are not enough headers.
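A slightly broader version of the same guard could also cover an empty data slice and rows whose length does not match the headers. The sketch below is one way to do that with the standard library only. The helper name designDims and the assumption that data is a [][]float64 with one value per header are mine, not taken from the program.

```go
package main

import (
	"errors"
	"fmt"
)

// designDims returns the row and column counts for the predictor matrix X,
// or an error if the scraped table is too small or ragged.
// Hypothetical helper; assumes data holds one value per header in each row.
func designDims(headers []string, data [][]float64) (rows, cols int, err error) {
	if len(headers) < 2 {
		return 0, 0, errors.New("not enough headers found in HTML")
	}
	if len(data) == 0 {
		return 0, 0, errors.New("no data rows found in HTML")
	}
	for i, row := range data {
		if len(row) != len(headers) {
			return 0, 0, fmt.Errorf("row %d has %d values, want %d", i, len(row), len(headers))
		}
	}
	// The first column is Y, so X has one column fewer than headers.
	return len(data), len(headers) - 1, nil
}

func main() {
	headers := []string{"W", "R", "ERA"}                  // hypothetical column names
	data := [][]float64{{81, 716, 3.85}, {90, 750, 3.40}} // made-up values
	rows, cols, err := designDims(headers, data)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(rows, cols) // prints "2 2"
}
```

The returned rows and cols are both at least 1 whenever err is nil, so they could be passed straight to mat.NewDense in place of len(data) and len(headers)-1.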