vinostroud / nfl_analytics

MIT License

Check training and data sets in Function 4 to ensure no overlap #21

Closed vinostroud closed 4 months ago

vinostroud commented 6 months ago
Hmmm, so I am comfortable with the code as it is, but you may have found a separate issue with getting an accurate regression (which I'll raise as a small issue as well).

I am taking two deliberate steps: splitting my input/feature (x = epa) into training and test sets, and splitting my target variable (y) into training and test sets. Training includes all but the last 100 entries; test includes the last 400 entries. So there is likely some overlap: with 400 or more rows, 300 of them would land in both sets. (.reshape(-1,1) simply turns this into a 2D array with one column, which I need for the regression analysis.)
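To make the overlap concrete, here is a minimal sketch of the slicing described above. The array `epa` and its 1,000-row length are stand-ins invented for illustration, not the project's actual data, and the slices `[:-100]` and `[-400:]` are an assumed reading of "all but the last 100" and "the last 400".

```python
import numpy as np

# Hypothetical stand-in for the epa feature column; 1,000 rows is an assumption.
epa = np.arange(1000, dtype=float)

# Assumed slices: all but the last 100 rows for training, the last 400 for testing.
x_train = epa[:-100].reshape(-1, 1)  # reshape to a single-column 2D array
x_test = epa[-400:].reshape(-1, 1)

# Compare the row positions covered by each slice.
row_ids = np.arange(len(epa))
train_idx = row_ids[:-100]
test_idx = row_ids[-400:]
overlap = np.intersect1d(train_idx, test_idx)

print(x_train.shape, x_test.shape)  # (900, 1) (400, 1)
print(len(overlap))                 # 300 rows appear in both sets
```

Under these assumptions, rows 600 through 899 are used for both fitting and evaluation.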

The issue is that any overlap between training and test data can bias the numbers, because the model is evaluated on rows it has already seen. So I need to revisit why I made the split this way. I'd recommend against hard-coded constants because the slice will be affected by the data set size (the number of rows in the Excel file).
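One way to avoid both the overlap and the hard-coded constants is to derive a single cut point from the number of rows. The sketch below is only an illustration of that idea, not the fix that was eventually merged; the function name `split_by_fraction` and the 20% default are hypothetical.

```python
import numpy as np

def split_by_fraction(x, y, test_fraction=0.2):
    """Split x and y at one shared cut point so train and test never overlap.

    Hypothetical helper: the last `test_fraction` of rows become the test set.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")

    # The test size scales with the data set instead of being a constant.
    n_test = max(1, int(round(len(x) * test_fraction)))
    cut = len(x) - n_test

    # Features are reshaped to one column, as the regression expects 2D input.
    x_train = x[:cut].reshape(-1, 1)
    x_test = x[cut:].reshape(-1, 1)
    return x_train, x_test, y[:cut], y[cut:]

# Usage with made-up data: 50 rows, so 40 train and 10 test.
x_demo = np.arange(50)
y_demo = 2 * x_demo + 1
x_tr, x_te, y_tr, y_te = split_by_fraction(x_demo, y_demo)
print(x_tr.shape, x_te.shape)  # (40, 1) (10, 1)
```

Because both sets come from the same cut index, every row lands in exactly one of them regardless of how many rows the spreadsheet has. scikit-learn's `train_test_split(x, y, test_size=0.2, shuffle=False)` gives a comparable ordered split if that library is already a dependency.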

_Originally posted by @vinostroud in https://github.com/vinostroud/nfl_analytics/pull/18#discussion_r1609024159_

vinostroud commented 4 months ago

Fixed in last PR, closing!