ronvoluted / kaggle-nba

Team repository for the NBA Career Prediction Kaggle Competition from UTS Advanced Data Science for Innovation
https://kaggle.com/c/uts-advdsi-nba-career-prediction

Domain knowledge checks #1

Open ronvoluted opened 3 years ago

ronvoluted commented 3 years ago

@rudecat raised some good domain knowledge considerations, like the fact that the number of shots made should never be greater than the number of shots attempted. We should think of more checks like this and create a set of processed datasets or functions/classes for them.

The same also applies to missing/impossible values etc., but you are already doing that. The main gist of this issue is to think of basketball-related checks. We may not actually find any errors if Anthony is going easy on us haha.
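For example, a check like the one @rudecat raised could be a small function that flags any row where a "made" count exceeds the corresponding "attempted" count. This is only a sketch assuming pandas and column names like FGM/FGA, 3PM/3PA, FTM/FTA; the pairs would need to match the actual dataset headers:

import pandas as pd

def made_exceeds_attempted(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where a made-shots column exceeds its attempts column."""
    pairs = [("FGM", "FGA"), ("3PM", "3PA"), ("FTM", "FTA")]
    masks = [df[m] > df[a] for m, a in pairs if m in df.columns and a in df.columns]
    if not masks:
        return df.iloc[0:0]  # none of the expected columns are present
    return df[pd.concat(masks, axis=1).any(axis=1)]

Anything this returns is either a data error or a stat we have misunderstood.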

rudecat commented 3 years ago

@ronvoluted agree. I will be making some utility tools that can be shared across teams. I think we can start with the obvious checks and review the data more closely if we get stuck on performance; although data analysts supposedly spend 60% of their time on data prep, it's better to do this in iterations. Spending too much time up front might bias the machine learning.

Some issues I have noticed so far are listed below (a rough sketch of how checks 1, 2 and 4 could be handled follows the list):

  1. GP, 3PM, 3PA, 3P%, FT%, BLK have negative minimum values - Convert to absolute values - Done
  2. All % values are a bit off and don't match Made/Attempted - Consider dropping these fields or recreating them - Done
  3. BLK has some outliers - Fix them, drop them or ignore them
  4. 3PA and FTA have 0 values - Need to ensure 3PM and FTM are also 0 in these cases
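A rough pandas sketch of how checks 1, 2 and 4 could be applied (column names follow the list above and percentages are assumed to be on a 0-100 scale; this is not the exact code in the repo):

import numpy as np
import pandas as pd

def clean_stats(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. Counting/percentage stats cannot be negative, so treat sign flips as noise
    for col in ["GP", "3PM", "3PA", "3P%", "FT%", "BLK"]:
        if col in df.columns:
            df[col] = df[col].abs()

    # 2. Recreate percentages from made/attempted instead of trusting the given values
    if {"3PM", "3PA"}.issubset(df.columns):
        df["3P%"] = np.where(df["3PA"] > 0, df["3PM"] / df["3PA"] * 100, 0)

    # 4. If there were no attempts, there can be no makes
    for made, att in [("3PM", "3PA"), ("FTM", "FTA")]:
        if {made, att}.issubset(df.columns):
            df.loc[df[att] == 0, made] = 0

    return df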

What can be shared

  1. Data Preparation
  2. Cross Validation
  3. Evaluation
  4. Result Generation
  5. Utility tools (e.g. MinMaxScaler, train_test_split; see the sketch after this list)
  6. Artifacts (e.g. data dumps)
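As an example of item 5, a shared helper could wrap the usual scale-and-split boilerplate so every notebook calls it the same way (the function name and defaults here are illustrative, not what the repo actually ships):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def scale_and_split(X, y, test_size=0.2, random_state=42):
    """Split into train/validation sets, then scale features to [0, 1]."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    scaler = MinMaxScaler().fit(X_train)  # fit on training data only to avoid leakage
    return scaler.transform(X_train), scaler.transform(X_val), y_train, y_val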

What cannot be shared

  1. Feature engineering - as it really depends on what features we want to add/remove/manipulate for our models
  2. Modeling - again depends on the models and hyperparameters

What are your thoughts?

ronvoluted commented 3 years ago

Yep, only the 'meta' work would be useful to share 👍 If you like, I can convert your lambdas to defs in the src directory for shared use. I was also planning to write a class to help compare models, which would work like this in theory:

from src.data.compare import Comparison

comp = Comparison(classifier, X_train, X_validation, X_test)
# will pull in the current best.joblib and its split.npy files
# assumes we are working with the same target

# Print comparison of AUC, F1, MSE, MAE
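One possible shape for that class, sketched with a few liberties: it also needs the labels, so the constructor differs slightly from the call above, and the file layout, method names and metric details (MSE/MAE computed on predicted probabilities) are all assumptions rather than an agreed design:

from pathlib import Path

from joblib import load
from sklearn.metrics import (
    f1_score, mean_absolute_error, mean_squared_error, roc_auc_score,
)

class Comparison:
    """Compare a candidate classifier against the current best saved model."""

    def __init__(self, classifier, X_validation, y_validation, models_dir="models"):
        self.classifier = classifier  # candidate model, assumed already fitted
        self.X_val, self.y_val = X_validation, y_validation
        # Load the current best model; reading its split.npy files is omitted here
        self.best = load(Path(models_dir) / "best.joblib")

    def _metrics(self, model):
        proba = model.predict_proba(self.X_val)[:, 1]
        preds = (proba >= 0.5).astype(int)
        return {
            "AUC": roc_auc_score(self.y_val, proba),
            "F1": f1_score(self.y_val, preds),
            "MSE": mean_squared_error(self.y_val, proba),  # Brier-style, on probabilities
            "MAE": mean_absolute_error(self.y_val, proba),
        }

    def report(self):
        print("candidate:   ", self._metrics(self.classifier))
        print("current best:", self._metrics(self.best))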
rudecat commented 3 years ago

Sure. I have a basic one, but happy for you to work on the comparison module. My approach actually only passes the model, since the rest of the inputs should be the same/static. (This assumes we do feature engineering before evaluating all the potential models, so the evaluation module just prints out the performance of each model using the same processed data.)
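In that "pass only the model" style, the evaluation helper could be as small as the sketch below (names are illustrative and the shared data is a placeholder generated for the example, not the competition dataset):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Static, shared data: prepared once and reused for every experiment
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

def evaluate(model):
    """Fit on the shared processed data and print validation performance."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]
    print(f"{type(model).__name__}: "
          f"AUC={roc_auc_score(y_val, proba):.4f} "
          f"F1={f1_score(y_val, (proba >= 0.5).astype(int)):.4f}")

evaluate(LogisticRegression(max_iter=1000))
evaluate(RandomForestClassifier(random_state=42))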

ronvoluted commented 3 years ago

Oh yeah, the intent is to compare not only your own models, but also models from others, who will likely have different feature engineering. As mentioned, it was only in theory, so I'll see how it works in practice.