ronvoluted / kaggle-nba

Team repository for the NBA Career Prediction Kaggle Competition from UTS Advanced Data Science for Innovation
https://kaggle.com/c/uts-advdsi-nba-career-prediction

Domain knowledge checks #1

Open ronvoluted opened 3 years ago

ronvoluted commented 3 years ago

@rudecat raised some good domain knowledge considerations, like the fact that the number of shots made should never be greater than the number of shots attempted. We should think of more checks like this and create a set of processed datasets or functions/classes for them.

The same also applies to missing/impossible values etc., but you are already doing that. The main gist of this issue is to think of basketball-related checks. We may not actually find any errors if Anthony is going easy on us haha.
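For example, a check like the one @rudecat raised could be a small function that flags any row where a "made" count exceeds the corresponding "attempted" count. This is only a sketch assuming pandas and column names like FGM/FGA, 3PM/3PA, FTM/FTA; the pairs would need to match the actual dataset headers:

import pandas as pd

def made_exceeds_attempted(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where a made-shots column exceeds its attempts column."""
    pairs = [("FGM", "FGA"), ("3PM", "3PA"), ("FTM", "FTA")]
    masks = [df[m] > df[a] for m, a in pairs if m in df.columns and a in df.columns]
    if not masks:
        return df.iloc[0:0]  # none of the expected columns are present
    return df[pd.concat(masks, axis=1).any(axis=1)]

Anything this returns is either a data error or a stat we have misunderstood.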

rudecat commented 3 years ago

@ronvoluted agree. I will be making some utility tools that can be shared across teams. I think we can start with the obvious checks and review the data more closely if we get stuck on performance; although data analysts supposedly spend 60% of their time on data prep, it's better to do this in iterations. Spending too much time up front might bias the machine learning.

Some issues I have noticed so far are listed below (a rough sketch of how checks 1, 2 and 4 could be handled follows the list):

  1. GP, 3PM, 3PA, 3P%, FT%, BLK have negative minimum values - Convert to absolute values - Done
  2. All % values are a bit off and don't match Made/Attempted - Consider dropping these fields or recreating them - Done
  3. BLK has some outliers - Fix them, drop them or ignore them
  4. 3PA and FTA have 0 values - Need to ensure 3PM and FTM are also 0 in these cases
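A rough pandas sketch of how checks 1, 2 and 4 could be applied (column names follow the list above and percentages are assumed to be on a 0-100 scale; this is not the exact code in the repo):

import numpy as np
import pandas as pd

def clean_stats(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. Counting/percentage stats cannot be negative, so treat sign flips as noise
    for col in ["GP", "3PM", "3PA", "3P%", "FT%", "BLK"]:
        if col in df.columns:
            df[col] = df[col].abs()

    # 2. Recreate percentages from made/attempted instead of trusting the given values
    if {"3PM", "3PA"}.issubset(df.columns):
        df["3P%"] = np.where(df["3PA"] > 0, df["3PM"] / df["3PA"] * 100, 0)

    # 4. If there were no attempts, there can be no makes
    for made, att in [("3PM", "3PA"), ("FTM", "FTA")]:
        if {made, att}.issubset(df.columns):
            df.loc[df[att] == 0, made] = 0

    return df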

What can be shared

  1. Data Preparation
  2. Cross Validation
  3. Evaluation
  4. Result Generation
  5. Utility tools (e.g. MinMaxScaler, train_test_split; see the sketch after this list)
  6. Artifacts (e.g. data dumps)
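As an example of item 5, a shared helper could wrap the usual scale-and-split boilerplate so every notebook calls it the same way (the function name and defaults here are illustrative, not what the repo actually ships):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def scale_and_split(X, y, test_size=0.2, random_state=42):
    """Split into train/validation sets, then scale features to [0, 1]."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    scaler = MinMaxScaler().fit(X_train)  # fit on training data only to avoid leakage
    return scaler.transform(X_train), scaler.transform(X_val), y_train, y_val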

What cannot be shared

  1. Feature engineering - as it really depends on what features we want to add/remove/manipulate for our models
  2. Modeling - again depends on the models and hyperparameters

What are your thoughts?

ronvoluted commented 3 years ago

Yep, only the 'meta' work would be useful to share 👍 If you like, I can convert your lambdas to defs in the src directory for shared use. I was also planning to write a class to help compare models, which would work like this in theory:

from src.data.compare import Comparison

comp = Comparison(classifier, X_train, X_validation, X_test)
# will pull in the current best.joblib and its split.npy files
# assumes we are working with the same target

# Print comparison of AUC, F1, MSE, MAE
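One possible shape for that class, sketched with a few liberties: it also needs the labels, so the constructor differs slightly from the call above, and the file layout, method names and metric details (MSE/MAE computed on predicted probabilities) are all assumptions rather than an agreed design:

from pathlib import Path

from joblib import load
from sklearn.metrics import (
    f1_score, mean_absolute_error, mean_squared_error, roc_auc_score,
)

class Comparison:
    """Compare a candidate classifier against the current best saved model."""

    def __init__(self, classifier, X_validation, y_validation, models_dir="models"):
        self.classifier = classifier  # candidate model, assumed already fitted
        self.X_val, self.y_val = X_validation, y_validation
        # Load the current best model; reading its split.npy files is omitted here
        self.best = load(Path(models_dir) / "best.joblib")

    def _metrics(self, model):
        proba = model.predict_proba(self.X_val)[:, 1]
        preds = (proba >= 0.5).astype(int)
        return {
            "AUC": roc_auc_score(self.y_val, proba),
            "F1": f1_score(self.y_val, preds),
            "MSE": mean_squared_error(self.y_val, proba),  # Brier-style, on probabilities
            "MAE": mean_absolute_error(self.y_val, proba),
        }

    def report(self):
        print("candidate:   ", self._metrics(self.classifier))
        print("current best:", self._metrics(self.best))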
rudecat commented 3 years ago

Sure. I have a basic one, but happy for you to work on the comparison module. My approach actually only passes the model, since the rest of the inputs should be the same/static. (This assumes we do feature engineering before evaluating all the potential models, so the evaluation module just prints out the performance of each model using the same processed data.)
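In that "pass only the model" style, the evaluation helper could be as small as the sketch below (names are illustrative and the shared data is a placeholder generated for the example, not the competition dataset):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Static, shared data: prepared once and reused for every experiment
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

def evaluate(model):
    """Fit on the shared processed data and print validation performance."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]
    print(f"{type(model).__name__}: "
          f"AUC={roc_auc_score(y_val, proba):.4f} "
          f"F1={f1_score(y_val, (proba >= 0.5).astype(int)):.4f}")

evaluate(LogisticRegression(max_iter=1000))
evaluate(RandomForestClassifier(random_state=42))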

ronvoluted commented 3 years ago

Oh yeah, the intent is to compare not only your own models, but also models from others, who will likely have different feature engineering. As mentioned, it was only in theory, so I'll see how it works in practice.