Open ronvoluted opened 3 years ago
@ronvoluted agreed. I will be making some utility tools that can be shared across teams. I think we can start with the obvious checks and review the data more closely if we get stuck on performance. Although about 60% of a data analyst's time is spent on data prep, it's better to do this iteratively; spending too much time up front might bias the machine learning.
Some issues I have noticed so far are listed below
What can be shared
What cannot be shared
What are your thoughts?
Yep, only the 'meta' work would be useful for shared use 👍 If you like, I can convert your lambdas to defs in the src directory for shared use. I was also planning to write a class to help compare models that would work like this in theory:
from src.data.compare import Comparison
comp = Comparison(classifier, X_train, X_validation, X_test)
# will pull in the current best.joblib and its split.npy files
# assumes we are working with the same target
# Print comparison of AUC, F1, MSE, MAE
Sure. I have a basic one, but I'm happy for you to work on the comparison module. My version only takes the model, since the rest of the inputs should be the same/static. (This assumes we do feature engineering before evaluating all the candidate models, so the evaluation module just prints out the performance of each model on the same processed data.)
Oh yeah, the intent is to compare not only your own models but also models from others, who will likely have different feature engineering. As mentioned, it was only in theory, so I'll see how it works in practice.
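For illustration, here is a minimal sketch of one way both approaches could meet: each candidate is passed as a plain `predict` callable that wraps its own feature engineering, and every candidate is scored on the same raw rows. The names `evaluate` and `compare` are hypothetical and not the planned `Comparison` API; the metrics are written in plain Python to keep the sketch self-contained, and AUC is left out because it needs predicted probabilities rather than labels.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for labels in {0, 1}; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(predict, X, y):
    """Score one candidate; `predict` maps a raw row to a 0/1 label."""
    y_pred = [predict(row) for row in X]
    return {"F1": f1_score(y, y_pred), "MSE": mse(y, y_pred), "MAE": mae(y, y_pred)}


def compare(candidates, X, y):
    """candidates: dict of name -> predict callable (hypothetical interface)."""
    return {name: evaluate(predict, X, y) for name, predict in candidates.items()}


# Toy data and toy "models", purely for demonstration.
def threshold_model(row):
    return int(row[0] > 0.5)


def always_one(row):
    return 1


X = [[0.2], [0.8], [0.6], [0.4]]
y = [0, 1, 1, 0]
results = compare({"threshold": threshold_model, "always_one": always_one}, X, y)
for name, scores in results.items():
    print(name, scores)
```

Because the callable hides the feature engineering, the same function works whether all models share one processed dataset or each brings its own pipeline.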
@rudecat raised some good domain knowledge considerations like the fact that the number of successful scores should never be greater than the number of attempted scores. We should think of more checks like this and create a set of processed datasets or functions/classes.
/data/preprocessed/
Code may go in /src/data/ or /src/features/
The same also applies to missing/impossible values etc., but you are already doing that. The main gist of this issue is to think of basketball-related checks. We may not actually find any errors if Anthony is going easy on us haha.
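As a starting point, a check like "successful scores should never exceed attempted scores" could be sketched as below. The column names (`FGM`/`FGA`, `3PM`/`3PA`, `FTM`/`FTA`) are assumptions based on common basketball abbreviations, not the actual dataset headers, and `find_impossible_rows` is a hypothetical helper name.

```python
# Hypothetical (made, attempted) column pairs; rename to match the real dataset.
MADE_ATTEMPTED_PAIRS = [("FGM", "FGA"), ("3PM", "3PA"), ("FTM", "FTA")]


def find_impossible_rows(rows, pairs=MADE_ATTEMPTED_PAIRS):
    """Return (row_index, made_col, attempted_col) for every violated check.

    A pair is flagged when either count is negative or when more scores
    were made than attempted.
    """
    violations = []
    for i, row in enumerate(rows):
        for made_col, att_col in pairs:
            made, att = row.get(made_col), row.get(att_col)
            if made is None or att is None:
                continue  # missing values are left to a separate check
            if made < 0 or att < 0 or made > att:
                violations.append((i, made_col, att_col))
    return violations


# Toy rows, purely for demonstration.
rows = [
    {"FGM": 5, "FGA": 11, "FTM": 2, "FTA": 2},
    {"FGM": 7, "FGA": 6, "FTM": 1, "FTA": 3},   # more made than attempted
    {"FGM": 3, "FGA": 8, "FTM": -1, "FTA": 0},  # negative count
]
print(find_impossible_rows(rows))
```

Returning the offending rows rather than dropping them leaves the decision (fix, clip, or discard) to whoever is doing the feature engineering.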