qzhu2017 / PyXtal_ml

a Python3 library for ML modeling materials properties

MIT License

yaml file readability #17

Closed yanxon closed 6 years ago

yanxon commented 6 years ago

I would like to improve the readability of the yaml file.

qzhu2017 commented 6 years ago

@yanxon Please follow the approach described in README.md.

Ideally, the yaml file should give the default parameters and the grid search options for each algorithm.

For the code in method.py, I suggest you include all of the following steps:

split the data into training and test sets
choose the estimator
grid search
process the features
fit and predict
cross validation

This way, we can call the ML method more conveniently without going to the details of each algorithm.
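The steps listed above can be sketched as a single helper. This is only an illustration, not the code in method.py: the function name `run_ml`, the choice of `RandomForestRegressor`, the synthetic data, and the reading of "process the features" as standard scaling are all assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_ml(X, y, param_grid, cv=5, test_size=0.2, random_state=0):
    """Hypothetical helper: split, scale, grid-search, fit, and predict."""
    # Split the data into training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Choose the estimator and process the features in one pipeline, so the
    # scaler is re-fitted on the training part of every cross-validation fold.
    pipeline = make_pipeline(
        StandardScaler(), RandomForestRegressor(random_state=random_state)
    )
    # Grid search with K-fold cross validation on the training set only.
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="r2")
    # Fit, then predict on the held-out test set.
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    return {
        "best_params": search.best_params_,
        "cv_r2": search.best_score_,
        "test_r2": r2_score(y_test, y_pred),
        "test_mae": mean_absolute_error(y_test, y_pred),
    }


# Synthetic data stands in for real material descriptors and properties.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
grid = {"randomforestregressor__n_estimators": [10, 20]}
result = run_ml(X, y, grid, cv=3)
print(result)
```

Parameters inside a pipeline are addressed as `<step name>__<parameter>`, which is why the grid key carries the `randomforestregressor__` prefix.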

yanxon commented 6 years ago

@qzhu2017

GridSearchCV automatically does the cross validation and the fit for us.

Please review method.py. I will add more ML algorithms once the structure is OK. Otherwise, I will have to change a lot later if the structure is not right.

Here is the list of what I changed:

'light' doesn't have any parameters. It uses the default parameters defined by scikit-learn.
'medium' doesn't have any parameters either. It uses the default parameters defined by scikit-learn and performs cross validation through GridSearchCV.
'tight' has parameters, which are defined in the yaml file.

The params in the yaml file are for 'tight' only. Also, I defined 'cv' for K-fold cross validation.
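A minimal sketch of how the three levels might map onto scikit-learn objects. The yaml layout, the `RandomForest` key, the parameter values, and the `build_model` function are hypothetical; the actual file and method.py in the repository may look different. It assumes PyYAML is installed.

```python
import yaml
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical yaml layout; the real file in the repository may differ.
CONFIG = """
RandomForest:
  cv: 10
  params:
    n_estimators: [10, 50, 100]
    max_depth: [3, 5, 10]
"""


def build_model(level, config_text=CONFIG):
    """Return an unfitted model for the 'light', 'medium', or 'tight' level."""
    settings = yaml.safe_load(config_text)["RandomForest"]
    estimator = RandomForestRegressor()
    if level == "light":
        # scikit-learn defaults, no cross validation.
        return estimator
    if level == "medium":
        # scikit-learn defaults; the empty grid holds only the default setting,
        # which is then scored with K-fold cross validation.
        return GridSearchCV(estimator, param_grid={}, cv=settings["cv"])
    if level == "tight":
        # Parameter grid and number of folds come from the yaml file.
        return GridSearchCV(
            estimator, param_grid=settings["params"], cv=settings["cv"]
        )
    raise ValueError(f"unknown level: {level!r}")


light = build_model("light")
medium = build_model("medium")
tight = build_model("tight")
```

Keeping `cv` next to `params` under each algorithm is one way to make the file readable at a glance, since everything that affects a given search sits in one block.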

I don't understand "process the features." Can you please explain?

We can use sklearn.model_selection.RandomizedSearchCV if GridSearchCV takes very loooong.

Howard
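As a rough illustration of the RandomizedSearchCV suggestion above, the sketch below samples a fixed number of parameter combinations instead of trying all of them. The estimator, the parameter values, and the synthetic data are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for real features and targets.
X, y = make_regression(n_samples=150, n_features=5, noise=0.1, random_state=0)

param_distributions = {
    "n_estimators": [10, 20, 30, 40, 50],
    "max_depth": [2, 4, 6, 8, 10],
}

# n_iter caps the number of sampled combinations: 4 of the 25 possible here.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X, y)
n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

The cost is then controlled by `n_iter` times the number of folds rather than by the full size of the grid.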

qzhu2017 commented 6 years ago

@yanxon I am not sure whether grid search does the cross validation for us. By cross validation, I mean splitting the data into a training set and a test set many times. That way, we ensure that the model does not depend on one particular split of the data.
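The repeated splitting described here can be sketched with `ShuffleSplit`, which draws a fresh random train/test split on every iteration. The `Ridge` estimator and the synthetic data are placeholders chosen for the example, not anything taken from the repository.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic data stands in for real features and targets.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Ten independent random splits, each holding out 20% of the data for testing.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=splitter, scoring="r2")

# A small spread suggests the model does not depend on one particular split.
print("r2 per split:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```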

qzhu2017 commented 6 years ago

@yanxon Also, please rename the file to use the .yaml extension instead of .yml.

yanxon commented 6 years ago

The grid search does K-fold cross validation for us. Hence the name GridSearchCV.

For example, say you have these parameters: n_estimators = [1,2,3,4,5], leaf_size = [1,2,3,4,5], cv = 10. GridSearchCV actually does 5 x 5 x 10 = 250 calculations.

Please check out https://www.youtube.com/watch?v=Gol_qOgRqfA
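The count in the example above can be checked with `ParameterGrid`, which enumerates the same combinations that GridSearchCV evaluates. The parameter names are taken from the comment and are only used for counting here.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [1, 2, 3, 4, 5],
    "leaf_size": [1, 2, 3, 4, 5],
}
cv = 10

n_candidates = len(ParameterGrid(param_grid))  # 5 x 5 combinations
n_fits = n_candidates * cv                     # one fit per combination per fold
print(n_candidates, n_fits)  # 25 250
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best combination on the full data it was given.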

yanxon commented 6 years ago

I changed it to .yaml.

qzhu2017 commented 6 years ago

For the current stage, the yaml file looks good.

However, I am not quite sure about the use of cv in the GridSearchCV function.

If we use cv=10, it will run the calculation 10 times. Do you have the r2 or MAE output for each calculation? I suggest we don't select only the best result from the 10 calculations. We also need to provide some information about the variation of these r2/MAE values. It can tell us whether we can trust the ML models constructed by the medium set.

yanxon commented 6 years ago

Sounds good. I will implement that feature.

yanxon commented 6 years ago

I'm sure it outputs the r2 results for each CV calculation.
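A sketch of how the per-fold r2 and MAE values and their variation could be read from `cv_results_`. The estimator, the grid, and the synthetic data are assumptions made for the example. scikit-learn reports MAE as a negated score, so the sign is flipped below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for real features and targets.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
cv = 5

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20]},
    cv=cv,
    scoring={"r2": "r2", "mae": "neg_mean_absolute_error"},
    refit="r2",  # with several metrics, name the one used to pick the best model
)
search.fit(X, y)

results = search.cv_results_
best = search.best_index_

# One score per fold for the best parameter combination.
fold_r2 = np.array([results[f"split{i}_test_r2"][best] for i in range(cv)])
fold_mae = -np.array([results[f"split{i}_test_mae"][best] for i in range(cv)])

print("r2 :", fold_r2.mean(), "+/-", fold_r2.std())
print("MAE:", fold_mae.mean(), "+/-", fold_mae.std())
```

The same dictionary also holds `mean_test_r2` and `std_test_r2` for every combination, so the spread can be reported without recomputing it.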