qzhu2017 / PyXtal_ml

a Python3 library for ML modeling materials properties
MIT License

yaml file readability #17

Closed yanxon closed 6 years ago

yanxon commented 6 years ago

I would like to improve the readability of the yaml file.

qzhu2017 commented 6 years ago

@yanxon Please follow the approach described in README.md.

Ideally, the yaml file should give the default parameters and the grid search options for each algorithm.

For the code in method.py, I suggest you include all steps,

This way, we can call the ML method more conveniently without going into the details of each algorithm.

yanxon commented 6 years ago

@qzhu2017

GridSearchCV automatically does the cross validation and the fit for us.

Please review method.py. I will add more ML algorithms once the structure is OK. Otherwise, I would have to change a lot later if the structure turns out not to be right.

Here is the list of what I changed:

  1. 'light' doesn't have any parameters. It uses the default parameters defined by scikit-learn.
  2. 'medium' doesn't have any parameters either. It uses the default parameters defined by scikit-learn and performs cross validation through GridSearchCV.
  3. 'tight' has parameters, which are defined in the yaml file.

The params in the yaml file are for 'tight' only. I also defined 'cv' for K-fold cross validation.
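To make the three levels concrete, here is a minimal sketch of how they could map onto scikit-learn. The yaml layout, the `build_model` helper, and the choice of RandomForestRegressor are assumptions for illustration only; the actual method.py and yaml file may be organized differently.

```python
import yaml
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical yaml content; the real file in the repository may differ.
CONFIG = yaml.safe_load("""
RandomForest:
  cv: 5
  params:
    n_estimators: [10, 20]
    max_depth: [3, 5]
""")


def build_model(level, config=CONFIG):
    """Return an unfitted estimator for 'light', 'medium' or 'tight' (illustrative)."""
    base = RandomForestRegressor(random_state=0)
    settings = config["RandomForest"]
    if level == "light":
        # scikit-learn defaults, no search and no cross validation
        return base
    if level == "medium":
        # Empty grid: a single candidate with default parameters, still K-fold validated
        return GridSearchCV(base, param_grid={}, cv=settings["cv"])
    if level == "tight":
        # Full grid taken from the yaml file
        return GridSearchCV(base, param_grid=settings["params"], cv=settings["cv"])
    raise ValueError(f"unknown level: {level}")


# Synthetic data, only to show that the sketch runs end to end
X, y = make_regression(n_samples=80, n_features=5, noise=0.1, random_state=0)
model = build_model("tight").fit(X, y)
best_params = model.best_params_
print(best_params)
```

With this layout the caller only picks a level, and the yaml file stays the single place where grids and 'cv' are edited.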

I don't understand "process the features." Can you please explain?

We can use sklearn.model_selection.RandomizedSearchCV if GridSearchCV takes too long.
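As a rough sketch of that fallback: RandomizedSearchCV samples a fixed number of candidates (`n_iter`) instead of trying every combination. The estimator, the parameter ranges, and the synthetic data below are assumptions chosen for illustration, not what method.py uses.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only
X, y = make_regression(n_samples=80, n_features=5, noise=0.1, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(5, 30),      # integers in [5, 30)
        "min_samples_leaf": randint(1, 6),   # integers in [1, 6)
    },
    n_iter=6,        # evaluate 6 sampled candidates rather than a full grid
    cv=3,            # each candidate is still K-fold cross validated
    random_state=0,
)
search.fit(X, y)
n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

The cost is then controlled by `n_iter` times `cv` rather than by the size of the grid.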

Howard

qzhu2017 commented 6 years ago

@yanxon I am not sure if grid search does the cross validation for us. By cross validation, I mean splitting the data into a train set and a test set many times. That way, we ensure that the model does not depend on one particular split of the data.
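The repeated splitting described here can be sketched with ShuffleSplit and cross_val_score. The Ridge model and the synthetic data are placeholders picked for illustration; any regressor from method.py could take their place.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic data, for illustration only
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

# Ten independent random train/test splits, 20% held out each time
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# One r2 value per split; a small spread means the model does not
# depend strongly on how the data happened to be split
scores = cross_val_score(Ridge(), X, y, cv=splitter, scoring="r2")
print(f"r2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```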

qzhu2017 commented 6 years ago

@yanxon Also, please rename the file to .yaml instead of .yml.

yanxon commented 6 years ago

The grid search does K-Fold cross validation for us. Hence the name GridSearchCV.

For example, say you have the parameters n_estimators = [1,2,3,4,5], leaf_size = [1,2,3,4,5], and cv = 10. GridSearchCV then does 5 x 5 x 10 = 250 calculations.
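The count can be checked with ParameterGrid, which is the same enumeration GridSearchCV uses for a dict grid. This only counts combinations; it does not check that both parameter names belong to one estimator.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [1, 2, 3, 4, 5],
    "leaf_size": [1, 2, 3, 4, 5],
}
cv = 10

n_candidates = len(ParameterGrid(param_grid))  # 5 x 5 combinations
n_fits = n_candidates * cv                     # each combination is fit once per fold
print(n_candidates, n_fits)  # 25 250
```

With the default `refit=True`, GridSearchCV also does one extra fit of the best combination on the full training data.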

Please check out https://www.youtube.com/watch?v=Gol_qOgRqfA

yanxon commented 6 years ago

I changed to .yaml.

qzhu2017 commented 6 years ago

For the current stage, the yaml file looks good.

However, I am not quite sure about the use of cv in the GridSearchCV function.

If we use cv=10, it will run the calculation 10 times. Do you have the output of r2 or MAE for each calculation? I suggest we don't just select the best result from the 10 calculations. We also need to provide some information about the variation of these r2/MAE values. They can tell us whether we can trust the ML models constructed by the medium set.

yanxon commented 6 years ago

Sounds good. I will implement that feature.

yanxon commented 6 years ago

I'm sure it outputs the r2 result for each CV calculation.
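For reference, a sketch of how the per-fold r2 and MAE values can be read from `cv_results_` after a fit. The estimator, the grid, and the data are assumptions for illustration; the `split{i}_test_<name>` keys follow from the names given in the `scoring` dict.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

cv = 5
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20]},
    cv=cv,
    # Two metrics; scikit-learn reports MAE as a negative score
    scoring={"r2": "r2", "mae": "neg_mean_absolute_error"},
    refit="r2",  # choose the best candidate by r2
)
search.fit(X, y)

best = search.best_index_
results = search.cv_results_

# One value per fold for the best candidate
r2_per_fold = np.array([results[f"split{i}_test_r2"][best] for i in range(cv)])
mae_per_fold = -np.array([results[f"split{i}_test_mae"][best] for i in range(cv)])

print(f"r2  = {r2_per_fold.mean():.3f} +/- {r2_per_fold.std():.3f}")
print(f"MAE = {mae_per_fold.mean():.3f} +/- {mae_per_fold.std():.3f}")
```

The same mean and spread are also available directly as `mean_test_r2` and `std_test_r2` in `cv_results_`.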