Status: Closed. WardLT closed this 5 years ago.
Thanks @WardLT, this is awesome. I am surprised MLI beats MEI on average; from what I know, typically EI is the most effective acquisition function.
Might be interesting to compare with a Gaussian-Process-based estimator (unless they already did that in the paper or something)?
I think there's a nomenclature issue regarding MLI. E(I) is the notation used by Lookman's group to describe the "expected improvement" probability, but the MEI used by Ling is the Maximum Expected Improvement, which corresponds to the pure-exploitation strategy (Max) in Lookman group publications. TL;DR: MLI and E(I) are actually the same, not MEI and EI.
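For anyone following along, here's a rough sketch of how the three strategies map onto a model's predicted means and uncertainties. The function name and dictionary keys are my own, not from either paper; this is just to pin down the terminology, assuming a maximization problem:

```python
import numpy as np
from scipy.stats import norm


def acquisition_scores(y_mean, y_std, y_best):
    """Score candidates for maximization from predicted means and stds.

    y_mean, y_std : arrays of model predictions and uncertainties
    y_best : best objective value observed so far
    """
    z = (y_mean - y_best) / y_std
    return {
        # Ling's MEI / Lookman's "Max": greedy, pick the highest prediction
        'mei': y_mean,
        # Ling's MLI / Lookman's E(I) / "PI" in the black-box literature:
        # probability that a candidate improves on y_best
        'mli': norm.cdf(z),
        # "EI" in the black-box literature: expected magnitude of improvement
        'ei': y_std * (z * norm.cdf(z) + norm.pdf(z)),
    }
```

So MEI ignores the uncertainties entirely, while MLI and EI both trade them off against the predicted mean.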
To make the notebook match the paper, I used the MLI notation. Should I add in a note about that into the notebook?
Regarding Gaussian Processes: I'm currently working with Citrine to compare their RF-based uncertainties to GPR and other techniques. I don't have anything yet, but could add a note to this notebook pointing to the more detailed study once I'm done.
Oh ok, thanks for the clarification. In the wider black box optimization literature, I've seen E(I) is more commonly just abbreviated EI (and likelihood of improvement as PI, probability of improvement). I had not previously heard of these "maximum" strategies in the black box literature, iirc they'd usually just call it "greedy".
Yeah, a note for clarification might help.
Regarding uncertainties, I have found bootstrapped uncertainty estimates to be generally inferior to GP uncertainty estimates if the number of bootstraps isn't pretty big (500-1000 resamples). Although when comparing RF to GP predictive accuracy (i.e., the mean bootstrapped RF predictions vs GP predictions), RF does pretty well. Then again, don't quote me on this cuz that is just my subjective experience in a very limited problem space.
On another note, we actually have an adaptive learning project, although it is more intended for distributed problems.
Sounds good. I'll add in some clarification around the terminology in this notebook.
Good to know about your experience with bootstrapped confidence intervals. We are finding something similar: a large number of resamples is needed to get reasonable CIs. Max Hutchinson and I are working on gathering more evidence for that and doing some comparison to GP. I'll keep you in the loop, and would like to get your thoughts once the work is more mature.
Also, thanks for pointing out RocketSled! I'll have to sync up with you on that at some point, especially because my group has been playing around with a Python library for active learning algorithms: https://github.com/WardLT/active-learning/tree/overhaul (still a very rough code)
@ardunn can you merge this one in when it's ready?
(after the minor adjustments from @WardLT )
@computron Yes will do
Alright, I've added clarification that MEI != EI.
@WardLT I think there is an opportunity to demonstrate using a ConversionFeaturizer in this nb, if you are interested :)
Could:
from pymatgen.core import Composition

def get_composition(c):
    """Attempt to parse composition, return None if failed"""
    try:
        return Composition(c)
    except Exception:
        return None

data['composition'] = data['chemicalFormula'].apply(get_composition)
Change to:
from matminer.featurizers.conversion import StrToComposition
stc = StrToComposition(target_col_id='composition')
data = stc.featurize_dataframe(data, "chemicalFormula")
?
If not, I'll just merge as is tho
Thanks for pointing that out! I still haven't fully acquainted myself with the conversion featurizers...
Alright, I've made that change
Added a notebook showing how to use matminer with lolopy to do active learning.