Status: Closed. WardLT closed this 5 years ago.
Thanks @WardLT, this is awesome. I am surprised MLI beats MEI on average; from what I know, typically EI is the most effective acquisition function.
Might be interesting to compare with a Gaussian-Process-based estimator (unless they already did that in the paper or something)?
I think there's a nomenclature issue regarding MLI. E(I) is the notation used by Lookman's group to describe the "expected improvement" probability, but the MEI used by Ling is the Maximum Expected Improvement, which corresponds to the pure-exploitation strategy (Max) in Lookman group publications. TL;DR: MLI and E(I) are actually the same, not MEI and EI.
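For anyone following along, here's a rough sketch of how the three strategies map onto a model's predicted means and uncertainties. The function name and dictionary keys are my own, not from either paper; this is just to pin down the terminology, assuming a maximization problem:

```python
import numpy as np
from scipy.stats import norm


def acquisition_scores(y_mean, y_std, y_best):
    """Score candidates for maximization from predicted means and stds.

    y_mean, y_std : arrays of model predictions and uncertainties
    y_best : best objective value observed so far
    """
    z = (y_mean - y_best) / y_std
    return {
        # Ling's MEI / Lookman's "Max": greedy, pick the highest prediction
        'mei': y_mean,
        # Ling's MLI / Lookman's E(I) / "PI" in the black-box literature:
        # probability that a candidate improves on y_best
        'mli': norm.cdf(z),
        # "EI" in the black-box literature: expected magnitude of improvement
        'ei': y_std * (z * norm.cdf(z) + norm.pdf(z)),
    }
```

So MEI ignores the uncertainties entirely, while MLI and EI both trade them off against the predicted mean.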
To make the notebook match the paper, I used the MLI notation. Should I add in a note about that into the notebook?
Regarding Gaussian Processes: I'm currently working with Citrine to compare their RF-based uncertainties to GPR and other techniques. I don't have anything yet, but could add a note to this notebook pointing to the more detailed study once I'm done.
Oh ok, thanks for the clarification. In the wider black box optimization literature, I've seen E(I) is more commonly just abbreviated EI (and likelihood of improvement as PI, probability of improvement). I had not previously heard of these "maximum" strategies in the black box literature, iirc they'd usually just call it "greedy".
Yeah, a note for clarification might help.
Regarding uncertainties, I have found bootstrapped uncertainty estimates to be generally inferior to GP uncertainty estimates if the number of bootstraps isn't pretty big (500-1000 resamples). Although when comparing RF to GP predictive accuracy (i.e., the mean bootstrapped RF predictions vs GP predictions), RF does pretty well. Then again, don't quote me on this cuz that is just my subjective experience in a very limited problem space.
On another note, we actually have an adaptive learning project, although it is more intended for distributed problems.
Sounds good. I'll add in some clarification around the terminology in this notebook.
Good to know about your experience with bootstrapped confidence intervals. We are finding something similar: a large number of resamples is needed to get reasonable CIs. Max Hutchinson and I are working on gathering more evidence for that and doing some comparison to GP. I'll keep you in the loop, and would like to get your thoughts once the work is more mature.
Also, thanks for pointing out RocketSled! I'll have to sync up with you on that at some point, especially because my group has been playing around with a Python library for active learning algorithms: https://github.com/WardLT/active-learning/tree/overhaul (still a very rough code)
@ardunn can you merge this one in when it's ready?
(after the minor adjustments from @WardLT )
@computron Yes will do
Alright, I've added clarification that MEI != EI.
@WardLT I think there is an opportunity to demonstrate using a ConversionFeaturizer in this nb, if you are interested :)
Could:
from pymatgen.core import Composition

def get_composition(c):
    """Attempt to parse composition, return None if failed"""
    try:
        return Composition(c)
    except Exception:
        return None

data['composition'] = data['chemicalFormula'].apply(get_composition)
Change to:
from matminer.featurizers.conversion import StrToComposition
stc = StrToComposition(target_col_id='composition')
data = stc.featurize_dataframe(data, "chemicalFormula")
?
If not, I'll just merge as is tho
Thanks for pointing that out! I still haven't fully acquainted myself with the conversion featurizers...
Alright, I've made that change
Added a notebook showing how to use matminer with lolopy to do active learning.