modAL-python / modAL

A modular active learning framework for Python
https://modAL-python.github.io/
MIT License

Extend modAL to pytorch models #39

Open damienlancry opened 5 years ago

damienlancry commented 5 years ago

Hi

I am a research assistant and I have been working on deep Bayesian active learning for the past few weeks. I have been using PyTorch and custom active learning classes so far, and I just found out about modAL; it seems very cool. That is why I was wondering whether it could be extended to PyTorch models. I would be glad to contribute.

More specifically, I am using dropout-based Bayesian neural networks and Monte Carlo sampling to compute the predictive variance. I am quite new to active learning, but I believe deep Bayesian active learning is very close to query by committee, in the sense that for every x in the unlabeled pool there are N feedforward passes of x through a committee of N networks sampled from the posterior distribution over the weights of the Bayesian network.

I also experimented with some query strategies for classifiers mentioned in the active learning survey by Burr Settles that I think are not implemented in modAL yet, and I would be glad to contribute on this side too. I am thinking about the Gini index of the votes, the Gini index of the consensus, the least confident vote, and the least confident consensus. (In my experiments they perform as well as vote entropy and consensus entropy.)
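
As a rough illustration, a Gini-index disagreement measure over the committee votes could look something like this (just a sketch operating on an array of vote proportions, not modAL's existing API):

    import numpy as np

    def gini_index_of_votes(vote_proportions):
        # vote_proportions: array of shape (n_samples, n_classes), each row
        # holding the fraction of committee members voting for each class.
        # Returns one score per sample; higher means more disagreement.
        return 1.0 - np.sum(vote_proportions ** 2, axis=1)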

damienlancry commented 5 years ago

After reading part of the code, I realize that deep Bayesian active learning is not directly applicable with modAL, since the Committee class requires a list of learners that are trained separately (correct me if I am wrong). If I am not wrong, then I would also be glad to extend modAL to deep Bayesian models.

damienlancry commented 5 years ago

Also, I just discovered skorch, a scikit-learn compatible wrapper for PyTorch that would be useful for extending modAL to PyTorch.

cosmic-cortex commented 5 years ago

Hi!

Sorry for the late answer, I was not available until now.

1) Indeed, PyTorch models would work with skorch in theory, although it is not extensively tested. It would be nice to add a test module for skorch compatibility, so this would be continuously integrated. If you would like to contribute to this, I would be very happy to help! Let me know if this is the case.

2) Regarding the Bayesian deep learning: I am not completely familiar with the topic, but my take would be the following. Instead of using the Committee class, you can use the regular ActiveLearner class and write a custom query function, tailored directly to your deep learning method. Connected to this, I am also thinking about rewriting the Committee class itself: instead of requiring a list of learners, a better design choice would be a class with predict, predict_proba, vote, and vote_proba methods; this new class could then be used with the ActiveLearner class.

So, bottom line, I believe that deep Bayesian models can be used, just not with the current Committee class. I would highly encourage you to take a shot at this! If you can provide me some details regarding your models and implementations, I am happy to help!

damienlancry commented 5 years ago

Hi, sorry for my late reply as well.

I think your implementation is good, because the general committee-based active learning framework described in the literature prescribes training an ensemble of classifiers separately. On the other hand, to the best of my knowledge, the literature on deep Bayesian active learning is not so rich, and the comparison I made between deep Bayesian active learning and committee-based active learning is maybe not as straightforward, but the few papers I found use similar acquisition functions.

Maybe your Committee class and a new BayesianLearner class should both inherit from an abstract class and share some methods?
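
Something like this, just to sketch the idea (the class and method names here are made up):

    from abc import ABC, abstractmethod

    class BaseEnsembleLearner(ABC):
        # hypothetical shared interface for a classical Committee
        # and a dropout-based BayesianLearner

        @abstractmethod
        def predict(self, X):
            ...

        @abstractmethod
        def predict_proba(self, X):
            ...

        @abstractmethod
        def vote(self, X):
            ...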

In my custom classes, I have a BayesianLearner class with an acquire method that looks like this:

    def acquire(self, nb_samples, nb_acquired, acquisition_fcn):
        # nb_samples stochastic forward passes over the pool
        outputs = self._get_outputs(nb_samples=nb_samples)
        # one informativeness score per pool point
        acquisition = acquisition_fcn(outputs)
        # indices of the nb_acquired highest-scoring points
        idx_most_informative = (-acquisition).argsort()[:nb_acquired]
        x_most_informative = self.X_pool[idx_most_informative].reshape((nb_acquired, -1))
        y_most_informative = self.Y_pool[idx_most_informative].reshape((nb_acquired, -1))
        # move the selected points from the pool to the training set
        self.X = np.append(self.X, x_most_informative, axis=0)
        self.Y = np.append(self.Y, y_most_informative, axis=0)
        self.X_pool = np.delete(self.X_pool, idx_most_informative, axis=0)
        self.Y_pool = np.delete(self.Y_pool, idx_most_informative, axis=0)

Basically, in _get_outputs we sample nb_samples sets of weights from the posterior distribution of the Bayesian neural network, make a feedforward pass of X_pool through each of these sampled networks, and store all the results in outputs (a tensor typically of shape (X_pool.shape[0], nb_samples) if the network only has one output neuron). We then evaluate acquisition_fcn on outputs and store the result in a tensor acquisition of shape (X_pool.shape[0],).

Finally, the most informative data points are removed from X_pool and Y_pool and added to X and Y.
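
For concreteness, here is a minimal PyTorch sketch of what such a _get_outputs does with MC dropout (a simplified illustration, not my exact implementation; the model is assumed to be an nn.Module containing dropout layers):

    import numpy as np
    import torch

    def mc_dropout_outputs(model, X_pool, nb_samples):
        # keep dropout active at prediction time by staying in train mode
        model.train()
        X = torch.as_tensor(X_pool, dtype=torch.float32)
        with torch.no_grad():
            # one forward pass per sampled dropout mask (set of weights)
            samples = [model(X) for _ in range(nb_samples)]
        # shape (X_pool.shape[0], nb_samples, n_outputs); squeeze the last
        # dimension when the network has a single output neuron
        return torch.stack(samples, dim=1).numpy()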

cosmic-cortex commented 5 years ago

Regarding the Committee class, I am thinking about replacing it completely by using estimators from the mlens package. There are some technical details to be worked out, but overall they seem to provide a much more general solution for ensembles of learners.

It seems to me that the acquire function you wrote would work as a query strategy, except for the class-specific parts like removing instances from the pool, etc. (this is handled internally by modAL objects).
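
For example, stripped of the pool bookkeeping, it might look roughly like this (a sketch only; the mc_dropout_outputs helper is the hypothetical one from your sketch above, and the returned tuple follows modAL's query strategy convention of indices plus selected instances):

    import numpy as np

    def bayesian_query_strategy(learner, X_pool, nb_samples=100, nb_acquired=10,
                                acquisition_fcn=None):
        # stochastic forward passes over the pool
        outputs = mc_dropout_outputs(learner.estimator, X_pool, nb_samples)
        # one informativeness score per pool point
        acquisition = acquisition_fcn(outputs)
        # indices of the nb_acquired highest-scoring points; modAL takes care
        # of teaching the learner and updating the pool afterwards
        query_idx = (-acquisition).argsort()[:nb_acquired]
        return query_idx, X_pool[query_idx]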

damienlancry commented 5 years ago

I am now using skorch and modAL and it works fine! If you could give me directions on writing a test module somewhere, I would gladly do that.
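
Roughly, the setup looks like this (the network, hyperparameters, and data below are just placeholders):

    import numpy as np
    import torch.nn as nn
    from skorch import NeuralNetClassifier
    from modAL.models import ActiveLearner

    # placeholder network; any torch.nn.Module works
    class SimpleClassifier(nn.Module):
        def __init__(self, n_features=784, n_classes=10):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(n_features, 128),
                nn.ReLU(),
                nn.Dropout(0.5),
                nn.Linear(128, n_classes),
            )

        def forward(self, x):
            return self.layers(x)

    # skorch makes the PyTorch model behave like a scikit-learn estimator,
    # so it can be passed to modAL directly
    classifier = NeuralNetClassifier(SimpleClassifier,
                                     criterion=nn.CrossEntropyLoss,
                                     max_epochs=10,
                                     lr=0.01,
                                     verbose=0)

    # dummy data just to make the example self-contained
    X_initial = np.random.rand(100, 784).astype(np.float32)
    y_initial = np.random.randint(0, 10, 100).astype(np.int64)
    X_pool = np.random.rand(1000, 784).astype(np.float32)

    learner = ActiveLearner(estimator=classifier,
                            X_training=X_initial, y_training=y_initial)
    query_idx, query_instances = learner.query(X_pool)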

cosmic-cortex commented 5 years ago

For this purpose, I propose starting with a runnable example (like the ones here), which can be used to create a Jupyter notebook tutorial for the website and can also be added to the tests later.

damienlancry commented 5 years ago

I opened pull request #44 with a runnable example, but it seems I am not authorized to contribute.

damienlancry commented 5 years ago

Thanks for giving me the opportunity to contribute. I like modAL, and if I can help further in any way, please feel free to contact me!

cosmic-cortex commented 5 years ago

Thank you for your help! Sure, there are lots of planned features you can help with! I plan to move modAL firmly towards deep learning, and Bayesian deep learning more specifically. Since you are an expert on Bayesian deep learning, can you suggest some algorithms and papers where I could start? Which methods are the baselines and which are the most reliable algorithms? Of course, if you have time, I would also be happy to work together on implementing these in modAL.

damienlancry commented 5 years ago

I would recommend this paper: Deep Bayesian Active Learning with Image Data. I tried to reproduce their results using modAL and it's not working so far, but I think it would be cool to add an example of deep Bayesian active learning to modAL!

cosmic-cortex commented 5 years ago

I have read this paper, but I don't understand how the acquisition procedure changes with the dropout approximation. Can you perhaps explain it briefly? Let's assume that we are using the max entropy function. How does it differ from plugging in the prediction uncertainties directly?

damienlancry commented 5 years ago

The difference is that with a deterministic neural network the entropy would be computed from the point-estimate probabilities:

$$\mathbb{H}[y \mid x, \mathcal{D}_{\text{train}}] = -\sum_{c} p(y = c \mid x, \hat{\omega}) \log p(y = c \mid x, \hat{\omega})$$

With the dropout approximation, the probabilities produced by the stochastic network are averaged over the weights; they are no longer point estimates, so they are less prone to being overconfident on misclassified examples. The integration is intractable, so we perform a Monte Carlo integration by making several passes through the network with the dropout layers kept active and averaging the outputs, hence the name MC dropout. The entropy is now:

$$\mathbb{H}[y \mid x, \mathcal{D}_{\text{train}}] \approx -\sum_{c} \left( \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, \hat{\omega}_t) \right) \log \left( \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, \hat{\omega}_t) \right)$$

with T the number of passes through the network and $\hat{\omega}_t$ the weights sampled at pass t. The second figure of the paper shows an improvement compared to the approach with the deterministic CNN.
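
In code, the only difference is where the averaging happens; a minimal numpy sketch of the two cases:

    import numpy as np

    def deterministic_entropy(single_proba):
        # single_proba: (n_pool, n_classes), one deterministic forward pass
        return -np.sum(single_proba * np.log(single_proba + 1e-12), axis=1)

    def mc_dropout_entropy(probas):
        # probas: (T, n_pool, n_classes), one set of probabilities per
        # stochastic forward pass with dropout kept active
        mean_proba = probas.mean(axis=0)  # average over the T passes first
        return -np.sum(mean_proba * np.log(mean_proba + 1e-12), axis=1)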

cosmic-cortex commented 5 years ago

So if I understand correctly, this means that there is a uniform prior distribution on the weights, right?

damienlancry commented 5 years ago

There is a Bernoulli prior distribution over the weights. The weights are initialized randomly following Xavier (Glorot) or another initialization technique, which gives a Dirac distribution (point estimate), and then the prior distribution becomes Bernoulli because of the dropout.