Question regarding the MOPO algorithm

Hi @takuseno, first of all, thanks for the great work.

I have a question regarding the MOPO algorithm, specifically about ProbabilisticEnsembleDynamics.

In the original paper, the authors state:

Across all domains, we train an ensemble of 7 models and pick the best 5 models based on their prediction error on a hold-out set of 1000 transitions in the offline dataset. Each of the model in the ensemble is parametrized as a 4-layer feedforward neural network with 200 hidden units and after the last hidden layer, the model outputs the mean and variance using a two-head architecture. Spectral normalization is applied to all layers except the head that outputs the model variance.

To reproduce the paper, I am starting from your example in the docs:

from d3rlpy.datasets import get_pendulum
from d3rlpy.dynamics import ProbabilisticEnsembleDynamics
from d3rlpy.metrics.scorer import dynamics_observation_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_reward_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_prediction_variance_scorer
from sklearn.model_selection import train_test_split

dataset, _ = get_pendulum()

train_episodes, test_episodes = train_test_split(dataset)

dynamics = ProbabilisticEnsembleDynamics(learning_rate=1e-4, use_gpu=True)

# fit() has the same interface as the algorithms
dynamics.fit(train_episodes,
             eval_episodes=test_episodes,
             n_epochs=100,
             scorers={
                'observation_error': dynamics_observation_prediction_error_scorer,
                'reward_error': dynamics_reward_prediction_error_scorer,
                'variance': dynamics_prediction_variance_scorer,
             })

from d3rlpy.algos import MOPO

# load trained dynamics model
dynamics = ProbabilisticEnsembleDynamics.from_json('<path-to-params.json>/params.json')
dynamics.load_model('<path-to-model>/model_xx.pt')

# pass the trained dynamics model to MOPO via the dynamics argument
mopo = MOPO(dynamics=dynamics)

For the models, I assume it is only necessary to provide an appropriate encoder_factory instead of the default one.
However, with respect to the ensemble, is there a particular reason why you implemented it without the 'pick the best k models out of N models' step (e.g. training 5 models and using all of them instead of taking the best 5 out of 7)?
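
To make the question concrete, here is a minimal sketch of the selection step I have in mind. It is plain NumPy and does not use any d3rlpy API: the function name select_elite_models and the error values are hypothetical, and I am assuming each ensemble member has already been scored by its mean prediction error on the hold-out transitions.

```python
import numpy as np


def select_elite_models(holdout_errors, n_elites):
    """Return the indices of the n_elites models with the lowest hold-out error.

    holdout_errors: one scalar per ensemble member (hypothetical input, e.g.
    the mean squared prediction error on the held-out transitions).
    """
    errors = np.asarray(holdout_errors, dtype=float)
    if errors.ndim != 1:
        raise ValueError("holdout_errors must be a 1-D sequence")
    if not 0 < n_elites <= errors.shape[0]:
        raise ValueError("n_elites must be between 1 and the ensemble size")
    # argsort is ascending, so the first n_elites indices are the best models;
    # a stable sort keeps the original order among ties
    return np.argsort(errors, kind="stable")[:n_elites].tolist()


# made-up errors for an ensemble of 7 models, purely for illustration
holdout_errors = [0.12, 0.08, 0.31, 0.10, 0.09, 0.27, 0.11]
elite_indices = select_elite_models(holdout_errors, n_elites=5)
print(elite_indices)  # [1, 4, 3, 6, 0]
```

Rollouts would then sample only from the models in elite_indices, which is how I read the '7 trained, best 5 kept' description in the paper.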

Am I missing something, or could this be a feature to work on?
