Closed: hn2 closed this issue 3 years ago.
@hn2 Thanks for the issue. Let me clarify: you generated the dataset with a trained, fixed policy and used that dataset to train offline RL algorithms. You then observed that online algorithms such as SAC perform best even in the offline RL setting (not the ones trained online). Is this correct? And could you describe the task more specifically (e.g. Hopper-v2)?
I believe this is not a bug, since even online algorithms perform well on some datasets.
This is my own custom env. It uses historical trading-asset data to try to construct an optimal portfolio for trading. I am using stable-baselines for the online algorithms (TD3, SAC). I then trained all the offline algorithms ('BCQ', 'BEAR', 'AWR', 'CQL', 'AWAC', 'CRR', 'PLAS', 'MOPO', 'COMBO') for 100 epochs. None of them performed better than TD3 or SAC. Are there any hyperparameters to tune? Is it possible to change the topology of the neural network, for example? It is not a bug, since your code works, but it does not do any better than the online models. I am trying to understand how to work with it and potentially how to tune it. Here are some excerpts from my code:
```python
model = v_online_class.load(v_online_model_file_name)
dataset = to_mdp_dataset(model.replay_buffer)

v_offline_model = offline_class(use_gpu=torch.cuda.is_available())
v_offline_model.fit(
    dataset.episodes,
    n_epochs=n_epoches,
    experiment_name='experiment1',
    logdir=v_offline_model_dir,
)
v_offline_model.save_model(fname=v_offline_model_file_name)
```
Are you comparing the result of offline training with the result of online training? If so, the offline RL algorithms will generally not outperform the online RL algorithms, since online RL can keep interacting with the environment; online RL is basically performing the best. Offline RL fits the case where online interaction is not feasible.
Also, there are of course a number of hyperparameters you can tune. Please see the documentation: https://d3rlpy.readthedocs.io/en/v0.90/references/algos.html
Isn't my problem a good fit for offline algorithms? I basically have offline historical data for asset pricing and want to train an agent to trade the best possible portfolio.
It sounds like an offline RL problem. But you're comparing the online RL agent trained online with the offline RL agent trained offline, right? I believe this comparison is not the right direction.
What is the right direction, then?
I guess what you can do is compare the performance of the offline RL algorithms against each other. Sorry, this kind of consulting is not an issue with d3rlpy itself. If you don't have any technical problems with this software, I'll close this issue.
I tried generating replay buffers using TD3 and SAC, and I tried all the available offline algorithms. None gave me better results than the online algorithms. Perhaps I am doing something wrong. Can you help? What are the most important hyperparameters that can affect the results?