opendilab / DI-engine

OpenDILab Decision AI Engine. The Most Comprehensive Reinforcement Learning Framework
https://di-engine-docs.readthedocs.io
Apache License 2.0

Trading deploy - issues when trying to process a single window #727

Closed prowgrammmer closed 1 year ago

prowgrammmer commented 1 year ago

So I have the following script:

# import gym
import torch
from easydict import EasyDict
from ding.config import compile_config
# from ding.envs import DingEnvWrapper
from ding.policy import single_env_forward_wrapper, DQNPolicy
from ding.model import DQN
from dizoo.gym_anytrading.envs import StocksEnv

stocks_dqn_config = dict(
    exp_name='stocks_dqn_seed0',
    env=dict(
        # Env number respectively for collector and evaluator.
        collector_env_num=8,
        evaluator_env_num=8,
        env_id='stocks-v0',
        n_evaluator_episode=8,
        stop_value=2,
        # episode length in env steps; set to 1 here to process a single window.
        eps_length=1,
        # associated with the feature length.
        window_size=20,
        # the path to save result image.
        save_path='./fig/',
        # the raw data file name
        stocks_data_filename='STOCKS_GOOGL_deploy',
        # the stocks range percentage used by train/test.
        # if one of them is None, train & test set will use all data by default.
        train_range=None,
        test_range=None,
    ),
    policy=dict(
        # Whether to use cuda for network.
        cuda=True,
        model=dict(
            obs_shape=62,
            action_shape=5,
            encoder_hidden_size_list=[128],
            head_layer_num=1,
            # Whether to use dueling head.
            dueling=True,
        ),
        # Reward's future discount factor, aka. gamma.
        discount_factor=0.99,
        # How many steps in td error.
        nstep=5,
        # learn_mode config
        learn=dict(
            update_per_collect=10,
            batch_size=64,
            learning_rate=0.001,
            # Frequency of target network update.
            target_update_freq=100,
            ignore_done=True,
        ),
        # collect_mode config
        collect=dict(
            # You can use either "n_sample" or "n_episode" in collector.collect.
            # Get "n_sample" samples per collect.
            n_sample=64,
            # Cut trajectories into pieces with length "unroll_len".
            unroll_len=1,
        ),
        # command_mode config
        other=dict(
            # Epsilon greedy with decay.
            eps=dict(
                # Decay type. Support ['exp', 'linear'].
                type='exp',
                start=0.95,
                end=0.1,
                decay=50000,
            ),
            replay_buffer=dict(replay_buffer_size=100000, )
        ),
    ),
)
env_config = stocks_dqn_config["env"]
stocks_dqn_config = EasyDict(stocks_dqn_config)
main_config = stocks_dqn_config

stocks_dqn_create_config = dict(
    env=dict(
        type='stocks-v0',
        import_names=['dizoo.gym_anytrading.envs.stocks_env'],
    ),
    env_manager=dict(type='base'),
    policy=dict(
        type='dqn',
    ),
    evaluator=dict(
        type='trading_interaction',
        import_names=['dizoo.gym_anytrading.worker'],
        ),
)
stocks_dqn_create_config = EasyDict(stocks_dqn_create_config)
create_config = stocks_dqn_create_config

def main(main_config: EasyDict, create_config: EasyDict, ckpt_path: str):
    main_config.exp_name = 'stocks_dqn_deploy'
    cfg = compile_config(main_config, create_cfg=create_config, auto=True)

    # env = DingEnvWrapper(gym.make('stocks-v0'), EasyDict(env_wrapper='default'))
    env = StocksEnv(EasyDict(env_config))

    model = DQN(**cfg.policy.model)
    state_dict = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(state_dict['model'])
    policy = DQNPolicy(cfg.policy, model=model).eval_mode
    forward_fn = single_env_forward_wrapper(policy.forward)

    obs = env.reset()
    returns = 0.
    counter = 0
    while True:
        counter += 1
        action = forward_fn(obs)
        print(action)
        obs, rew, done, info = env.step(action)
        # print(obs, rew, done, info)
        returns += rew
        if done:
            break
    print(f'Deploy is finished, final episode return is: {returns}')
    # print(counter)

if __name__ == "__main__":
    main(main_config, create_config, 'C:/Users/user/anaconda3/envs/py38/Lib/site-packages/dizoo/gym_anytrading/config/stocks_dqn_seed0_230910_115331/ckpt/ckpt_best.pth.tar')

I've set eps_length to 1 because I only want to process a single window to get the corresponding trade action, and I've limited 'STOCKS_GOOGL_deploy' to 20 rows since the window_size is 20.

I get the following error:

Traceback (most recent call last):
  File ~\anaconda3\envs\py38\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)
  File c:\users\user\documents\python scripts\rl_trading\trading_deploy.py:132
    main(main_config, create_config, 'C:/Users/user/anaconda3/envs/py38/Lib/site-packages/dizoo/gym_anytrading/config/stocks_dqn_seed0_230910_115331/ckpt/ckpt_best.pth.tar')
  File c:\users\user\documents\python scripts\rl_trading\trading_deploy.py:115 in main
    obs = env.reset()
  File ~\anaconda3\envs\py38\lib\site-packages\dizoo\gym_anytrading\envs\trading_env.py:137 in reset
    self.prices, self.signal_features, self.feature_dim_len = self._process_data(start_idx)
  File ~\anaconda3\envs\py38\lib\site-packages\dizoo\gym_anytrading\envs\stocks_env.py:75 in _process_data
    self.start_idx = np.random.randint(self.window_size, len(self.df) - self._cfg.eps_length)
  File mtrand.pyx:763 in numpy.random.mtrand.RandomState.randint
  File _bounded_integers.pyx:1338 in numpy.random._bounded_integers._rand_int32
ValueError: low >= high

The issue comes from this line:

self.start_idx = np.random.randint(self.window_size, len(self.df) - self._cfg.eps_length)

I've debugged the values and they are as follows:

self.window_size = 20
len(self.df) = 21 (shouldn't it be 20, since there are only 20 rows in my "STOCKS_GOOGL_deploy"? why is it one larger?)
eps_length = 1

So the problem is that the low bound (window_size = 20) is never less than the high bound (len(self.df) - eps_length = 20), hence the error. I'm not sure what exactly the logic behind this is, but what would be the proper way of processing a single window to get its corresponding action? Preferably without having to reload the model each time, so that I can keep feeding in windows as my price data comes in.
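For reference, the failing call reduces to this minimal reproduction with the values above:

import numpy as np

# low = window_size = 20; high = len(df) - eps_length = 21 - 1 = 20
np.random.randint(20, 20)  # raises ValueError: low >= high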

Cloud-Pku commented 1 year ago

You found a bug; self.start_idx should be calculated like this:

self.start_idx = np.random.randint(self.window_size - 1, len(self.df) - self._cfg.eps_length)

I will fix it later. But regarding len(self.df), my test result shows it equals 20. Could you paste the content of "STOCKS_GOOGL_deploy.csv" here?

And if you want window_size = 20 and eps_length = 1, then the data must have at least 21 points, because the timestamp needs to advance one step.
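To make that concrete, here is the arithmetic implied by the fixed sampling line, written out as a quick sanity check:

import numpy as np

window_size, eps_length = 20, 1
# np.random.randint(low, high) requires low < high, so we need
#   window_size - 1 < len(df) - eps_length
# which for integer lengths means
#   len(df) >= window_size + eps_length
min_rows = window_size + eps_length  # 21 rows, matching the comment above
start_idx = np.random.randint(window_size - 1, min_rows - eps_length)  # always 19 here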

prowgrammmer commented 1 year ago

OK, so now with a dataframe of 21 rows I get self.window_size - 1 = 20 and len(self.df) - self._cfg.eps_length = 20, but np.random.randint(20, 20) still gives me the same error.

And I still wonder if there's a better way of doing what I'm trying to do, so that I can feed in single windows and get the corresponding actions without having to reload the whole model?

Cloud-Pku commented 1 year ago

Is it supposed to be window_size = 21 this time? I think what you need here is the example below: window_size = 20, eps_length = 1, len(self.df) = 21. Then self.window_size - 1 = 19 and len(self.df) - self._cfg.eps_length = 20, so start_idx = np.random.randint(19, 20).

prowgrammmer commented 1 year ago

Ah, there was a debugging issue on my end; I've got it working now.

So then is there a better way of feeding single windows to get the corresponding actions without having to reload the whole model like I'm doing currently?

Cloud-Pku commented 1 year ago

Maybe you need to modify the TradingEnv class so that it can handle a single window. You can write a function, and I think most of the logic in it should be consistent with env.step. Besides, a nonzero reward is given if and only if the following situation occurs: [image: snippet of the reward-calculation condition]. So the env will step at least twice.
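For the "feed single windows without reloading" part, a minimal sketch under some assumptions: the policy is loaded once and reused, cfg and ckpt_path come from your script above, and build_obs / incoming_windows are hypothetical stand-ins for however you turn a raw price window into the 62-dim observation (the real feature processing lives in StocksEnv._process_data):

import torch
from ding.model import DQN
from ding.policy import DQNPolicy, single_env_forward_wrapper

def make_actor(cfg, ckpt_path):
    # Build the model and wrap the policy once, exactly as in the script above.
    model = DQN(**cfg.policy.model)
    state_dict = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(state_dict['model'])
    policy = DQNPolicy(cfg.policy, model=model).eval_mode
    return single_env_forward_wrapper(policy.forward)

forward_fn = make_actor(cfg, ckpt_path)   # load the model once at startup
for window in incoming_windows:           # hypothetical stream of price windows
    obs = build_obs(window)               # hypothetical: mirror StocksEnv._process_data
    action = forward_fn(obs)              # one action per window, no env.reset() needed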

You can learn more from here.