huseinzol05 / Stock-Prediction-Models

Gathers machine learning and deep learning models for Stock forecasting including trading bots and simulations
Apache License 2.0

Save and load agents #97

Open Lucho00gh opened 3 years ago

Lucho00gh commented 3 years ago

Please, any suggestions on how to save trained agents, for example 19.recurrent-curiosity-q-learning-agent?

Thanks a lot!!

waudinio27 commented 3 years ago

Hello Lucho!

You can do it with joblib (pip install joblib):

from sklearn.externals import joblib

joblib_file = "joblib_model.pkl"  

joblib.dump(model, joblib_file)

C:\Users\Peaq\Anaconda3\lib\site-packages\sklearn\externals\joblib\__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)

['joblib_model.pkl']
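
(The FutureWarning above only appears because joblib is imported through sklearn.externals, which has since been removed; with a standalone install the same dump works directly. A minimal sketch, assuming joblib was installed with pip install joblib:)

import joblib

joblib_file = "joblib_model.pkl"
joblib.dump(model, joblib_file)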

After this you can load the model in the same way:

joblib_model = joblib.load(joblib_file)

joblib_model

<__main__.Model at 0x208b0ec9b88>

Lucho00gh commented 3 years ago

Thanks a lot for your answer, but in 19.recurrent-curiosity-q-learning-agent there is no "model", there is an "agent", and what I need is to save that trained agent. So trying

joblib.dump(agent, joblib_file)

I get the following error:

'can't pickle _thread.RLock objects'

Thanks

waudinio27 commented 3 years ago

Hello Lucho!

The method I showed you is for 6.evolution-strategy-agent.ipynb, which is also used in the realtime-agent.

If you want to save the trained agent of the 19.recurrent-curiosity-q-learning-agent, you can try the HDF5 file format, as shown in this blog: https://www.geeksforgeeks.org/ml-saving-a-deep-learning-model-in-keras/?ref=rp
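
The core of that approach is just the Keras save/load pair, roughly like this (a minimal sketch with a toy stand-in network, assuming the agent's network is exposed as a Keras model; the recurrent-curiosity agent builds a raw TF graph, so it would need adapting):

from tensorflow import keras

# toy stand-in network; in practice this would be the agent's own model
model = keras.Sequential([keras.layers.Dense(3, input_shape=(30,))])
model.save("agent_model.h5")                          # architecture + weights in one HDF5 file
restored = keras.models.load_model("agent_model.h5")  # rebuilds the model from the HDF5 file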

Hope it helps this time

Lucho00gh commented 3 years ago

Thanks for your answer waudinio!! I will try that way

windowshopr commented 3 years ago

@Lucho00gh Did you find a solution for saving and reloading the agent? I'm trying to figure this out too. The article that @waudinio27 posted is (I think) for saving sequential models, but what we're after is saving model sessions/graphs?

I'm using one of the actor critic scripts, which is similar to the agent stuff. I currently have it set to save the .meta, .index and checkpoint files as it trains, but I'm looking for a way to reload those files in a separate script to make predictions with, and I'm having a hard time figuring out what to do.

If you have some code to share, please do. I'm going to keep trying and post what I find out. Currently, my saving of the models just looks like this:

self.saver = tf.train.Saver(max_to_keep=1)
# Then, during training, use this...
save_path = self.saver.save(self.sess, os.path.join(self.output_folder, 'saved_models', self.stock_name + "_model"))
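
The matching restore side should look roughly like this (just a sketch, assuming the prediction script rebuilds the identical graph and session first and points at the same saved_models folder):

# after rebuilding the same graph/session in the prediction script...
saver = tf.train.Saver()
checkpoint_dir = os.path.join(self.output_folder, 'saved_models')
saver.restore(self.sess, tf.train.latest_checkpoint(checkpoint_dir))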

Then when I try to use .restore() in the secondary script, the model either doesn't make any predictions at all, or they're randomized each time the script is run. I'm working on a reproducible example, using a script from this repository, that hopefully someone can shed some light on. Stay tuned.

windowshopr commented 3 years ago

Here's how I got @waudinio27's solution to work. Basically, I created two .py scripts (using 6.evolution-strategy-agent): a train script and a forward script. The train script is basically the default; I just changed the iterations to 200 instead of 500. Then I added the joblib import at the top (with two different ways of importing it, depending on the versions you have), and during training it monitors the reward at each checkpoint. If the current checkpoint's reward is greater than the best reward so far, the model is dumped to the joblib file.

Then, in the second script, I kept everything the same as train.py, just commented out anything pertaining to training, and near the bottom defined the joblib model as the model for the agent. Here are the train.py and forward.py scripts for everyone's reference. I'm wondering: can this same thing be used for the agent/TF session agents as well?

train.py

import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns
import random
sns.set()

import pkg_resources
import types
# from sklearn.externals import joblib
import joblib

def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            name = val.__name__.split('.')[0]
        elif isinstance(val, type):
            name = val.__module__.split('.')[0]
        poorly_named_packages = {'PIL': 'Pillow', 'sklearn': 'scikit-learn'}
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
        yield name

imports = list(set(get_imports()))
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name != 'pip':
        requirements.append((m.project_name, m.version))

for r in requirements:
    print('{}=={}'.format(*r))

df = pd.read_csv('../dataset/GOOG-year.csv')
print(df.head())

class Deep_Evolution_Strategy:

    inputs = None

    def __init__(
        self, weights, reward_function, population_size, sigma, learning_rate
    ):
        self.weights = weights
        self.reward_function = reward_function
        self.population_size = population_size
        self.sigma = sigma
        self.learning_rate = learning_rate

    def _get_weight_from_population(self, weights, population):
        weights_population = []
        for index, i in enumerate(population):
            jittered = self.sigma * i
            weights_population.append(weights[index] + jittered)
        return weights_population

    def get_weights(self):
        return self.weights

    def train(self, epoch = 100, print_every = 1):
        lasttime = time.time()
        last_best_reward = 0

        for i in range(epoch):
            population = []
            rewards = np.zeros(self.population_size)
            for k in range(self.population_size):
                x = []
                for w in self.weights:
                    x.append(np.random.randn(*w.shape))
                population.append(x)
            for k in range(self.population_size):
                weights_population = self._get_weight_from_population(
                    self.weights, population[k]
                )
                rewards[k] = self.reward_function(weights_population)
            rewards = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-7)
            for index, w in enumerate(self.weights):
                A = np.array([p[index] for p in population])
                self.weights[index] = (
                    w
                    + self.learning_rate
                    / (self.population_size * self.sigma)
                    * np.dot(A.T, rewards).T
                )
            if (i + 1) % print_every == 0:
                print(
                    'iter %d. reward: %f'
                    % (i + 1, self.reward_function(self.weights))
                )
                # Check if current checkpoint beat the last best checkpoint.
                # Save model if so
                if self.reward_function(self.weights) > last_best_reward:
                    joblib_file = "joblib_model.pkl"  
                    joblib.dump(model, joblib_file)
                    last_best_reward = self.reward_function(self.weights)
                    print('Saved new model.')
        print('time taken to train:', time.time() - lasttime, 'seconds')

class Model:
    def __init__(self, input_size, layer_size, output_size):
        self.weights = [
            np.random.randn(input_size, layer_size),
            np.random.randn(layer_size, output_size),
            np.random.randn(1, layer_size),
        ]

    def predict(self, inputs):
        feed = np.dot(inputs, self.weights[0]) + self.weights[-1]
        decision = np.dot(feed, self.weights[1])
        return decision

    def get_weights(self):
        return self.weights

    def set_weights(self, weights):
        self.weights = weights

class Agent:

    POPULATION_SIZE = 15
    SIGMA = 0.1
    LEARNING_RATE = 0.03

    def __init__(self, model, window_size, trend, skip, initial_money):
        self.model = model
        self.window_size = window_size
        self.half_window = window_size // 2
        self.trend = trend
        self.skip = skip
        self.initial_money = initial_money
        self.es = Deep_Evolution_Strategy(
            self.model.get_weights(),
            self.get_reward,
            self.POPULATION_SIZE,
            self.SIGMA,
            self.LEARNING_RATE,
        )

    def act(self, sequence):
        decision = self.model.predict(np.array(sequence))
        return np.argmax(decision[0])

    def get_state(self, t):
        window_size = self.window_size + 1
        d = t - window_size + 1
        block = self.trend[d : t + 1] if d >= 0 else -d * [self.trend[0]] + self.trend[0 : t + 1]
        res = []
        for i in range(window_size - 1):
            res.append(block[i + 1] - block[i])
        return np.array([res])

    def get_reward(self, weights):
        initial_money = self.initial_money
        starting_money = initial_money
        self.model.weights = weights
        state = self.get_state(0)
        inventory = []
        quantity = 0
        for t in range(0, len(self.trend) - 1, self.skip):
            action = self.act(state)
            next_state = self.get_state(t + 1)

            if action == 1 and starting_money >= self.trend[t]:
                inventory.append(self.trend[t])
                starting_money -= close[t]

            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)
                starting_money += self.trend[t]

            state = next_state
        return ((starting_money - initial_money) / initial_money) * 100

    def fit(self, iterations, checkpoint):
        self.es.train(iterations, print_every = checkpoint)

    def buy(self):
        initial_money = self.initial_money
        state = self.get_state(0)
        starting_money = initial_money
        states_sell = []
        states_buy = []
        inventory = []
        for t in range(0, len(self.trend) - 1, self.skip):
            action = self.act(state)
            next_state = self.get_state(t + 1)

            if action == 1 and initial_money >= self.trend[t]:
                inventory.append(self.trend[t])
                initial_money -= self.trend[t]
                states_buy.append(t)
                print('day %d: buy 1 unit at price %f, total balance %f'% (t, self.trend[t], initial_money))

            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)
                initial_money += self.trend[t]
                states_sell.append(t)
                try:
                    invest = ((close[t] - bought_price) / bought_price) * 100
                except:
                    invest = 0
                print(
                    'day %d, sell 1 unit at price %f, investment %f %%, total balance %f,'
                    % (t, close[t], invest, initial_money)
                )
            state = next_state

        invest = ((initial_money - starting_money) / starting_money) * 100
        total_gains = initial_money - starting_money
        return states_buy, states_sell, total_gains, invest

close = df.Close.values.tolist()
window_size = 30
skip = 1
initial_money = 10000

model = Model(input_size = window_size, layer_size = 500, output_size = 3)
agent = Agent(model = model, 
              window_size = window_size,
              trend = close,
              skip = skip,
              initial_money = initial_money)
agent.fit(iterations = 200, checkpoint = 10)

states_buy, states_sell, total_gains, invest = agent.buy()

fig = plt.figure(figsize = (15,5))
plt.plot(close, color='r', lw=2.)
plt.plot(close, '^', markersize=10, color='m', label = 'buying signal', markevery = states_buy)
plt.plot(close, 'v', markersize=10, color='k', label = 'selling signal', markevery = states_sell)
plt.title('total gains %f, total investment %f%%'%(total_gains, invest))
plt.legend()
plt.show()

forward.py

import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns
import random
sns.set()

import pkg_resources
import types
# from sklearn.externals import joblib
import joblib

def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            name = val.__name__.split('.')[0]
        elif isinstance(val, type):
            name = val.__module__.split('.')[0]
        poorly_named_packages = {'PIL': 'Pillow', 'sklearn': 'scikit-learn'}
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
        yield name

imports = list(set(get_imports()))
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name != 'pip':
        requirements.append((m.project_name, m.version))

for r in requirements:
    print('{}=={}'.format(*r))

df = pd.read_csv('../dataset/GOOG-year.csv')
print(df.head())

class Deep_Evolution_Strategy:

    inputs = None

    def __init__(
        self, weights, reward_function, population_size, sigma, learning_rate
    ):
        self.weights = weights
        self.reward_function = reward_function
        self.population_size = population_size
        self.sigma = sigma
        self.learning_rate = learning_rate

    def _get_weight_from_population(self, weights, population):
        weights_population = []
        for index, i in enumerate(population):
            jittered = self.sigma * i
            weights_population.append(weights[index] + jittered)
        return weights_population

    def get_weights(self):
        return self.weights

    # def train(self, epoch = 100, print_every = 1):
    #     lasttime = time.time()
    #     last_best_reward = 0

    #     for i in range(epoch):
    #         population = []
    #         rewards = np.zeros(self.population_size)
    #         for k in range(self.population_size):
    #             x = []
    #             for w in self.weights:
    #                 x.append(np.random.randn(*w.shape))
    #             population.append(x)
    #         for k in range(self.population_size):
    #             weights_population = self._get_weight_from_population(
    #                 self.weights, population[k]
    #             )
    #             rewards[k] = self.reward_function(weights_population)
    #         rewards = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-7)
    #         for index, w in enumerate(self.weights):
    #             A = np.array([p[index] for p in population])
    #             self.weights[index] = (
    #                 w
    #                 + self.learning_rate
    #                 / (self.population_size * self.sigma)
    #                 * np.dot(A.T, rewards).T
    #             )
    #         if (i + 1) % print_every == 0:
    #             print(
    #                 'iter %d. reward: %f'
    #                 % (i + 1, self.reward_function(self.weights))
    #             )
    #             # Check if current checkpoint beat the last best checkpoint.
    #             # Save model if so
    #             if self.reward_function(self.weights) > last_best_reward:
    #                 joblib_file = "joblib_model.pkl"  
    #                 joblib.dump(model, joblib_file)
    #                 last_best_reward = self.reward_function(self.weights)
    #                 print('Saved model.')
    #     print('time taken to train:', time.time() - lasttime, 'seconds')

class Model:
    def __init__(self, input_size, layer_size, output_size):
        self.weights = [
            np.random.randn(input_size, layer_size),
            np.random.randn(layer_size, output_size),
            np.random.randn(1, layer_size),
        ]

    def predict(self, inputs):
        feed = np.dot(inputs, self.weights[0]) + self.weights[-1]
        decision = np.dot(feed, self.weights[1])
        return decision

    def get_weights(self):
        return self.weights

    def set_weights(self, weights):
        self.weights = weights

class Agent:

    POPULATION_SIZE = 15
    SIGMA = 0.1
    LEARNING_RATE = 0.03

    def __init__(self, model, window_size, trend, skip, initial_money):
        self.model = model
        self.window_size = window_size
        self.half_window = window_size // 2
        self.trend = trend
        self.skip = skip
        self.initial_money = initial_money
        self.es = Deep_Evolution_Strategy(
            self.model.get_weights(),
            self.get_reward,
            self.POPULATION_SIZE,
            self.SIGMA,
            self.LEARNING_RATE,
        )

    def act(self, sequence):
        decision = self.model.predict(np.array(sequence))
        return np.argmax(decision[0])

    def get_state(self, t):
        window_size = self.window_size + 1
        d = t - window_size + 1
        block = self.trend[d : t + 1] if d >= 0 else -d * [self.trend[0]] + self.trend[0 : t + 1]
        res = []
        for i in range(window_size - 1):
            res.append(block[i + 1] - block[i])
        return np.array([res])

    def get_reward(self, weights):
        initial_money = self.initial_money
        starting_money = initial_money
        self.model.weights = weights
        state = self.get_state(0)
        inventory = []
        quantity = 0
        for t in range(0, len(self.trend) - 1, self.skip):
            action = self.act(state)
            next_state = self.get_state(t + 1)

            if action == 1 and starting_money >= self.trend[t]:
                inventory.append(self.trend[t])
                starting_money -= close[t]

            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)
                starting_money += self.trend[t]

            state = next_state
        return ((starting_money - initial_money) / initial_money) * 100

    # def fit(self, iterations, checkpoint):
    #     self.es.train(iterations, print_every = checkpoint)

    def buy(self):
        initial_money = self.initial_money
        state = self.get_state(0)
        starting_money = initial_money
        states_sell = []
        states_buy = []
        inventory = []
        for t in range(0, len(self.trend) - 1, self.skip):
            action = self.act(state)
            next_state = self.get_state(t + 1)

            if action == 1 and initial_money >= self.trend[t]:
                inventory.append(self.trend[t])
                initial_money -= self.trend[t]
                states_buy.append(t)
                print('day %d: buy 1 unit at price %f, total balance %f'% (t, self.trend[t], initial_money))

            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)
                initial_money += self.trend[t]
                states_sell.append(t)
                try:
                    invest = ((close[t] - bought_price) / bought_price) * 100
                except:
                    invest = 0
                print(
                    'day %d, sell 1 unit at price %f, investment %f %%, total balance %f,'
                    % (t, close[t], invest, initial_money)
                )
            state = next_state

        invest = ((initial_money - starting_money) / starting_money) * 100
        total_gains = initial_money - starting_money
        return states_buy, states_sell, total_gains, invest

close = df.Close.values.tolist()
window_size = 30
skip = 1
initial_money = 10000

# model = Model(input_size = window_size, layer_size = 500, output_size = 3)

joblib_file = "joblib_model.pkl" 
joblib_model = joblib.load(joblib_file)

agent = Agent(model = joblib_model, 
              window_size = window_size,
              trend = close,
              skip = skip,
              initial_money = initial_money)
# agent.fit(iterations = 200, checkpoint = 10)

states_buy, states_sell, total_gains, invest = agent.buy()

fig = plt.figure(figsize = (15,5))
plt.plot(close, color='r', lw=2.)
plt.plot(close, '^', markersize=10, color='m', label = 'buying signal', markevery = states_buy)
plt.plot(close, 'v', markersize=10, color='k', label = 'selling signal', markevery = states_sell)
plt.title('total gains %f, total investment %f%%'%(total_gains, invest))
plt.legend()
plt.show()

Lucho00gh commented 3 years ago

Hi windowshopr!

I tried something similar to what you did, creating two functions with saver and restore_saver, but I was not able to reproduce my trained results. Then I realized that this part of the code was introducing randomness to the model, so I tried to remove it, but I didn't get the expected results:

if np.random.rand() < self.EPSILON: action = np.random.randint(self.OUTPUT_SIZE)

I'm sorry I can't help you. If you finally solve this please share.

windowshopr commented 3 years ago

Thanks @Lucho00gh , maybe this will help.

I've made a "train.py" and a "test.py" script using 15.actor-critic-duel-agent, which should work the same for any of the other TF session agents as well. I've kept everything default in the "train.py" script, except that I've changed a few things to start saving models. Note: I created this on a Windows 10, TensorFlow 2 environment, so some editing may be needed on the user's end to get it working with their own versions.

Some notes about the changes are:

  1. I added setting global random seeds for TensorFlow and NumPy at the beginning of both scripts to get reproducible results in the "test.py" script. I also added one for the "random" dependency, but I commented it out because that affects the epsilon more than anything, and I would like to keep that part of the training fairly random.

  2. I decreased the Epsilon and Min Epsilon to 0.1 and 0.01 respectively. This is in an effort to get (close to) the same reproducible results in the test.py script.

  3. I defined the saver object in the init of the Agent, and then added saver.save() at the checkpoints in the training portion of the script further down. The user will want to play with this so that a new model is only saved when a better cost value is reached, rather than blindly saving after every checkpoint, but this keeps things as simple as possible.

  4. Then in the test script, I restore the previous model/checkpoint, and just trimmed out anything related to training. I changed the "select action" function to only reference the decision table (with no random epsilon actions) and set the same seed number at the beginning of that script.

This will now make the test.py script produce the same results each time it is run, but it may be a tad off from the training run, depending on what model was saved at what epoch during training. The user will want to play with WHEN a model is best saved, and maybe decrease (or eliminate) the epsilon even further, which would make test.py return results similar to training, but I think having at least some exploratory actions is best when training.

This was the best I could come up with, maybe someone more educated than me in saving/restoring sessions can offer some more insight. Train.py and Test.py scripts are below. Hope it helps!

train.py

import numpy as np
import pandas as pd
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
tf.disable_eager_execution()

# or if you're using TF v1
# import tensorflow as tf

import matplotlib.pyplot as plt
import seaborn as sns
from collections import deque
import random, os

sns.set()

# Set random seeds for reproducible results
tf.random.set_random_seed(42)
np.random.seed(42)
# random.seed(42)

df = pd.read_csv('../dataset/GOOG-year.csv')
df.head()

class Actor:
    def __init__(self, name, input_size, output_size, size_layer):
        with tf.variable_scope(name):
            self.X = tf.placeholder(tf.float32, (None, input_size))
            feed_actor = tf.layers.dense(self.X, size_layer, activation = tf.nn.relu)
            tensor_action, tensor_validation = tf.split(feed_actor,2,1)
            feed_action = tf.layers.dense(tensor_action, output_size)
            feed_validation = tf.layers.dense(tensor_validation, 1)
            self.logits = feed_validation + tf.subtract(feed_action,
                                                        tf.reduce_mean(feed_action,axis=1,keep_dims=True))

class Critic:
    def __init__(self, name, input_size, output_size, size_layer, learning_rate):
        with tf.variable_scope(name):
            self.X = tf.placeholder(tf.float32, (None, input_size))
            self.Y = tf.placeholder(tf.float32, (None, output_size))
            self.REWARD = tf.placeholder(tf.float32, (None, 1))
            feed_critic = tf.layers.dense(self.X, size_layer, activation = tf.nn.relu)
            tensor_action, tensor_validation = tf.split(feed_critic,2,1)
            feed_action = tf.layers.dense(tensor_action, output_size)
            feed_validation = tf.layers.dense(tensor_validation, 1)
            feed_critic = feed_validation + tf.subtract(feed_action,tf.reduce_mean(feed_action,axis=1,keep_dims=True))
            feed_critic = tf.nn.relu(feed_critic) + self.Y
            feed_critic = tf.layers.dense(feed_critic, size_layer//2, activation = tf.nn.relu)
            self.logits = tf.layers.dense(feed_critic, 1)
            self.cost = tf.reduce_mean(tf.square(self.REWARD - self.logits))
            self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(self.cost)

class Agent:

    LEARNING_RATE = 0.001
    BATCH_SIZE = 32
    LAYER_SIZE = 256
    OUTPUT_SIZE = 3
    EPSILON = 0.1
    DECAY_RATE = 0.5
    MIN_EPSILON = 0.01
    GAMMA = 0.99
    MEMORIES = deque()
    MEMORY_SIZE = 300
    COPY = 1000
    T_COPY = 0

    def __init__(self, state_size, window_size, trend, skip):
        self.state_size = state_size
        self.window_size = window_size
        self.half_window = window_size // 2
        self.trend = trend
        self.skip = skip
        tf.reset_default_graph()
        self.actor = Actor('actor-original', self.state_size, self.OUTPUT_SIZE, self.LAYER_SIZE)
        self.actor_target = Actor('actor-target', self.state_size, self.OUTPUT_SIZE, self.LAYER_SIZE)
        self.critic = Critic('critic-original', self.state_size, self.OUTPUT_SIZE, self.LAYER_SIZE, self.LEARNING_RATE)
        self.critic_target = Critic('critic-target', self.state_size, self.OUTPUT_SIZE, 
                                    self.LAYER_SIZE, self.LEARNING_RATE)
        self.grad_critic = tf.gradients(self.critic.logits, self.critic.Y)
        self.actor_critic_grad = tf.placeholder(tf.float32, [None, self.OUTPUT_SIZE])
        weights_actor = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='actor')
        self.grad_actor = tf.gradients(self.actor.logits, weights_actor, -self.actor_critic_grad)
        grads = zip(self.grad_actor, weights_actor)
        self.optimizer = tf.train.AdamOptimizer(self.LEARNING_RATE).apply_gradients(grads)
        self.sess = tf.InteractiveSession()
        self.sess.run(tf.global_variables_initializer())

        # Added saver object
        self.saver = tf.train.Saver(max_to_keep=1)
        # Make sure the checkpoint folder exists before the first save
        os.makedirs('saved_models', exist_ok=True)

    def _assign(self, from_name, to_name):
        from_w = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=from_name)
        to_w = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=to_name)
        for i in range(len(from_w)):
            assign_op = to_w[i].assign(from_w[i])
            self.sess.run(assign_op)

    def _memorize(self, state, action, reward, new_state, dead):
        self.MEMORIES.append((state, action, reward, new_state, dead))
        if len(self.MEMORIES) > self.MEMORY_SIZE:
            self.MEMORIES.popleft()

    def _select_action(self, state):
        if np.random.rand() < self.EPSILON:
            action = np.random.randint(self.OUTPUT_SIZE)
        else:
            prediction = self.sess.run(self.actor.logits, feed_dict={self.actor.X:[state]})[0]
            action = np.argmax(prediction)
        return action

    def _construct_memories_and_train(self, replay):
        states = np.array([a[0] for a in replay])
        new_states = np.array([a[3] for a in replay])
        Q = self.sess.run(self.actor.logits, feed_dict={self.actor.X: states})
        Q_target = self.sess.run(self.actor_target.logits, feed_dict={self.actor_target.X: states})
        grads = self.sess.run(self.grad_critic, feed_dict={self.critic.X:states, self.critic.Y:Q})[0]
        self.sess.run(self.optimizer, feed_dict={self.actor.X:states, self.actor_critic_grad:grads})

        rewards = np.array([a[2] for a in replay]).reshape((-1, 1))
        rewards_target = self.sess.run(self.critic_target.logits, 
                                       feed_dict={self.critic_target.X:new_states,self.critic_target.Y:Q_target})
        for i in range(len(replay)):
            if not replay[0][-1]:
                rewards[i] += self.GAMMA * rewards_target[i]
        cost, _ = self.sess.run([self.critic.cost, self.critic.optimizer], 
                                feed_dict={self.critic.X:states, self.critic.Y:Q, self.critic.REWARD:rewards})
        return cost

    def get_state(self, t):
        window_size = self.window_size + 1
        d = t - window_size + 1
        block = self.trend[d : t + 1] if d >= 0 else -d * [self.trend[0]] + self.trend[0 : t + 1]
        res = []
        for i in range(window_size - 1):
            res.append(block[i + 1] - block[i])
        return np.array(res)

    def buy(self, initial_money):
        starting_money = initial_money
        states_sell = []
        states_buy = []
        inventory = []
        state = self.get_state(0)
        for t in range(0, len(self.trend) - 1, self.skip):
            action = self._select_action(state)
            next_state = self.get_state(t + 1)

            if action == 1 and initial_money >= self.trend[t]:
                inventory.append(self.trend[t])
                initial_money -= self.trend[t]
                states_buy.append(t)
                print('day %d: buy 1 unit at price %f, total balance %f'% (t, self.trend[t], initial_money))

            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)
                initial_money += self.trend[t]
                states_sell.append(t)
                try:
                    invest = ((close[t] - bought_price) / bought_price) * 100
                except:
                    invest = 0
                print(
                    'day %d, sell 1 unit at price %f, investment %f %%, total balance %f,'
                    % (t, close[t], invest, initial_money)
                )

            state = next_state
        invest = ((initial_money - starting_money) / starting_money) * 100
        total_gains = initial_money - starting_money
        return states_buy, states_sell, total_gains, invest

    def train(self, iterations, checkpoint, initial_money):
        for i in range(iterations):
            total_profit = 0
            inventory = []
            state = self.get_state(0)
            starting_money = initial_money
            for t in range(0, len(self.trend) - 1, self.skip):
                if (self.T_COPY + 1) % self.COPY == 0:
                    self._assign('actor-original', 'actor-target')
                    self._assign('critic-original', 'critic-target')

                action = self._select_action(state)
                next_state = self.get_state(t + 1)

                if action == 1 and starting_money >= self.trend[t]:
                    inventory.append(self.trend[t])
                    starting_money -= self.trend[t]

                elif action == 2 and len(inventory) > 0:
                    bought_price = inventory.pop(0)
                    total_profit += self.trend[t] - bought_price
                    starting_money += self.trend[t]

                invest = ((starting_money - initial_money) / initial_money)

                self._memorize(state, action, invest, next_state, starting_money < initial_money)
                batch_size = min(len(self.MEMORIES), self.BATCH_SIZE)
                state = next_state
                replay = random.sample(self.MEMORIES, batch_size)
                cost = self._construct_memories_and_train(replay)
                self.T_COPY += 1
                self.EPSILON = self.MIN_EPSILON + (1.0 - self.MIN_EPSILON) * np.exp(-self.DECAY_RATE * i)
            if (i+1) % checkpoint == 0:
                print('epoch: %d, total rewards: %.3f, cost: %f, total money: %f'%(i + 1, total_profit, cost,
                                                                                  starting_money))

                # Added saving model
                self.saver.save(self.sess, os.path.join('saved_models', 'GOOG-year' + "_model"))

close = df.Close.values.tolist()
initial_money = 10000
window_size = 30
skip = 1
batch_size = 32
agent = Agent(state_size = window_size, 
              window_size = window_size, 
              trend = close, 
              skip = skip)
agent.train(iterations = 200, checkpoint = 10, initial_money = initial_money)

states_buy, states_sell, total_gains, invest = agent.buy(initial_money = initial_money)

test.py

import numpy as np
import pandas as pd
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
tf.disable_eager_execution()

# or if you're using TF v1
# import tensorflow as tf

import matplotlib.pyplot as plt
import seaborn as sns
from collections import deque
import random, os

sns.set()

# Set the same random seed numbers that was used in train.py
tf.random.set_random_seed(42)
np.random.seed(42)
# random.seed(42)

df = pd.read_csv('../dataset/GOOG-year.csv')
df.head()

class Actor:
    def __init__(self, name, input_size, output_size, size_layer):
        with tf.variable_scope(name):
            self.X = tf.placeholder(tf.float32, (None, input_size))
            feed_actor = tf.layers.dense(self.X, size_layer, activation = tf.nn.relu)
            tensor_action, tensor_validation = tf.split(feed_actor,2,1)
            feed_action = tf.layers.dense(tensor_action, output_size)
            feed_validation = tf.layers.dense(tensor_validation, 1)
            self.logits = feed_validation + tf.subtract(feed_action,
                                                        tf.reduce_mean(feed_action,axis=1,keep_dims=True))

class Critic:
    def __init__(self, name, input_size, output_size, size_layer, learning_rate):
        with tf.variable_scope(name):
            self.X = tf.placeholder(tf.float32, (None, input_size))
            self.Y = tf.placeholder(tf.float32, (None, output_size))
            self.REWARD = tf.placeholder(tf.float32, (None, 1))
            feed_critic = tf.layers.dense(self.X, size_layer, activation = tf.nn.relu)
            tensor_action, tensor_validation = tf.split(feed_critic,2,1)
            feed_action = tf.layers.dense(tensor_action, output_size)
            feed_validation = tf.layers.dense(tensor_validation, 1)
            feed_critic = feed_validation + tf.subtract(feed_action,tf.reduce_mean(feed_action,axis=1,keep_dims=True))
            feed_critic = tf.nn.relu(feed_critic) + self.Y
            feed_critic = tf.layers.dense(feed_critic, size_layer//2, activation = tf.nn.relu)
            self.logits = tf.layers.dense(feed_critic, 1)
            self.cost = tf.reduce_mean(tf.square(self.REWARD - self.logits))
            self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(self.cost)

class Agent:

    LEARNING_RATE = 0.001
    BATCH_SIZE = 32
    LAYER_SIZE = 256
    OUTPUT_SIZE = 3

    def __init__(self, state_size, window_size, trend, skip):
        self.state_size = state_size
        self.window_size = window_size
        self.half_window = window_size // 2
        self.trend = trend
        self.skip = skip
        # tf.reset_default_graph()
        self.actor = Actor('actor-original', self.state_size, self.OUTPUT_SIZE, self.LAYER_SIZE)
        self.actor_target = Actor('actor-target', self.state_size, self.OUTPUT_SIZE, self.LAYER_SIZE)
        self.critic = Critic('critic-original', self.state_size, self.OUTPUT_SIZE, self.LAYER_SIZE, self.LEARNING_RATE)
        self.critic_target = Critic('critic-target', self.state_size, self.OUTPUT_SIZE, 
                                    self.LAYER_SIZE, self.LEARNING_RATE)
        self.grad_critic = tf.gradients(self.critic.logits, self.critic.Y)
        self.actor_critic_grad = tf.placeholder(tf.float32, [None, self.OUTPUT_SIZE])
        weights_actor = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='actor')
        self.grad_actor = tf.gradients(self.actor.logits, weights_actor, -self.actor_critic_grad)
        grads = zip(self.grad_actor, weights_actor)
        self.optimizer = tf.train.AdamOptimizer(self.LEARNING_RATE).apply_gradients(grads)
        self.sess = tf.InteractiveSession()
        self.sess.run(tf.global_variables_initializer())

    def _select_action(self, state):
        prediction = self.sess.run(self.actor.logits, feed_dict={self.actor.X:[state]})[0]
        action = np.argmax(prediction)
        return action

    def get_state(self, t):
        window_size = self.window_size + 1
        d = t - window_size + 1
        block = self.trend[d : t + 1] if d >= 0 else -d * [self.trend[0]] + self.trend[0 : t + 1]
        res = []
        for i in range(window_size - 1):
            res.append(block[i + 1] - block[i])
        return np.array(res)

    def buy(self, initial_money):

        saver = tf.train.import_meta_graph("./saved_models/" + "GOOG-year" + "_model.meta")
        saver.restore(self.sess, tf.compat.v1.train.latest_checkpoint("./saved_models/")) # + self.stock_name + "_model_" + number))

        starting_money = initial_money
        states_sell = []
        states_buy = []
        inventory = []
        state = self.get_state(0)
        for t in range(0, len(self.trend) - 1, self.skip):
            action = self._select_action(state)
            next_state = self.get_state(t + 1)

            if action == 1 and initial_money >= self.trend[t]:
                inventory.append(self.trend[t])
                initial_money -= self.trend[t]
                states_buy.append(t)
                print('day %d: buy 1 unit at price %f, total balance %f'% (t, self.trend[t], initial_money))

            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)
                initial_money += self.trend[t]
                states_sell.append(t)
                try:
                    invest = ((close[t] - bought_price) / bought_price) * 100
                except:
                    invest = 0
                print(
                    'day %d, sell 1 unit at price %f, investment %f %%, total balance %f,'
                    % (t, close[t], invest, initial_money)
                )

            state = next_state
        invest = ((initial_money - starting_money) / starting_money) * 100
        total_gains = initial_money - starting_money
        return states_buy, states_sell, total_gains, invest

close = df.Close.values.tolist()
initial_money = 10000
window_size = 30
skip = 1
batch_size = 32
agent = Agent(state_size = window_size, 
              window_size = window_size, 
              trend = close, 
              skip = skip)

states_buy, states_sell, total_gains, invest = agent.buy(initial_money = initial_money)

waudinio27 commented 3 years ago

Hello, @Lucho00gh and @windowshopr!

I will try to find a new solution for saving the q-learning agent. I have an idea that I will also post here. Did you figure out how to do the live trading? I have some issues with this. Could you contact me via email, or would you two be interested in setting up a Telegram group? Together things get resolved faster.

Best Regards

windowshopr commented 3 years ago

@waudinio27 thanks, and yes keep us posted with what you create.

The test.py script I posted above would be what is used for “live” predictions. So you train on your training dataset using “train.py”, then test the now saved model (or get your live predictions) using the test.py script. All you have to do is make sure you have at least “window_size” rows in your test dataset, plus the row(s) you want to predict on. So in my example above, you would need 31 total rows in your test dataset (30 for the model’s window size, and 1 for the most recent date), and whatever it prints out as the last action (buy or sell) is your predicted action. Hopefully that makes sense.
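
As a rough sketch of what that means for the data loading at the top of test.py (the CSV path here is just a placeholder for whatever live data file you point it at):

import pandas as pd

window_size = 30
df = pd.read_csv('../dataset/GOOG-year.csv')                 # placeholder for your live data file
df_live = df.tail(window_size + 1).reset_index(drop=True)    # 30 rows for the window + 1 row to predict on
close = df_live.Close.values.tolist()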

Banila48 commented 3 years ago

For the majority of the agent-based models, the author has a get_state function; however, it seems to take input from day D to day T+1 when you are supposed to predict a buy/sell signal on day T.

Isn't this using the future price to predict?

@windowshopr @waudinio27 @Lucho00gh

windowshopr commented 3 years ago

No. The t + 1 I think you're referring to is used for the slicing that grabs the correct section of time series data from the array. When in doubt, add print-outs for those variables and you'll see how it works. 👍

Banila48 commented 3 years ago

Hey windowshopr, you are right! The get_state function is not forward forecasting, the t + 1 is correct. Thanks!

Banila48 commented 3 years ago

@windowshopr When you tried out the evolution methods last time, did you think about scaling the data? In supervised learning it is very common to scale/normalize the data, but I am unsure about agent-based learning.

How should one go about it?

windowshopr commented 3 years ago

Yes, you should scale the dataset, but I think what you're going to find with the evolution model is that it has a bad habit of overfitting to the training dataset, i.e. you'll train and save the model on a portion of your dataset, and then when you test it on another portion of the same dataset, it falls apart and makes horrible predictions. This could be due to limited training, but I had better luck with some of the other agents; I haven't looked at this repo in so long, though. Maybe you'll find a better way of using it.

For scaling the data, it depends. If you scale using % returns, you'll lose the potential price levels that momentum traders like to use. What I do when I scale data is: for any price column, or technical indicator column that uses price data (like, say, a moving average), I pick a constant number that's greater than the max of all those columns and divide the data by that number. This makes sure all price data is scaled on the same scale. Then, for any other indicator, I scale it using a scaler from sklearn, making sure the range is roughly the same as the price data so that each indicator is considered equally during training (rough sketch below).
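
Something like this is what I mean (a rough sketch only; the CSV path and column names are placeholders for whatever price and indicator columns you actually have):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('your_data.csv')                        # placeholder path

# price-based columns (price itself plus price-derived indicators) share one constant divisor
price_cols = ['Open', 'High', 'Low', 'Close', 'SMA_20']  # placeholder column names
divisor = df[price_cols].max().max() * 1.1               # constant greater than the max of all price columns
df[price_cols] = df[price_cols] / divisor

# remaining indicators get an sklearn scaler with roughly the same output range as the price data
other_cols = ['RSI_14']                                   # placeholder column name
df[other_cols] = MinMaxScaler(feature_range=(0, 1)).fit_transform(df[other_cols])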

Hope that helps, good luck.

Banila48 commented 3 years ago

Thanks for the thorough reply! I appreciate it.

Yup, that was what I found for the Deep Evolution agent. When I trained on a period of, let's say, 2010-2019 and then saved the weights for that model, the returns were good at around 600% (over 10 years). However, after I loaded the weights onto an unseen period of 2020-2021, the results were horrible. Most of the time the results were negative.

I was thinking of using RobustScaler from sklearn to scale the close price, but here is where I am unsure: I can create a new column with the scaled prices, but in the Agent class my buy function is still defined in terms of the actual close price.

Do I just train and fit the model on the normalized price but then make the buy/sell signals using the actual close? Or am I supposed to use the normalized price for both?

windowshopr commented 3 years ago

Yes! The test dataset should be on the same scale as the training set (I think), as the model was trained on scaled data. Using an sklearn scaler makes this super easy, as you can "inverse transform" the predictions to get the real price back, or, if using my divisor example above, use that same number to regenerate the real prices.

Check out

https://towardsdatascience.com/preprocessing-with-sklearn-a-complete-and-comprehensive-guide-670cb98fcfb9

and

https://datascience.stackexchange.com/questions/39932/feature-scaling-both-training-and-test-data

…for some clarity. A tip: don't scale the entire dataset before splitting into train/test sets, and don't scale them independently either. You'll want to fit the scaler on the training set first, then use that same scaler to transform the test set.

For example (pseudo code):

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
train = scaler.fit_transform(train_data)
test = scaler.transform(test_data)

Then when you’re done your predictions and want to get real values back

test = scaler.inverse_transform(test)

or if using a static number like in the example above, just multiply each point by that number again. Simple. Good luck

Banila48 commented 3 years ago

Holy freak! This is what I have been searching for. Thank you so much Windowshopr! Hahaha, I didn't really expect a reply to a thread from months ago, but here you are :D

Thanks again and all the best~ Take care 👍 😃

ghaffari903 commented 2 years ago

Hi @windowshopr, I hope you are still around. Could you please share your solution to the evolution agent overfitting problem?

windowshopr commented 2 years ago

@ghaffari903 Don't use it, haha, I was frustrated by failure there. I'd recommend using one of the other agents; evolution doesn't work great. The actor-critic ones are good 👍

Banila48 commented 2 years ago

I would also like to add, since I completed my project, that the evolutionary model always overfit no matter how my team adjusted it. I didn't have time to try out other models.

ghaffari903 commented 2 years ago

wooooow

ghaffari903 commented 2 years ago

Do you have any other social network group you could add me to? @windowshopr, do you apply these models to real trading? And how about profitability?

ghaffari903 commented 2 years ago

evolutionary model trade

ghaffari903 commented 2 years ago

Hi @windowshopr, do you use technical indicators in your code? Is it possible to share your code, or could you help me with how to add them? Thanks.