Data source, SMA calculation and Y outputs offsets

desduvauchelle commented 3 years ago

Hello,

First of all, thank you for the great tutorial, and sorry for the long post.

I was reading through the code and there are a couple of things I can't understand, plus maybe some ideas for potential improvements.

Improvements: Data selection

You are using the "price" for your calculations, which I believe isn't the best value; you usually want an "Adjusted" (aka Adj) value. Take Apple (AAPL), for example: they did a stock split about 1-2 months ago (and at the end of 2014), and you can see a huge drop in "price" in the data because of the split (e.g., 1 share became 2 shares), so the results won't be correct. I would advise using Adj Close or similar. Tesla also recently did a split.

(FYI: I ended up using Yahoo Finance to get adjusted data for my model.)
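To illustrate why a split distorts raw prices, here is a minimal sketch of back-adjusting closing prices for splits. The helper name `adjustForSplits`, the `splits` array shape, and the sample numbers are all made up for illustration; a provider's "Adj Close" typically also accounts for dividends, which this sketch ignores.

```javascript
// Hypothetical helper (not from the tutorial): back-adjust raw prices for stock splits.
// `splits` holds each split's effective date and ratio (ratio 4 means 4-for-1).
function adjustForSplits(prices, splits) {
  return prices.map(({ timestamp, price }) => {
    // Every split that takes effect after this bar's date scales the bar down.
    // ISO "YYYY-MM-DD" strings compare correctly as plain strings.
    const factor = splits
      .filter((s) => timestamp < s.date)
      .reduce((acc, s) => acc * s.ratio, 1);
    return { timestamp, price: price / factor };
  });
}

// Made-up data: the raw price "drops" from 500 to 125 on the split date.
const raw = [
  { timestamp: "2020-08-27", price: 500 },
  { timestamp: "2020-08-28", price: 500 },
  { timestamp: "2020-08-31", price: 125 },
];
const adjusted = adjustForSplits(raw, [{ date: "2020-08-31", ratio: 4 }]);
console.log(adjusted.map((p) => p.price));
```

After adjustment the series is flat, so a model no longer sees the split as a price crash.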

Maybe not correct: Calculation of the SMA

I was looking at your ComputeSMA

function ComputeSMA(data, window_size)
{
  let r_avgs = [], avg_prev = 0;
  for (let i = 0; i <= data.length - window_size; i++){
    let curr_avg = 0.00, t = i + window_size;
    for (let k = i; k < t && k <= data.length; k++){
      curr_avg += data[k]['price'] / window_size;
    }
    r_avgs.push({ set: data.slice(i, i + window_size), avg: curr_avg });
    avg_prev = curr_avg;
  }
  return r_avgs;
}

I might be reading the code wrong, but in curr_avg += data[k]['price'] / window_size you are dividing the price by the window_size before adding it to the average. For example:

const data = [1,2,3,4,5]
const window_size = data.length // => 5
// What it seems you are doing:
const avg = 1/5 + 2/5 + 3/5 + 4/5 + 5/5 // => 2.83
// How average is calculated
const avg2 = (1+2+3+4+5) / 5 // => 3

Maybe issue: Where do you offset your Y results?

As I was reading your ComputeSMA, it seems like the avg is computed over exactly the values in the set. Once again I might be wrong, but here are my thoughts. In your onClickTrainModel, you have

 let inputs = sma_vec.map(function(inp_f){
    return inp_f['set'].map(function(val) { return val['price']; })
  });
  let outputs = sma_vec.map(function(outp_f) { return outp_f['avg']; });

So it is taking the direct output from your ComputeSMA I believe. Now, when we look at your ComputeSMA again, your average is calculated like so:

 let curr_avg = 0.00, t = i + window_size;
    for (let k = i; k < t && k <= data.length; k++){
      curr_avg += data[k]['price'] / window_size;
    }

So it's taking the values from i to t which is i+window_size (so: i to i+window_size) to calculate the avg, and your set is also data.slice(i, i + window_size). So I'm not sure where you are offsetting your Y values for your model.

Question: Model

Any chance you can explain in more detail how you build your model? Some example questions:

How do you decide input_layer_neurons = 100? What does it change?
Same for const rnn_input_layer_features = 10
What does .div(tf.scalar(10)) do on your tensors? Does it normalize the data?

Any help is much appreciated.

jinglescode commented 3 years ago

Hi, thanks for checking out the project. Let's discuss each of these points.

Improvements: Data selection

Great suggestion, I have changed it to pull from these APIs:

Looking good!

Maybe not correct: Calculation of the SMA + Maybe issue: Where do you offset your Y results?

I just did some checks by adding console.log calls to inspect the data going into and out of the ComputeSMA function. It seems correct. Correct me if I'm wrong, but you may have been confused by the window_size parameter: window_size is the size of the sliding window, not the size of the data. I also changed the window size on the web UI to 5, for easy calculation.

  console.log(11, data_raw, window_size);
  sma_vec = ComputeSMA(data_raw, window_size);
  console.log(22, sma_vec)

For console.log(11, data_raw, window_size), data_raw is:

0: {timestamp: "1999-11-12", price: 89.19}
1: {timestamp: "1999-11-19", price: 86}
2: {timestamp: "1999-11-26", price: 91.12}
3: {timestamp: "1999-12-03", price: 96.12}
4: {timestamp: "1999-12-10", price: 93.87}
5: {timestamp: "1999-12-17", price: 115.25}
6: {timestamp: "1999-12-23", price: 117.44}
7: {timestamp: "1999-12-31", price: 116.75}

For console.log(22, sma_vec), sma_vec is:

0:
avg: 91.26
set: Array(5)
0: {timestamp: "1999-11-12", price: 89.19}
1: {timestamp: "1999-11-19", price: 86}
2: {timestamp: "1999-11-26", price: 91.12}
3: {timestamp: "1999-12-03", price: 96.12}
4: {timestamp: "1999-12-10", price: 93.87}
length: 5

1:
avg: 96.472
set: Array(5)
0: {timestamp: "1999-11-19", price: 86}
1: {timestamp: "1999-11-26", price: 91.12}
2: {timestamp: "1999-12-03", price: 96.12}
3: {timestamp: "1999-12-10", price: 93.87}
4: {timestamp: "1999-12-17", price: 115.25}
length: 5

So what is happening is that it averages the first 5 values, (89.19 + 86 + 91.12 + 96.12 + 93.87) / 5 = 91.26. Then it slides one step and averages the next 5 values, (86 + 91.12 + 96.12 + 93.87 + 115.25) / 5 = 96.472.

Am I correct? Did I answer your question? Or did I make a mistake?

Question: Model

On "explain in more detail how you build your model": I would need to expand on that in the article. But here are some pointers in the meantime:

input_layer_neurons = 100: I simply plucked the 100 from thin air; there is no scientific reason for it. It is a model parameter which you can tune. This is the parameter for the linear (or dense) layer. Generally, the higher it is, the better the model can memorize, though overfitting can become an issue if it is too high. Maybe 32 is good? Maybe 128 gives you a good result for a particular stock.
rnn_input_layer_features = 10: same as input_layer_neurons, but this is the parameter for the RNN.
.div(tf.scalar(10)): honestly, I can't quite remember what this is for, but it is there to make the tensor size correct.

Hope these are useful, we can discuss more.

desduvauchelle commented 3 years ago

Hi, thanks for answering!

Data

Awesome. Glad it worked. I hope it helps.

Average

I just did a test and it does give the average. Apologies for that. For some reason I can't wrap my head around why it's working ha :) I usually use reduce functions for this.

const calculateAverage = (quotes = [89.19, 86, 91.12]) => {
    // Sum all quotes (starting from 0), then divide once by the count
    return quotes.reduce((total, num) => total + num, 0) / quotes.length
}
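On why the original loop still produces the average: division distributes over addition, so adding up price / window_size for each price gives the same result as adding the prices first and dividing once. A small check, using the five prices from the first window in the example above (the variable names are just for illustration):

```javascript
// The first 5-value window from the example output in this thread.
const quotes = [89.19, 86, 91.12, 96.12, 93.87];
const n = quotes.length;

// Style used in ComputeSMA: divide each term by n, then accumulate.
const perTermAvg = quotes.reduce((acc, q) => acc + q / n, 0);

// Usual style: accumulate the sum, then divide once.
const sumFirstAvg = quotes.reduce((acc, q) => acc + q, 0) / n;

// The two can differ only by floating-point rounding.
console.log(perTermAvg.toFixed(2), sumFirstAvg.toFixed(2));
```

Both values agree up to floating-point rounding, which is why the two styles are interchangeable here.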

Shifting the Ys

Maybe I need to clarify this question. Using your example above for the data:

avg: 91.26
set: Array(5)
0: {timestamp: "1999-11-12", price: 89.19}
1: {timestamp: "1999-11-19", price: 86}
2: {timestamp: "1999-11-26", price: 91.12}
3: {timestamp: "1999-12-03", price: 96.12}
4: {timestamp: "1999-12-10", price: 93.87}

I'm not seeing where you set the future value, right now I think you are saying

const X = [
   {timestamp: "1999-11-12", price: 89.19},
   {timestamp: "1999-11-19", price: 86},
   {timestamp: "1999-11-26", price: 91.12},
   {timestamp: "1999-12-03", price: 96.12},
   {timestamp: "1999-12-10", price: 93.87}
]

const Y = 91.26 // <= The actual average for that period X

If that is correct, it would mean that your model is learning how to calculate an average, not forecast in the future. But I'm probably not seeing/understanding something.

Model

Yes, I think adding it to your article would be great! A whole breakdown of your model.js file would be amazing.

Thanks again.

desduvauchelle commented 3 years ago

oh also, relating to the model, I was thinking the .div(tf.scalar(10)) is to "normalize" the data. If it is, wouldn't it be better to do a soft min/max ? So something along the lines of:

const normalizedInputs = xs.sub(inputMin).div(inputMax.sub(inputMin))
const normalizedOutputs = ys.sub(outputMin).div(outputMax.sub(outputMin))

In plain JS, it would look like this:

const normalize = (value, min, max) => { 
    if (min === undefined || max === undefined) {
        return value
    }
    return (value - min) / (max - min)
}
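A slightly fuller sketch of the same idea, assuming min and max are taken from the training data and that predictions must be mapped back to prices afterwards. The names `fitMinMax`, `normalizeWith`, and `denormalizeWith` are hypothetical and not part of the tutorial code.

```javascript
// Hypothetical helpers for min/max scaling; not taken from the tutorial's model.js.
function fitMinMax(values) {
  return { min: Math.min(...values), max: Math.max(...values) };
}

// Map a value into [0, 1]; a constant series (max === min) maps to 0 to avoid dividing by zero.
const normalizeWith = (value, { min, max }) =>
  max === min ? 0 : (value - min) / (max - min);

// Inverse mapping, needed to turn a scaled prediction back into a price.
const denormalizeWith = (scaled, { min, max }) => scaled * (max - min) + min;

const prices = [89.19, 86, 91.12, 96.12, 93.87];
const range = fitMinMax(prices);
const scaled = prices.map((p) => normalizeWith(p, range));
const restored = scaled.map((s) => denormalizeWith(s, range));
console.log(scaled, restored);
```

Keeping the fitted `range` around matters: the same min and max have to be reused for validation data and for un-scaling the model output.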

jinglescode commented 3 years ago

Oh yes, I got your question on the Shifting the Ys now. I just checked these:

By logging line 97:

sma_vec = ComputeSMA(data_raw, window_size);
console.log(sma_vec)

And looking at line 190:

  console.log('train X', inputs)
  console.log('train Y', outputs)

So yes, you are right that the model is calculating the average, while the aim is to predict the future SMA. That means predicting the next point is "pointless", but predicting the next 10 points (or however far you want to go) is more helpful, because you are predicting whether it goes up or down next (and by how much). Also, SMA is possibly the easiest indicator to understand (for learning), which is why it was chosen for the tutorial. Alternatively, as you have suggested, we could shift the Y and predict the next SMA instead of the current one, so that it makes a bit more sense and is not just computing the average.

In short, there are a few better solutions:

predicting the next n points; this requires changing the model to a sequence-to-sequence model
shifting the y, so the model predicts future values of the technical analysis indicator (which can be another indicator), e.g. using days 1 to 5 to predict the SMA value on day 10
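The second option could be sketched roughly as follows. This is only an illustration under assumptions: `buildShiftedSamples` and its `offset` parameter are hypothetical names, and the target is taken to be the SMA of the window that starts `offset` steps later. The data is the example series logged earlier in this thread.

```javascript
// Hypothetical sketch: pair each input window with the SMA of a later window.
function buildShiftedSamples(data, windowSize, offset) {
  const samples = [];
  // Stop early enough that the shifted window still fits inside the data.
  for (let i = 0; i + windowSize + offset <= data.length; i++) {
    const inputWindow = data.slice(i, i + windowSize);
    const futureWindow = data.slice(i + offset, i + offset + windowSize);
    const futureAvg =
      futureWindow.reduce((sum, d) => sum + d.price, 0) / windowSize;
    samples.push({ x: inputWindow.map((d) => d.price), y: futureAvg });
  }
  return samples;
}

const dataRaw = [
  { timestamp: "1999-11-12", price: 89.19 },
  { timestamp: "1999-11-19", price: 86 },
  { timestamp: "1999-11-26", price: 91.12 },
  { timestamp: "1999-12-03", price: 96.12 },
  { timestamp: "1999-12-10", price: 93.87 },
  { timestamp: "1999-12-17", price: 115.25 },
  { timestamp: "1999-12-23", price: 117.44 },
  { timestamp: "1999-12-31", price: 116.75 },
];

// With offset = 1, the first sample's target is the SMA of the second window.
const samples = buildShiftedSamples(dataRaw, 5, 1);
console.log(samples[0]);
```

With a shift, the target includes at least one price the input window has not seen, so the model can no longer succeed by simply averaging its inputs.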

What do you think? Make sense? I would love to see what you have done and hope you can do a PR on the cool things you've done.

brandonculver commented 3 years ago

function ComputeSMA(data, window_size)
{
  let r_avgs = [], avg_prev = 0;
  for (let i = 0; i <= data.length - window_size; i++){
    let curr_avg = 0.00, t = i + window_size;
    for (let k = i; k < t && k <= data.length; k++){
      curr_avg += data[k]['price'] / window_size;
    }
    r_avgs.push({ set: data.slice(i, i + window_size), avg: curr_avg });
    avg_prev = curr_avg;
  }
  return r_avgs;
}

You never actually use avg_prev for anything here.
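One possible cleanup, assuming avg_prev is simply dead code and nothing else depends on it. The name `computeSma` is a hypothetical rename to avoid clashing with the original; it keeps the same `{ set, avg }` output shape.

```javascript
// Hypothetical cleaned-up variant of ComputeSMA without the unused avg_prev.
function computeSma(data, windowSize) {
  const result = [];
  // One entry per full window; the condition keeps the slice inside the data.
  for (let i = 0; i + windowSize <= data.length; i++) {
    const set = data.slice(i, i + windowSize);
    const avg = set.reduce((sum, d) => sum + d.price, 0) / windowSize;
    result.push({ set, avg });
  }
  return result;
}

// Example series taken from the console output earlier in this thread.
const series = [
  { timestamp: "1999-11-12", price: 89.19 },
  { timestamp: "1999-11-19", price: 86 },
  { timestamp: "1999-11-26", price: 91.12 },
  { timestamp: "1999-12-03", price: 96.12 },
  { timestamp: "1999-12-10", price: 93.87 },
  { timestamp: "1999-12-17", price: 115.25 },
  { timestamp: "1999-12-23", price: 117.44 },
  { timestamp: "1999-12-31", price: 116.75 },
];
const smaVec = computeSma(series, 5);
console.log(smaVec.map((entry) => entry.avg.toFixed(3)));
```

Slicing the window first also removes the need for the extra `k <= data.length` check in the inner loop.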

jinglescode commented 3 years ago

Thanks @brandonculver for highlighting that. That must be a bug. Feel free to reply here if you have fixed it or do a PR.

tomtom94 commented 2 years ago

You may want to have a look at this one: https://github.com/tomtom94/stockmarketpredictions

jinglescode / time-series-forecasting-tensorflowjs