When I downloaded historical S&P 500 data from Yahoo Finance (^GSPC), I found that it was in reverse chronological order, i.e. the latest entries were at the top of the DataFrame. The same is true of nearly all the files in the provided data archive (stock-data-lilianweng.tar.gz), except SP500.csv and _SP500.csv.
Here is the problem: we never sort the data by time, yet an LSTM model needs its input sequence in chronological order. In data_model.py, lines 25 to 35:
# Read csv file
raw_df = pd.read_csv(os.path.join("data", "%s.csv" % stock_sym))
# Merge into one sequence
if close_price_only:
    self.raw_seq = raw_df['Close'].tolist()
else:
    self.raw_seq = [price for tup in raw_df[['Open', 'Close']].values for price in tup]
self.raw_seq = np.array(self.raw_seq)
self.train_X, self.train_y, self.test_X, self.test_y = self._prepare_data(self.raw_seq)
We simply extract the close prices from the DataFrame without checking the time order. For the reversed files, the end of the sequence, which becomes the test set, therefore holds the earliest 10% of the data instead of the latest 10%, which is unreasonable.
Maybe we should sort the data by time before extracting the closing prices, or otherwise make sure the data is read in the right order and in a consistent format.
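A minimal sketch of the sorting idea, not a tested patch for data_model.py. It assumes the CSV has a "Date" column (as Yahoo Finance downloads usually do); the helper name load_sorted_prices and the sample rows are hypothetical and only there for illustration.

```python
import io

import pandas as pd


def load_sorted_prices(csv_source, date_col="Date"):
    """Read a price CSV and return it oldest-first.

    `date_col` is an assumption about the file layout; adjust it if the
    files in the archive name the column differently.
    """
    raw_df = pd.read_csv(csv_source, parse_dates=[date_col])
    # Sort by the parsed dates so the order no longer depends on the file.
    return raw_df.sort_values(date_col).reset_index(drop=True)


# Made-up newest-first sample that mimics the reversed files.
sample_csv = io.StringIO(
    "Date,Open,Close\n"
    "2017-01-05,12.0,12.5\n"
    "2017-01-04,11.0,11.5\n"
    "2017-01-03,10.0,10.5\n"
)

sorted_df = load_sorted_prices(sample_csv)
close_seq = sorted_df["Close"].tolist()  # oldest price first after sorting
```

With the rows in chronological order, taking the test split from the end of the sequence would pick the most recent prices. Sorting unconditionally also leaves files that are already oldest-first, such as SP500.csv, unchanged.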