lilianweng / stock-rnn

Predict stock market prices using RNN model with multilayer LSTM cells + optional multi-stock embeddings.
https://lilianweng.github.io/lil-log
1.75k stars, 659 forks

The "Time-reversed" Problem of Some Crawled Data #13

Open MiracleXYZ opened 6 years ago

MiracleXYZ commented 6 years ago

When I downloaded historical S&P 500 data from Yahoo Finance (^GSPC), I found that it was time-reversed: the latest entries were at the top of the DataFrame. The same is true of nearly all the files in the provided data archive (stock-data-lilianweng.tar.gz), except SP500.csv and _SP500.csv. The problem is that we never sort the data by time, even though the LSTM model needs its input sequence in chronological order. In data_model.py, lines 25 to 35:

        # Read csv file
        raw_df = pd.read_csv(os.path.join("data", "%s.csv" % stock_sym))

        # Merge into one sequence
        if close_price_only:
            self.raw_seq = raw_df['Close'].tolist()
        else:
            self.raw_seq = [price for tup in raw_df[['Open', 'Close']].values for price in tup]

        self.raw_seq = np.array(self.raw_seq)
        self.train_X, self.train_y, self.test_X, self.test_y = self._prepare_data(self.raw_seq)

We simply extract the close prices from the DataFrame without checking the order of the dates. As a result, for the time-reversed files we use the earliest 10% of the data for testing instead of the latest 10%, which is unreasonable.
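
To make the effect concrete, here is a small sketch. It assumes the split keeps the leading 90% of the sequence for training and the trailing 10% for testing; the real `_prepare_data` is not reproduced here, and `tail_split` is a hypothetical stand-in for it.

```python
def tail_split(seq, test_ratio=0.1):
    """Hypothetical stand-in for the split: the last `test_ratio` of seq is the test set."""
    train_size = int(len(seq) * (1.0 - test_ratio))
    return seq[:train_size], seq[train_size:]


# Day indices 0..9, where 0 is the oldest day and 9 is the newest.
oldest_first = list(range(10))
newest_first = oldest_first[::-1]  # row order of a time-reversed CSV file

_, test_chronological = tail_split(oldest_first)
_, test_reversed = tail_split(newest_first)

print(test_chronological)  # [9] -> the newest day, as intended
print(test_reversed)       # [0] -> the oldest day, which is the bug described above
```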

Maybe we should sort the data by date before extracting the closing prices, or otherwise make sure the data is read in the right order and in a consistent format.
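
A minimal sketch of the sorting approach, assuming each CSV has a `Date` column (as Yahoo Finance exports usually do). The helper name `load_sorted_prices` and the inline sample data are made up for illustration and are not part of the repository.

```python
import io

import pandas as pd


def load_sorted_prices(csv_source, date_col="Date"):
    """Read a price CSV and return its rows sorted oldest-first.

    `date_col` is an assumption about the file layout; adjust it if the
    column has a different name.
    """
    raw_df = pd.read_csv(csv_source, parse_dates=[date_col])
    # Sort ascending by date so the end of the sequence holds the latest prices.
    return raw_df.sort_values(date_col).reset_index(drop=True)


# Tiny newest-first sample that mimics the time-reversed files.
sample_csv = io.StringIO(
    "Date,Open,Close\n"
    "2017-01-05,12.0,12.5\n"
    "2017-01-04,11.0,11.5\n"
    "2017-01-03,10.0,10.5\n"
)

sorted_df = load_sorted_prices(sample_csv)
raw_seq = sorted_df["Close"].tolist()
print(raw_seq)  # [10.5, 11.5, 12.5]
```

Sorting unconditionally is harmless for files that are already in chronological order, such as SP500.csv, so the same code path could serve both kinds of file.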