why shuffling data? - Githubissues

ochoch commented 5 years ago

Hello, Nice and interesting work, I learned a lot. While building the train and test datasets, why do you shuffle the data? I thought that for time series we should not shuffle the data.

data_utils.py

    def split_dataset(dataset, ratio=None):
        size = dataset.size
        if ratio is None:
            ratio = _choose_optimal_train_ratio(size)

        mask = np.zeros(size, dtype=np.bool_)
        train_size = int(size * ratio)
        mask[:train_size] = True
        np.random.shuffle(mask)

        train_x = dataset.x[mask, :]
        train_y = dataset.y[mask]

        mask = np.invert(mask)
        test_x = dataset.x[mask, :]
        test_y = dataset.y[mask]

        return DataSet(train_x, train_y), DataSet(test_x, test_y)

Regards,

maxim5 commented 5 years ago

Hi @ochoch I think you're right. At that time I thought it was a good idea to shuffle the data, but now I'd say it leads to overfitting and forward-looking bias.
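For reference, a minimal sketch of a split that keeps time order, so that every test row is later than every training row. It assumes the rows are already sorted oldest first. The `DataSet` class here is a hypothetical stand-in with only `x`, `y` and `size`, and `split_dataset_chronological` is an illustrative name, not a function from the repository.

```python
import numpy as np


class DataSet:
    """Hypothetical stand-in for the repository's DataSet (x, y, size only)."""

    def __init__(self, x, y):
        self.x = np.asarray(x)
        self.y = np.asarray(y)
        self.size = len(self.y)


def split_dataset_chronological(dataset, ratio=0.8):
    """Split without shuffling: the earliest rows train, the latest rows test."""
    train_size = int(dataset.size * ratio)
    train = DataSet(dataset.x[:train_size, :], dataset.y[:train_size])
    test = DataSet(dataset.x[train_size:, :], dataset.y[train_size:])
    return train, test


# Toy data: rows are assumed to be sorted by time, oldest first.
x = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
train, test = split_dataset_chronological(DataSet(x, y), ratio=0.8)
```

Because the test set is the most recent slice, the model is never evaluated on a period that is surrounded by training rows, which is where the forward-looking bias of the shuffled mask comes from.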

ochoch commented 5 years ago

Hi Maxim, Thanks for your reply. I played a bit with your implementation and added a provider (FXCM) using fxcmpy (https://github.com/fxcm/RestAPI/tree/master/fxcmpy).

In the end, since connecting to the FXCM servers is time-consuming and they do not deliver the last bar (!), I integrated your Python scripts with MT4. On each tick I write the latest data to a CSV file (this replaces the get_latest_data method) and fill the raw_df dataframe with read_csv. Then I run predict.py, get the prediction for the next bar, and draw the result on a chart...
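The CSV handoff described above could look roughly like the sketch below. The column names and the `load_latest_bars` helper are assumptions for illustration; the actual file layout written by MT4 and the shape of `raw_df` expected by the repository are not shown in this thread.

```python
import io

import pandas as pd

# Hypothetical CSV layout: MT4 is assumed to write one row per bar.
# The column names below are illustrative, not taken from the repository.
SAMPLE_CSV = """time,open,high,low,close,volume
2019-04-19 10:15,1.1248,1.1255,1.1241,1.1243,1310
2019-04-19 10:00,1.1240,1.1252,1.1236,1.1248,1520
2019-04-19 10:30,1.1243,1.1249,1.1230,1.1234,1675
"""


def load_latest_bars(source):
    """Read exported bars and return them sorted by time, oldest first."""
    raw_df = pd.read_csv(source, parse_dates=["time"])
    return raw_df.sort_values("time").reset_index(drop=True)


# In practice `source` would be the path of the file MT4 writes on each tick.
raw_df = load_latest_bars(io.StringIO(SAMPLE_CSV))
```

Sorting by time before handing the frame on keeps the rows in chronological order even if the export appends bars out of sequence.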

[image: image.png]

At this stage, I am also calculating some accuracy... And to be honest, it is quite hard to get tradable predictions...

I get roughly the following accuracy in forward testing:

| TF  | High Accuracy (%) | Low Accuracy (%) |
|-----|-------------------|------------------|
| m15 | 57.25             | 56.29            |
| H4  | 56.25             | 63.55            |
| D1  | 65.63             | 57.29            |
| W1  | 52.08             | 58.33            |

Maybe we should add some additional features using a feature selection algorithm. Any insights?

Regards,

och


maxim5 commented 5 years ago

Hi @ochoch sorry for the delay.

Unfortunately that's the way it is: there is so much noise and so little signal in financial data. If you are able to find a reliable signal more than 50% accurate, it's good enough and you can make money.

In terms of features: that's the key question. All ML algorithms that make money boil down to features. I haven't worked much on crypto data since then. Do you have any ideas in mind?

maxim5 / time-series-machine-learning

why shuffling data? #7