sintel-dev / Orion

A machine learning library for detecting anomalies in signals.
https://sintel.dev/Orion/
MIT License

Can Orion handle training a 2TB dataset? #567

Open bigmisspanda opened 2 days ago

bigmisspanda commented 2 days ago

Description

In my case, the training data is very large and cannot be loaded into memory all at once. It seems that time_segments_aggregate, SimpleImputer, MinMaxScaler, and rolling_window_sequences in the pipeline all require the data to be stored in memory. Can Orion handle training on a 2-10TB dataset?

sarahmish commented 14 hours ago

Hi @bigmisspanda – thank you for your question!

You are right, all of the preprocessing primitives require the data to be in memory.

One workaround is to replace these primitives with your own scalable functions and then start the Orion pipeline from the modeling primitive directly. Another is to chunk up your training data and train the pipeline on each chunk.
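To illustrate the first workaround, here is a minimal sketch (not part of Orion's API) of replacing the in-memory MinMaxScaler with a streaming two-pass version: a first pass over chunks accumulates the global min/max, and a second pass scales each chunk with those statistics. The chunks could come from, e.g., `pandas.read_csv(..., chunksize=...)` or any iterator over array blocks; names and the `(-1, 1)` feature range are illustrative assumptions.

```python
import numpy as np

def chunked_minmax_fit(chunks):
    """First pass: accumulate global per-column min/max without
    ever holding the full dataset in memory."""
    lo, hi = None, None
    for chunk in chunks:
        cmin, cmax = chunk.min(axis=0), chunk.max(axis=0)
        lo = cmin if lo is None else np.minimum(lo, cmin)
        hi = cmax if hi is None else np.maximum(hi, cmax)
    return lo, hi

def chunked_minmax_transform(chunk, lo, hi, feature_range=(-1, 1)):
    """Second pass: scale one chunk into feature_range using the
    global statistics; constant columns map to the lower bound."""
    a, b = feature_range
    scale = np.where(hi > lo, (b - a) / np.where(hi > lo, hi - lo, 1.0), 0.0)
    return a + (chunk - lo) * scale
```

The scaled chunks can then be written back to disk (or fed in batches to the modeling primitive), so only one chunk is resident at a time.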