bigmisspanda opened this issue 2 months ago
Hi @bigmisspanda – thank you for your question!
You are right, all the preprocessing primitives require the data to be in memory.
One workaround can be to replace these primitives with your own scalable functions and then start the Orion pipeline from the modeling primitive directly. Another can be to chunk up your training data and train the pipeline on each chunk.
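For the chunking route, here is a minimal sketch, assuming the training data lives in a CSV file (the file name, column layout, and chunk size below are hypothetical) and using the standard `Orion` API. Whether repeated `fit` calls accumulate knowledge or retrain from scratch depends on the underlying pipeline, so treat this as a starting point rather than a drop-in solution.

```python
# Sketch only: chunked training; file name and chunk size are hypothetical.
import pandas as pd
from orion import Orion

orion = Orion(pipeline='tadgan')

# Stream the file in manageable pieces instead of loading everything at once.
for chunk in pd.read_csv('sensor_data.csv', chunksize=1_000_000):
    orion.fit(chunk)
```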
Yes, thank you for your help. I understand what you mean. My plan is to use `TadGAN` to train an anomaly detection model. My data comes from power equipment sensors and has over 20 features. If I train in chunks, the `MinMaxScaler` results will not reflect the global distribution. I referred to the information in this document, and my plan is:

1. Use `partial_fit` for global calculations on the dataset in advance.
2. Replace the `MinMaxScaler` at the third step in the primitives with the pre-fitted one.

Is my approach feasible? Can `TadGAN` perform similar `partial_fit` training on continuous streaming data?
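A minimal sketch of step 1 of this plan, assuming the raw data sits in a CSV file (the file name, column names, chunk size, and feature range below are hypothetical). scikit-learn's `MinMaxScaler` supports `partial_fit`, so the global min/max can be accumulated one chunk at a time without ever holding the full dataset in memory:

```python
# Sketch only: accumulate global MinMaxScaler statistics chunk by chunk.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))  # assumed range; match your pipeline

# First pass: learn the global min/max incrementally.
for chunk in pd.read_csv('sensor_data.csv', chunksize=1_000_000):
    scaler.partial_fit(chunk.drop(columns=['timestamp']))

# Later passes: transform each chunk with the global statistics before
# handing it to the remaining pipeline steps.
for chunk in pd.read_csv('sensor_data.csv', chunksize=1_000_000):
    scaled = scaler.transform(chunk.drop(columns=['timestamp']))
    # ... feed `scaled` to the downstream primitives / TadGAN training ...
```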
Your plan looks logical to me!
I'm not too familiar with what `partial_fit` does under the hood; however, calling `fit` multiple times on different data chunks seems analogous to their concept of "incremental learning".
The concept of `partial_fit` is consistent with incremental learning. I will follow this approach for testing and training. Thank you for your great work!
Description
In my case, the training data is very large and cannot be loaded into memory all at once. It seems that `time_segments_aggregate`, `SimpleImputer`, `MinMaxScaler`, and `rolling_window_sequences` in the pipeline all require the data to be stored in memory. Can Orion handle training on a 2-10 TB dataset?