thompson0012 / PyEmits

Sugar candy for data scientist. Easy manipulation in time-series data analytics works.
Apache License 2.0
7 stars 1 forks source link
anomaly-detection automl-pipeline data-science etl etl-pipeline etl-process forecasting framework machine-learning python time-series time-series-analysis

Project Icon

What is Pyemits

PyEmits, a python package for easy manipulation in time-series data.

The ultimate goal:

Keep it simple and stupid

Make everything configurable

Uniform API for machine learning and deep learning

Why need Pyemits?

Time-series data is very common in real life.

Data scientist's work consists of:

each new business unit shall build the following wheels again and again

if you are facing these problems, then Pyemits is fit for you

  1. data processing and ETL pipeline
    1. extraction
    2. transformation
      1. cleansing
      2. feature engineering
      3. remove outliers
      4. AI landing for prediction, forecasting
    3. write it back to database
  2. ml framework
    1. multiple model training
    2. multiple model prediction
    3. kfold validation
    4. anomaly detection
    5. forecasting
    6. develop deep learning model (regression)
    7. ensemble modelling
  3. exploratory data analysis
    1. descriptive data analysis
    2. data profile
    3. data set comparison

data scientist need to write different code to develop their model is there a package integrate all ml lib with single simple api? That's why I create this project.

This project is under active development, free to use (Apache 2.0) I am happy to see anyone can contribute for more advancement on features

New feature:

data processing pipeline

db connection and manipulation

Development Progress

Features Progress Available at version Notes
PyOD integration 80% 0.1.2 model parameters config are not yet finished
XGBoost integration 80% 0.1.2 model parameters config are not yet finished
LightGBM integration 80% 0.1.2 model parameters config are not yet finished
Sklearn model integration 80% 0.1.2 model parameters config are not yet finished
Keras integration 100% 0.1.2
Pytorch_lightning integration 100% 0.1.2
MXnet integration 0% tbc
DB connection 0% tbc
aggregation 0% 0.1.3
cleansing 0% 0.1.3
dimensional reduction 0% 0.1.3
Kalman filtering 0% 0.1.3 or later
model evaluation and visualization 0% 0.1.3 or later
data profile for exploration 20% 0.1.3 or later finished data statistics only
forecast at scale 100% 0.1.2 see preprocessing.scaling.py

Release Update

Version Features Notes
0.1 initialization of project
0.1.1 RegTrainer/ParallelTrainer/KFoldCV
0.1.2 PyOD/Keras/Pytorch_lightning/scaling/splitting

Install

pip install pyemits

Features highlight

scikit-learn API style

inherit the design concept of pyecharts, "everything is configurable"

highly flexible configuration items, can easily integrate with existing model

easily integrate to SaaS product for product proof of concept

Easy training

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel

X = np.random.randint(1, 100, size=(1000, 10))
y = np.random.randint(1, 100, size=(1000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()

Accept neural network as model

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

keras_lstm_model = KerasWrapper.from_simple_lstm_model((10, 10), 4)
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

also keep flexibility on customized model

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

model = Sequential()
model.add(LSTM(128,
               activation='softmax',
               input_shape=(10, 10),
               ))
model.add(Dropout(0.1))
model.add(Dense(4))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

keras_lstm_model = KerasWrapper(model, nickname='LSTM')
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

or attach it in algo config

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper
from pyemits.common.config_model import KerasSequentialConfig

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

keras_lstm_model = KerasWrapper(nickname='LSTM')
config = KerasSequentialConfig(layer=[LSTM(128,
                                           activation='softmax',
                                           input_shape=(10, 10),
                                           ),
                                      Dropout(0.1),
                                      Dense(4)],
                               compile=dict(loss='mse', optimizer='adam', metrics=['mse']))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model],
                     [config],
                     raw_data_model,
                     {'fit_config': [dict(epochs=10, batch_size=32)]})
trainer.fit()

PyTorch, MXNet under development you can leave me a message if you want to contribute

MultiOutput training

import numpy as np

from pyemits.core.ml.regression.trainer import RegressionDataModel, MultiOutputRegTrainer
from pyemits.core.preprocessing.splitting import SlidingWindowSplitter

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

# when use auto-regressive like MultiOutput, pls set ravel = True
# ravel = False, when you are using LSTM which support multiple dimension
splitter = SlidingWindowSplitter(24, 24, ravel=True)
X, y = splitter.split(X, y)

raw_data_model = RegressionDataModel(X, y)
trainer = MultiOutputRegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()

Parallel training

import numpy as np

from pyemits.core.ml.regression.trainer import RegressionDataModel, ParallelRegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = ParallelRegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

or you can use RegTrainer for multiple model, but it is not in Parallel job

import numpy as np

from pyemits.core.ml.regression.trainer import RegressionDataModel, RegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

KFold training

import numpy as np

from pyemits.core.ml.regression.trainer import RegressionDataModel, KFoldCVTrainer
from pyemits.common.config_model import KFoldConfig

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = KFoldCVTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model,
                         {'kfold_config': KFoldConfig(n_splits=10)})
trainer.fit()

Easy prediction

import numpy as np
from pyemits.core.ml.regression.trainer import RegressionDataModel, RegTrainer
from pyemits.core.ml.regression.predictor import RegPredictor

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

predictor = RegPredictor(trainer.clf_models, 'RegTrainer')
predictor.predict(RegressionDataModel(X))

Forecast at scale

Data Model

from pyemits.common.data_model import RegressionDataModel
import numpy as np

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 1))

data_model = RegressionDataModel(X, y)

directly write an attribute to the data model

data_model._update_attributes('X_shape', (1000, 10, 10))
data_model.X_shape
>> > (1000, 10, 10)

write something to the meta data

data_model.add_meta_data('dimension', (1000, 10, 10))
data_model.meta_data
>> > {'dimension': (1000, 10, 10)}

Anomaly detection (partial finished)

from pyemits.core.ml.anomaly_detection.predictor import AnomalyPredictor
from pyemits.core.ml.anomaly_detection.trainer import AnomalyTrainer, PyodWrapper
from pyemits.common.data_model import AnomalyDataModel
from pyemits.common.config_model import PyodIforestConfig
from pyod.models.iforest import IForest
from pyod.utils import generate_data

contamination = 0.1  # percentage of outliers
n_train = 1000  # number of training points
n_test = 200  # number of testing points

X_train, y_train, X_test, y_test = generate_data(
    n_train=n_train, n_test=n_test, contamination=contamination)

# highly flexible model config, accept str, PyodWrapper with/without initialized model
# either one
trainer = AnomalyTrainer(['IForest', PyodWrapper(IForest()), PyodWrapper(IForest), 'IForest', 'IForest', 'IForest'],
                         None, AnomalyDataModel(X_train))
trainer = AnomalyTrainer([PyodWrapper(IForest(contamination=0.05)), PyodWrapper(IForest)],
                         [None, PyodIforestConfig(contamination=0.05)], AnomalyDataModel(X_train))
trainer.fit()

# option 1
predictor = AnomalyPredictor(trainer.clf_models)

# option 2
predictor = AnomalyPredictor(trainer.clf_models,
                             other_config={'standard_scaler': predictor.misc_container['standard_scaler']})

# option 3
predictor = AnomalyPredictor(trainer.clf_models,
                             other_config={'standard_scaler': predictor.misc_container['standard_scaler'],
                                           'combination_config': {'n_buckets': 5}}, combination_method='moa')

predictor.predict(AnomalyDataModel(X_test))

Data processing pipeline

it features in the following:

you can embed other function features in the task, but parameter: "data" is required to be passed in

e.g. add email notification, add log, upload to database etc...


from pyemits.core.preprocessing.pipeline import DataNode, NumpyDataNode, PandasDataFrameDataNode, PandasSeriesDataNode,

Pipeline, Step, Task
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random(size=(20, 20)))

dn = PandasDataFrameDataNode.from_pandas(df)

def sum_each_col(data, a=1, b=2):
    return data.sum()

def sum_series(data):
    return np.array([data.sum()])

task registration and arguments registration

task_a = Task(sum_each_col)
task_a.register_args(a=10, b=10)
task_b = Task(sum_series)

pipeline register step and execute

pipeline = Pipeline()

step_a = Step('step_a', [task_a], '')
step_b = Step('step_b', [task_b], '')

pipeline.register_step(step_a)
pipeline.register_step(step_b)
pipeline.execute(dn)

know the steps and its tasks from start to end

pipeline.get_step_task_mapping()
>> > {0: ('test', ['sum_each_col']), 1: ('test1', ['sum_series'])}

know the snapshot result in each steps, each tasks, friendly to data scientist for debugging

pipeline.get_pipeline_snapshot_res(step_id=1,tasks_id=0)
> > > array([197.70351007])

Evaluation (under development)

Ensemble (under development)

IO

dashboard ???

other miscellaneous feature

to be confirmed

....

References

the following libraries gave me some idea/insight

  1. greykit
    1. changepoint detection
    2. model summary
    3. seaonality
  2. pytorch-forecasting
  3. darts
  4. pyaf
  5. orbit
  6. kats/prophets by facebook
  7. sktime
  8. gluon ts
  9. tslearn
  10. pyts
  11. luminaries
  12. tods
  13. autots
  14. pyodds
  15. scikit-hts