onnx / sklearn-onnx

Convert scikit-learn models and pipelines to ONNX
Apache License 2.0

Convert ISO 8601 datetime format to a numeric #999

Open robingenz opened 1 year ago

robingenz commented 1 year ago

I am currently working on a model that takes as input, among other data, a string in ISO 8601 datetime format. This string should be converted into a (numeric) timestamp using a converter.

Example:

The sklearn pipeline looks like this:

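import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Name of the column holding the ISO 8601 strings; used by the transformer below.
created_date_column_name = 'CreatedDate'
timestamp_column_indices = [
    created_date_column_name
]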

class TimestampTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[created_date_column_name] = pd.to_datetime(X[created_date_column_name])
        X[created_date_column_name] = X[created_date_column_name].astype(np.int64) // 10**9
        return X

column_transformer = ColumnTransformer(transformers=[
    ('timestamp', TimestampTransformer(), timestamp_column_indices)
], remainder='passthrough')
classifier = RandomForestClassifier()
clr_pipeline = Pipeline([
    ('column_transformer', column_transformer),
    ('classifier', classifier),
])

(Unnecessary columns have been removed for clarity).
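For reference, the conversion that the transformer performs can be reproduced with the standard library alone. This is only a sketch of the arithmetic, not part of the pipeline: the function name `iso8601_to_epoch_seconds` is made up for illustration, and treating strings without a UTC offset as UTC is an assumption chosen to mirror what the pandas-based version appears to do.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso8601_to_epoch_seconds(value: str) -> int:
    """Convert an ISO 8601 string to whole seconds since the Unix epoch."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: strings without an offset are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Floor division by one second mirrors `// 10**9` on nanosecond values.
    return (parsed - EPOCH) // timedelta(seconds=1)


print(iso8601_to_epoch_seconds("2023-01-01T00:00:00+00:00"))  # 1672531200
print(iso8601_to_epoch_seconds("2023-01-01T01:00:00+01:00"))  # 1672531200
```

Both calls describe the same instant, so they yield the same number. This string-to-integer step is what would have to be expressed as an ONNX operator.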

With the help of the TimestampTransformer, the ISO 8601 datetime string is converted into a timestamp. Unfortunately, I get the following error message when exporting the model to ONNX format:

Unable to find a shape calculator for type '<class '__main__.TimestampTransformer'>'.
It usually means the pipeline being converted contains a
transformer or a predictor with no corresponding converter
implemented in sklearn-onnx. If the converted is implemented
in another library, you need to register
the converted so that it can be used by sklearn-onnx (function
update_registered_converter). If the model is not yet covered
by sklearn-onnx, you may raise an issue to
https://github.com/onnx/sklearn-onnx/issues
to get the converter implemented or even contribute to the
project. If the model is a custom model, a new converter must
be implemented. Examples can be found in the gallery.

I understand the problem and have also read through the documentation on how to implement a new converter. Unfortunately, I have no idea what the best way to start is. I am very new to the ONNX format and hope someone can give me a hint on how to solve this problem.

xadupre commented 1 year ago

Unfortunately, there is no operator taking a string and returning numerical information like you need, and no way to do that with the existing operators. So you would need to introduce a new operator to ONNX. It can go into the onnx repository, but it needs to be approved by the community, and you may need to attend one of the SIG meetings. Alternatively, it can be a custom operator implemented in Python (see onnxruntime-extensions: https://github.com/microsoft/onnxruntime-extensions/blob/main/docs/custom_ops.md) or in C++, depending on where you need to deploy.

Once it is done, a new converter needs to be registered in sklearn-onnx to convert your custom transformer.
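A rough sketch of what that registration step might look like, assuming such an operator existed. The operator name `IsoToTimestamp` and the `ai.onnx.contrib` domain are placeholders, not a real operator, and `register_timestamp_converter` is a hypothetical helper. `update_registered_converter` is the function named in the error message above. The skl2onnx imports are kept inside the functions only so the snippet can be loaded without the library installed.

```python
# Placeholder names (assumptions): no such operator is confirmed to exist.
CUSTOM_OP_TYPE = "IsoToTimestamp"
CUSTOM_OP_DOMAIN = "ai.onnx.contrib"


def timestamp_shape_calculator(operator):
    """Declare the output as an int64 tensor with the input's shape."""
    from skl2onnx.common.data_types import Int64TensorType

    operator.outputs[0].type = Int64TensorType(operator.inputs[0].type.shape)


def timestamp_converter(scope, operator, container):
    """Emit one node mapping the string input to the numeric output."""
    container.add_node(
        CUSTOM_OP_TYPE,
        [operator.inputs[0].full_name],
        [operator.outputs[0].full_name],
        op_domain=CUSTOM_OP_DOMAIN,
        name=scope.get_unique_operator_name(CUSTOM_OP_TYPE),
    )


def register_timestamp_converter(model_class):
    """Tell sklearn-onnx which functions handle `model_class`."""
    from skl2onnx import update_registered_converter

    update_registered_converter(
        model_class,
        "CustomTimestampTransformer",  # alias: any unique string
        timestamp_shape_calculator,
        timestamp_converter,
    )
```

Calling `register_timestamp_converter(TimestampTransformer)` before the export should make the "Unable to find a shape calculator" error go away, but the resulting model would only run in a runtime that actually implements the placeholder operator.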

xadupre commented 1 year ago

You should follow this PR: https://github.com/onnx/onnx/pull/5417. Once it is merged, it will be part of the ONNX standard and onnxruntime will implement it.