giotto-ai / giotto-deep

Deep learning made topological.

Template for Preprocessing Pipeline #85

Closed raphaelreinauer closed 2 years ago

raphaelreinauer commented 2 years ago

Description: When dealing with persistence diagrams as input to machine learning models, one wants a generic way to handle data preprocessing.

One such way is creating a Pipeline, implemented as a chain of transformations. With that design, each transformation takes data as input and returns an output. Users can add new transformations by simply registering them with the pipeline via a register method, so the pipeline implementation itself does not have to change. Furthermore, the pipeline has to be easily storable as a JSON file so that it can be loaded later for inference. This is crucial since the trained model heavily depends on the transformations performed on the input data.

Template implementation:

import json
import jsonpickle
from typing import Callable, Generic, Iterator, List, Optional, TypeVar

T = TypeVar('T')

class Pipeline(Generic[T]):
    """Chain of transformations applied in order; each maps T -> T."""

    _transformations: List[Callable[[T], T]]

    def __init__(self, transformations: Optional[List[Callable[[T], T]]] = None) -> None:
        self._transformations = transformations if transformations is not None else []

    def register(self, transformation: Callable[[T], T]) -> None:
        self._transformations.append(transformation)

    def __call__(self, data: T) -> T:
        for transformation in self._transformations:
            data = transformation(data)
        return data

    def __len__(self) -> int:
        return len(self._transformations)

    def __getitem__(self, index: int) -> Callable[[T], T]:
        return self._transformations[index]

    def __iter__(self) -> Iterator[Callable[[T], T]]:
        return iter(self._transformations)

    def __repr__(self) -> str:
        return f'Pipeline({self._transformations})'

    def __add__(self, other: 'Pipeline[T]') -> 'Pipeline[T]':
        return Pipeline(self._transformations + other._transformations)

    def save_to_json(self, path: str) -> None:
        with open(path, 'w') as f:
            json_transformation = jsonpickle.encode(self)
            json.dump(json_transformation, f)

    @classmethod
    def load_from_json(cls, path: str) -> 'Pipeline[T]':
        with open(path, 'r') as f:
            transformations = json.load(f)
        transform: Pipeline[T] = jsonpickle.decode(transformations)
        return transform

def pipeline(*transformations: Callable[[T], T]) -> Pipeline[T]:
    """
    Creates a pipeline that applies the given transformations in order.
    """
    pipe: Pipeline[T] = Pipeline()
    for transformation in transformations:
        pipe.register(transformation)
    return pipe

Sample usage:

import torch
Tensor = torch.Tensor

def add_one(x: Tensor) -> Tensor:
    return x + 1
def multiply_by_two(x: Tensor) -> Tensor:
    return x * 2

pipe: Pipeline[Tensor] = pipeline(add_one, multiply_by_two)

pipe.save_to_json('pipeline.json')
del pipe
pipe = Pipeline.load_from_json('pipeline.json')

x = torch.tensor([[1, 2], [3, 4]])
y = pipe(x)
matteocao commented 2 years ago

The idea is very relevant, however I think the __call__ method would only work for very simple transformations: let me give you an example. What follows is because I have been working on #79 and I have a design in mind that I want to present to you: the main issue is that transformations cannot be functions (like in your design) but have to be classes. This is because we need, for many of them, to calibrate certain parameters BEFORE we can actually transform the output (e.g. computing the mean and stddev for the normalisation transformation). I think the ideas are compatible though: I see two options:

  1. we actually implement the __matmul__ method as a composition of such transformations, i.e. we compose the different class methods according to the rules we define. However, this may not be possible at the code level.
  2. We build a PreprocessingPipeline class similar to what you propose, but one that can deal with more complex types of transformations described by classes, in which the different transformations have been composed. See, for example, this class. An example:
class PreprocessingPipeline(AbstractPreprocessing, FeatureExtractionMixin):
    """class to compose preprocessing classes"""
    def __init__(self, list_of_cls):
        self.list_of_cls = list_of_cls

    def dataset_level_data(self, data):
        for cls in self.list_of_cls:
            cls.dataset_level_data(data)
            data = cls.batch_transform(data)
        return data

    def batch_transform(self, batch):
        for cls in self.list_of_cls:
            batch = cls.batch_transform(batch)
        return batch

Besides this, the other methods are all reasonable! I will come up with a complete version.
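
For illustration, a minimal sketch of such a class-based transformation (hypothetical names, not the actual giotto-deep API): a normalisation that has to compute its statistics on the whole dataset before it can transform a batch:

import torch

class Normalisation:
    """Hypothetical transform with a calibration step and a transform step."""

    def __init__(self) -> None:
        self.mean = None
        self.stddev = None

    def dataset_level_data(self, data):
        # calibration: compute mean and stddev over the whole dataset
        stacked = torch.stack(list(data))
        self.mean = stacked.mean()
        self.stddev = stacked.std()

    def batch_transform(self, batch):
        # transformation: only meaningful after calibration
        return (batch - self.mean) / self.stddev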

raphaelreinauer commented 2 years ago

The idea is very relevant, however I think the __call__ method would only work for very simple transformations: let me give you an example. What follows is because I have been working on https://github.com/giotto-ai/giotto-deep/issues/79 and I have a design in mind that I want to present to you: the main issue is that transformations cannot be functions (like in your design) but have to be classes. This is because we need, for many of them, to calibrate certain parameters BEFORE we can actually transform the output (e.g. computing the mean and stddev for the normalisation transformation).

They can also be classes implementing the __call__ method. This works because of duck typing.
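
For example, a class exposing __call__ drops straight into the Pipeline template from the first post (a minimal sketch; AddConstant is hypothetical, and multiply_by_two is the function from the sample usage above):

import torch

class AddConstant:
    """An instance is callable, so the Pipeline treats it like a plain function."""

    def __init__(self, constant: float) -> None:
        self.constant = constant

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.constant

pipe: Pipeline[torch.Tensor] = pipeline(AddConstant(1.0), multiply_by_two)
pipe(torch.tensor([1.0, 2.0]))  # tensor([4., 6.])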

matteocao commented 2 years ago

The point is that you have more than one method you need to call on different occasions. One of the methods you need to call can be cast into __call__, but there is not just one.

raphaelreinauer commented 2 years ago

My suggestion would be to have a Transform base class implementing fit_to_dataset and __call__ for transforming both a single persistence diagram and a batch of persistence diagrams (either as a list or a nested list), and have the actual transformations subclass it. This would make the design very easily extensible.

The pipeline class can concatenate a list of Transform objects into a single pipeline which is the same as concatenating a list of callables.

So this would be the interface:

T = TypeVar('T', PersistenceDiagram, PersistenceDiagramBatch)

class Transform:
    def fit_to_dataset(self, diags: PersistenceDiagramDataset) -> None: ...
    def __call__(self, diag: T) -> T: ...

class Pipeline:
    def __init__(self, transforms: Optional[List[Transform]] = None) -> None: ...
    def register(self, transform: Transform) -> None: ...
    def __call__(self, diag: T) -> T: ...
    def __len__(self) -> int: ...
    def save_to_json(self, path: str) -> None: ...
    @classmethod
    def load_from_json(cls, path: str) -> 'Pipeline': ...

see: https://docs.python.org/3/library/typing.html#typing.TypeVar
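
As a concrete illustration of this interface, a minimal sketch (the persistence-diagram types are the placeholders from issue #84, FilterByLifetime is hypothetical, and a diagram is treated here simply as a list of (birth, death) pairs):

class FilterByLifetime(Transform):
    """Hypothetical transform: drops points whose lifetime (death - birth)
    is below a threshold; it is stateless, so fitting is a no-op."""

    def __init__(self, min_lifetime: float) -> None:
        self.min_lifetime = min_lifetime

    def fit_to_dataset(self, diags: 'PersistenceDiagramDataset') -> None:
        pass  # nothing to calibrate

    def __call__(self, diag):
        # assumes a diagram is an iterable of (birth, death) pairs
        return [(b, d) for (b, d) in diag if d - b >= self.min_lifetime]

pipe = Pipeline([FilterByLifetime(0.1)])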

raphaelreinauer commented 2 years ago

The point is that you have more than one method you need to call on different occasions. One of the methods you need to call can be cast into __call__, but there is not just one.

Could you please elaborate on what you mean by "there is not one single method that can be called"? You could do a type check to see whether the input is a single diagram or a batch. If we have proper internal types, this is not a problem (just use isinstance). Also, see my previous answer using TypeVar.
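
For instance, a single __call__ could dispatch on the input type along these lines (a sketch only: _transform_single is a hypothetical helper, and the diagram/batch types are the placeholders from issue #84, assumed constructible from a list of diagrams):

class SomeTransform(Transform):
    def _transform_single(self, diag):
        ...  # the actual single-diagram logic

    def __call__(self, diag):
        if isinstance(diag, PersistenceDiagramBatch):
            # apply the single-diagram logic to every diagram in the batch
            return PersistenceDiagramBatch([self._transform_single(d) for d in diag])
        if isinstance(diag, PersistenceDiagram):
            return self._transform_single(diag)
        raise TypeError(f"Unsupported input type: {type(diag)}")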

matteocao commented 2 years ago

This second proposal is much closer to what I have already built. In particular, a few more comments:

  1. the save_to_json and load_from_json are not associated with the pipeline but with the transform, as transforms can also be called by themselves
  2. the pipeline is also a Transform: hence it saves and loads, but also has __call__ and fit_to_dataset. You cannot expect the Pipeline not to have fit_to_dataset, because most often you want to fit one preprocessing step on the output of the previous one (see the sketch after this list)! For example, imagine a pipeline of transformations like

word embedding --> normalisation

the normalisation has to be fit (i.e. the mean and stddev have to be computed) on the output of the embeddings!! Before that, when the data is just made of strings, having a normalisation does not really make sense.

  3. the two methods I mentioned are indeed a fit and a transform: here you call them fit_to_dataset and __call__, I was calling them dataset_level_data and batch_transform. I am fine with the new naming :) So I propose a base class for the transform, have the pipeline inherit from it, and (which is not present in this issue) have the transforms be associated with and integrated into the data types (this last part is a more general discussion).
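
A minimal sketch of that behaviour, assuming the Transform interface above (names are placeholders, and the dataset is assumed to be iterable item by item): a Pipeline that is itself a Transform, whose fit_to_dataset fits each stage on the data as transformed by the previous stages:

from typing import List, Optional

class Pipeline(Transform):
    def __init__(self, transforms: Optional[List[Transform]] = None) -> None:
        self.transforms = transforms if transforms is not None else []

    def fit_to_dataset(self, data) -> None:
        # each stage is fit on the output of the previous stages, e.g. the
        # normalisation is fit on the already-embedded data, not on raw strings
        for transform in self.transforms:
            transform.fit_to_dataset(data)
            data = [transform(item) for item in data]

    def __call__(self, diag):
        for transform in self.transforms:
            diag = transform(diag)
        return diag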
raphaelreinauer commented 2 years ago

ad 1. I think it would make sense to also have the save/load functionality on the Pipeline, which would internally save/load each Transform in the pipeline. Or is the intention that there should be a way of just saving a pipeline configuration (i.e. creating JSON representations of the arguments, etc.) to easily recreate it later?

ad 2. A Pipeline could also implement the fit/call functionality to add extra convenience. However, having the same type for both "normal" Transforms and Pipelines leads to ambiguities, which might be misunderstood. In my opinion it would make more sense to keep the types separate: Transforms should be used as they are meant to be, and a Pipeline can already be implemented as almost the same thing, but with a different interface (a pipeline-specific one that still supports the fit/call functionality). Fitting a whole pipeline to a dataset makes a lot of sense to me.

ad 3. This is very similar to my point "ad 2.": clarity and precision prevent a lot of mistakes. Therefore I suggest having fixed private names for both the fit and the transform functions in a base Transform class, to be called exactly like this, so we avoid any mix-ups and can write robust code. Both the pipeline and the other kinds of transforms would then work with this one private interface. We could then have some type-safe conversion functions from the public to the private interface that do the actual conversion and are the one place where we worry about these details, so that there are no ambiguities or silly mistakes. We would have a kind of "input layer" that turns messier user input (like different names for the same thing) into something we can work with. Having a fixed private interface on the Transform class would lead to fewer mistakes and make it easy to produce robust code. (see also issue "Create a persistence diagram type #84")
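
A rough sketch of that idea (illustrative names only): the base class fixes the private method names, and the public entry points simply delegate to them:

class Transform:
    """Subclasses override only the fixed private methods _fit and _transform."""

    def _fit(self, dataset) -> None:
        pass  # default: nothing to calibrate

    def _transform(self, diag):
        raise NotImplementedError

    # public interface with fixed behaviour, delegating to the private one
    def fit_to_dataset(self, dataset) -> None:
        self._fit(dataset)

    def __call__(self, diag):
        return self._transform(diag)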

matteocao commented 2 years ago

thanks for the answer @raphaelreinauer!

I think pipelines should be transformations because I have the sklearn paradigm in mind, in which pipelines also have the same methods as a sklearn transformer: this entails that you can work with either (transforms or pipelines) without changing anything. Then, I would not implement __add__ on a transform but on pipelines only. That's why I propose inheritance. Once we agree on the way to proceed, points 1 and 2 above are solved. For 3, I also meant to always keep the naming of those methods fixed for every transform! I was certainly not suggesting to have different classes with different method names. Hence, we shall simply agree on a naming convention.
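
A short sketch of that inheritance (names as in the interface above; only the relevant methods are shown): Pipeline is a Transform, and only Pipeline defines composition via __add__:

class Pipeline(Transform):
    def __init__(self, transforms=None) -> None:
        self.transforms = list(transforms) if transforms is not None else []

    def __add__(self, other: 'Pipeline') -> 'Pipeline':
        # composition is defined on pipelines only, not on single transforms
        return Pipeline(self.transforms + other.transforms)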

matteocao commented 2 years ago

This one is solved. See: https://github.com/giotto-ai/giotto-deep/blob/74b08f81da9dedea12dab01edd5f2dc77e1c7fa0/gdeep/data/preprocessing_pipes.py#L65