The idea is very relevant, however I think the __call__ method would only work for very simple transformations: let me give you an example.
What follows is because I have been working on #79 and I have a design in mind that I want to present to you: the main issue is that transformations cannot be functions (like in your design) but have to be classes. This is because we need, for many of them, to calibrate certain parameters BEFORE we can actually transform the output (e.g. computing the mean and stddev for the normalisation transformation).
I think the ideas are compatible though: I see two options:

1. an __matmul__ method as a composition of such transformations, i.e. we compose the different class methods according to the rules we define. However, this may not be possible at the code level;
2. a PreprocessingPipeline class, similar to what you propose, but which can deal with the more complex types of transformation described by classes, in which the different transformations have been composed. See, for example, this class. An example:

class PreprocessingPipeline(AbstractPreprocessing, FeatureExtractionMixin):
"""class to compose preprocessing classes"""
def __init__(self, list_of_cls):
self.list_of_cls = list_of_cls
def dataset_level_data(self, data:
for cls in self.list_of_cls:
cls.dataset_level_data(data)
data = cls.batch_transform(data)
def batch_transform(self, batch):
for cls in self.list_of_cls:
batch = cls.batch_transform(batch)
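For instance, usage could look like this (Normalisation and Filtration are hypothetical preprocessing classes, only meant to illustrate the composition):

pipeline = PreprocessingPipeline([Filtration(), Normalisation()])
pipeline.dataset_level_data(dataset)            # calibrate, e.g. compute mean and stddev
transformed = pipeline.batch_transform(batch)   # apply the composed transformations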
Besides this, the other methods are all reasonable! I will come up with a complete version.
They can also be classes implementing the __call__ method. This works because of duck typing.
The point is that you have more than one method you need to call on different occasions. One of the methods can be cast into __call__, but there is not just one.
My suggestion would be to have a Transform base class implementing fit_to_dataset and __call__ for transforming both a single persistence diagram and a batch of persistence diagrams (either as a list or a nested list), and have the actual transformations subclass it. This would make the design very easily extensible.
The Pipeline class can concatenate a list of Transform objects into a single pipeline, which is the same as concatenating a list of callables.
So this would be the interface:
T = TypeVar('T', PersistenceDiagram, PersistenceDiagramBatch)

class Transform:
    def fit_to_dataset(self, diags: PersistenceDiagramDataset) -> None: ...
    def __call__(self, diag: T) -> T: ...

class Pipeline:
    def __init__(self, transforms: Optional[List[Transform]] = None) -> None: ...
    def register(self, transform: Transform) -> None: ...
    def __call__(self, diag: T) -> T: ...
    def __len__(self) -> int: ...
    def save_to_json(self, path: str) -> None: ...

    @classmethod
    def load_from_json(cls, path: str) -> "Pipeline": ...
see: https://docs.python.org/3/library/typing.html#typing.TypeVar
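To make this concrete, a subclass could look like the sketch below. The DiagramNormalisation name and the representation of a diagram as a list of (birth, death) pairs are only illustrative assumptions, pending an actual persistence diagram type:

class DiagramNormalisation(Transform):
    """Centre and rescale the coordinates of a persistence diagram."""

    def __init__(self) -> None:
        self.mean = 0.0
        self.std = 1.0

    def fit_to_dataset(self, diags: PersistenceDiagramDataset) -> None:
        # calibrate mean and stddev BEFORE any transformation is applied
        values = [coord for diag in diags for point in diag for coord in point]
        self.mean = sum(values) / len(values)
        self.std = (sum((v - self.mean) ** 2 for v in values) / len(values)) ** 0.5

    def __call__(self, diag: T) -> T:
        # batch handling is omitted here for brevity
        return [tuple((coord - self.mean) / self.std for coord in point) for point in diag]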
Could you please elaborate on what you mean by "There is not one single method that can be called"?
You could do a type check to see whether it's a single diagram or a batch. If we have proper internal types this is not a problem (just using isinstance; see the sketch below).
Also, see my previous answer using TypeVar.
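A minimal sketch of that dispatch (assuming the PersistenceDiagram and PersistenceDiagramBatch types from the interface above, and a hypothetical per-diagram helper _transform_single):

class Transform:
    def __call__(self, diag: T) -> T:
        # dispatch on the internal type: a single diagram vs. a batch of diagrams
        if isinstance(diag, PersistenceDiagramBatch):
            # assumes the batch type is iterable and constructible from a list
            return PersistenceDiagramBatch([self(d) for d in diag])
        return self._transform_single(diag)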
This second proposal is much closer to what I have already built. In particular, a few more comments:

- __call__ and fit_to_data: you cannot expect the Pipeline not to have fit_to_data, because most often you want to fit one preprocessing step on the output of the previous one! For example, imagine a pipeline of transformations like word embedding --> normalisation: the normalisation has to be fit (i.e. the mean and stddev are to be computed) on the output of the embeddings!! Before that, when the data is just made of strings, having a normalisation does not really make sense. (See the sketch after this list.)
- as for fit_to_data and __call__, I was calling them dataset_level_data and batch_transform. I am fine with the new naming :)
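A rough sketch of that sequential fitting, using the proposed names (this only illustrates the point and assumes the Transform interface and typing imports from above; it is not a final implementation):

class Pipeline(Transform):
    def __init__(self, transforms: Optional[List[Transform]] = None) -> None:
        self.transforms = list(transforms) if transforms is not None else []

    def fit_to_dataset(self, diags: PersistenceDiagramDataset) -> None:
        # each transform is fitted on the output of all the previous ones,
        # e.g. the normalisation is fitted on the embedded data, not on the raw strings
        for transform in self.transforms:
            transform.fit_to_dataset(diags)
            diags = [transform(d) for d in diags]  # assumes the dataset is iterable

    def __call__(self, diag: T) -> T:
        for transform in self.transforms:
            diag = transform(diag)
        return diag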
So I propose a base class for the transform, have the Pipeline inherit from the transform, and -- which is not present in this issue -- have the transforms be associated with and integrated into the data types (this last part is a more general discussion).

ad 1. I think it would make sense to also have the save/load functionality for the Pipeline internally save/load each Transform in the pipeline. Or is the intention that there should be a way of just saving a pipeline configuration (i.e. creating JSON representations of arguments, etc.) to easily recreate it later?
ad 2. A Pipeline could also implement the fit/call functionality for extra convenience. However, having the same type for both "normal" Transforms and Pipelines leads to ambiguities, which might be misunderstood. In my opinion it would make more sense to keep the types separate: Transforms should be used as they are meant to be, and a Pipeline can already be implemented as almost the same thing, but with a different, pipeline-specific interface that still supports the fit/call functionality. Fitting a whole pipeline to a dataset makes a lot of sense to me.
ad 3. This is very similar to my point "ad 2.": clarity and precision prevent us from making a lot of mistakes. Therefore I suggest having fixed private names for both the fit and the transform functions in a base Transform class, to be called exactly like this, so we can avoid any mix-ups and write robust code. Both the pipeline and the other kinds of transforms would then work with this one private interface. We could then have some type-safe conversion functions from the public to the private interface; these do the actual conversion and are the one place where we worry about the last details, so that the rest of the code does not contain any ambiguities or stupid mistakes. We would have a kind of "input layer" that turns messier user input (like different names for the same thing) into something we can work with. Having a fixed private interface for the Transform class would lead to fewer mistakes and make it easy to produce robust code. (See also issue "Create a persistence diagram type" #84.)
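A rough sketch of that split between public and private interface (the _fit and _transform names are purely illustrative):

class Transform:
    # public interface: the "input layer" that validates/converts user input
    def fit_to_dataset(self, diags: PersistenceDiagramDataset) -> None:
        self._fit(diags)

    def __call__(self, diag: T) -> T:
        return self._transform(diag)

    # fixed private interface that every concrete transform overrides
    def _fit(self, diags: PersistenceDiagramDataset) -> None:
        raise NotImplementedError

    def _transform(self, diag: T) -> T:
        raise NotImplementedError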
thanks for the answer @raphaelreinauer!
I think pipelines should be transformations because I have in mind the sklearn paradigm, in which pipelines have the same methods as a sklearn transformer: this entails that you can work with either (transforms or pipelines) without changing anything. Then, I would not implement __add__ on a transform but on pipelines only. That's why I propose inheritance (a small sketch follows below). Once we agree on the way to proceed, then 1 and 2 above are solved.
for 3 I also meant to always keep the naming of those methods fixed for every transform! I was for sure not suggesting to have different classes with different method names. Hence, we shall simply agree on a naming convention.
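To make the interchangeability concrete, a small sketch building on the Pipeline sketched earlier (DiagramNormalisation and DiagramFiltering are hypothetical transforms):

class Pipeline(Transform):
    # fit_to_dataset and __call__ as sketched in the previous comments

    def __add__(self, other: "Pipeline") -> "Pipeline":
        # concatenation is defined on pipelines only, not on single transforms
        return Pipeline(self.transforms + other.transforms)

# since a Pipeline is itself a Transform, it can also be nested inside another pipeline
combined = Pipeline([DiagramNormalisation()]) + Pipeline([DiagramFiltering()])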
Description: When dealing with persistence diagrams as input to machine learning models, one wants a generic way to deal with data preprocessing.
One such way is creating a Pipeline. This can be done with a generic Pipeline implemented as a chain of transformations, where each transformation takes data as input and returns an output. Users can easily add new transformations by registering them with a register method, so that the implementation of the Pipeline itself does not have to change. Furthermore, the pipeline has to be easily storable as a JSON file such that it can be loaded later for inference. This is crucial since the trained model heavily depends on the transformations performed on the input data.
Template implementation:
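The original snippet is not preserved in this extract; a minimal sketch consistent with the interface discussed above could look as follows (the JSON handling is simplified, and TRANSFORM_REGISTRY, a mapping from class names to Transform classes, is a hypothetical helper):

import json
from typing import List, Optional

class Pipeline:
    """Chain of Transform objects applied in order (sketch only)."""

    def __init__(self, transforms: Optional[List[Transform]] = None) -> None:
        self.transforms = list(transforms) if transforms is not None else []

    def register(self, transform: Transform) -> None:
        # users add new transformations here; the Pipeline itself never changes
        self.transforms.append(transform)

    def __call__(self, diag: T) -> T:
        for transform in self.transforms:
            diag = transform(diag)
        return diag

    def __len__(self) -> int:
        return len(self.transforms)

    def save_to_json(self, path: str) -> None:
        # store each transform's class name and fitted attributes;
        # assumes those attributes are JSON-serialisable
        config = [{"name": type(t).__name__, "state": vars(t)} for t in self.transforms]
        with open(path, "w") as f:
            json.dump(config, f)

    @classmethod
    def load_from_json(cls, path: str) -> "Pipeline":
        with open(path) as f:
            config = json.load(f)
        transforms = []
        for entry in config:
            transform = TRANSFORM_REGISTRY[entry["name"]]()
            transform.__dict__.update(entry["state"])
            transforms.append(transform)
        return cls(transforms)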
Sample usage:
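Likewise, the original usage snippet is not preserved; with the sketch above it might look like this (DiagramNormalisation and DiagramFiltering are hypothetical transforms):

pipeline = Pipeline()
pipeline.register(DiagramNormalisation())
pipeline.register(DiagramFiltering())

processed = pipeline(diagram)                  # apply the whole chain to a diagram (or a batch)
pipeline.save_to_json("preprocessing.json")    # persist, so the same preprocessing can be reused at inference
restored = Pipeline.load_from_json("preprocessing.json")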