Generic preprocessing module

matteocao commented 2 years ago

Is your feature request related to a problem? Please describe.

The pain is that, most often, plain datasets are not in the right input format or do not have the designed statistical caracterisrtics. Furthermore, standard techniques like data augmentation, need to be implemented Describe the solution you'd like

We build an API class (AbstractClass ) for the preprocessing -- a generic one.

It should look similar to this one:

from abc import ABC, abstractmethod

class AbstractPreprocessing(ABC):
    """The abstract class to define the interface of preprocessing
    """
    @abstractmethod
    def __call__(self, *args, **kwargs):
        """This method deals with datum-wise transformations"""
        pass

    @abstractmethod
    def fit_to_data(self, *args, **kwargs):
        """This method deals with getting dataset-level information"""
        pass

Each of the methods shall be implemented, as it will be called automatically inside the Dataset classes:

the output of __getitem__ will be transformed by item_transform. the data inside item_transform that are needed to perform the transformation, will be stored in self. The methods dataset_level_data and batch_level_data will be called only once, before the first time that __getitem__ is called.
the big advantage of this "on the fly" approach is that it will save a lot of memory -- and hopefully the transformations are not to heavy to compute
In case the transformations are computationally very heavy, then it would be advisable first to transform all the data (not via this class! just create a new dataset) and then use that for the next steps.

Describe alternatives you've considered

Only doing point 3 above (without 1 and 2), however I find it is always possible to only use that approach and it is much easier to implement and is less bind to the generic pipeline

Additional context

raphaelreinauer commented 2 years ago

Thanks, Matteo, for the suggestion. I think I understand what the dataset_level_data is for. But I'm not sure what batch_level_data is for. Could you give an example of what would go in batch_level_data?

When you train a model with preprocessed data, the transformations you apply are crucial for the model; hence you want to have a way of storing them and loading them later on. This is especially important when you want to deploy your model in a production environment, where you will not have access to the preprocessing transformations. This can be quickly done by inheriting from the Huggingface feature extractor mixin class and adding a https://huggingface.co/docs/transformers/main_classes/feature_extractor. Another advantage of inheriting from that class is that developers are already familiar with it, and they will be able to understand your code more easily. It would be great to be compatible with the Huggingface API since developers can efficiently other models from the library and plug them into your framework.

raphaelreinauer commented 2 years ago

The preprocessing transforms in gtda.diagrams.preprocessing don't allow for normalizing the data as well as filtering out the k-most persistent points. I also looked at the implementation of the filtration by thresholding in gtda.homology._utils https://github.com/giotto-ai/giotto-tda/blob/8d09a39403ca11b50605bf466c1aa9f4f3876e5f/gtda/diagrams/_utils.py#L80 and it seems like their implementation does not work for extended persistence diagrams and one-hot encoded homology dimensions. I also don't understand the implementation; it looks much more complicated than it needs to be.

giotto-ai / giotto-deep

Generic preprocessing module #79