Closed: matteocao closed this issue 2 years ago
Thanks, Matteo, for the suggestion. I think I understand what the dataset_level_data is for. But I'm not sure what batch_level_data is for. Could you give an example of what would go in batch_level_data?
When you train a model with preprocessed data, the transformations you apply are crucial for the model; hence you want a way of storing them and loading them later on. This is especially important when you want to deploy your model in a production environment, where you will not have access to the preprocessing transformations. This can be done quickly by inheriting from the Huggingface feature extractor mixin class (https://huggingface.co/docs/transformers/main_classes/feature_extractor). Another advantage of inheriting from that class is that developers are already familiar with it, so they will be able to understand your code more easily. It would also be great to be compatible with the Huggingface API, since developers can then easily take other models from the library and plug them into your framework.
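To make the idea concrete, here is a minimal sketch of the save/load pattern that the Huggingface mixin provides via `save_pretrained`/`from_pretrained`. It deliberately does not import `transformers`; the class and parameter names (`NormalizePreprocessor`, `mean`, `std`) are hypothetical, only the pattern matters:

```python
import json
from pathlib import Path


class NormalizePreprocessor:
    """Toy preprocessor that persists its fitted state as JSON,
    mimicking the Huggingface save_pretrained / from_pretrained pattern."""

    def __init__(self, mean=0.0, std=1.0):
        self.mean = mean
        self.std = std

    def __call__(self, x):
        # the transformation applied at training time, recoverable later
        return (x - self.mean) / self.std

    def save_pretrained(self, directory):
        """Write the preprocessing parameters next to the model."""
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        with open(path / "preprocessor_config.json", "w") as f:
            json.dump({"mean": self.mean, "std": self.std}, f)

    @classmethod
    def from_pretrained(cls, directory):
        """Restore the exact same transformation in production."""
        with open(Path(directory) / "preprocessor_config.json") as f:
            return cls(**json.load(f))
```

Inheriting from the actual `FeatureExtractionMixin` gives you this serialization (plus Hub integration) for free instead of hand-rolling it.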
The preprocessing transforms in gtda.diagrams.preprocessing allow neither normalizing the data nor keeping only the k most persistent points. I also looked at the implementation of filtering by thresholding in gtda.diagrams._utils (https://github.com/giotto-ai/giotto-tda/blob/8d09a39403ca11b50605bf466c1aa9f4f3876e5f/gtda/diagrams/_utils.py#L80), and it seems that their implementation does not work for extended persistence diagrams and one-hot encoded homology dimensions. I also find the implementation hard to follow; it looks much more complicated than it needs to be.
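For plain (birth, death) diagrams the two missing operations are simple to state; a minimal numpy sketch follows. The function names are mine, not gtda's, and this version deliberately ignores exactly the hard cases mentioned above (extended persistence and one-hot homology-dimension columns):

```python
import numpy as np


def keep_k_most_persistent(diagram, k):
    """Keep the k points of an (n, 2) birth/death diagram with the
    largest persistence (death - birth)."""
    persistence = diagram[:, 1] - diagram[:, 0]
    order = np.argsort(persistence)[::-1]  # most persistent first
    return diagram[order[:k]]


def normalize_diagram(diagram):
    """Rescale births and deaths so the maximal persistence is 1."""
    max_persistence = np.max(diagram[:, 1] - diagram[:, 0])
    return diagram / max_persistence
```

Extending this to one-hot encoded homology dimensions would mean computing persistence only on the first two columns and carrying the remaining columns through the selection unchanged.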
Is your feature request related to a problem? Please describe.
The pain is that, most often, plain datasets are not in the right input format or do not have the desired statistical characteristics. Furthermore, standard techniques like data augmentation need to be implemented.

Describe the solution you'd like
We build an API class (`AbstractClass`) for the preprocessing -- a generic one. It should look similar to this one:
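The snippet referred to here did not survive extraction, so the following is only a guess at the shape of such a class, built from the method names mentioned in this issue (`item_transform`, `dataset_level_data`, `batch_level_data`); the class names and the wiring are hypothetical:

```python
from abc import ABC, abstractmethod


class AbstractPreprocessing(ABC):
    """Hypothetical generic preprocessing API."""

    @abstractmethod
    def dataset_level_data(self, dataset):
        """Compute and store in self whatever needs the whole dataset
        (e.g. global statistics). Called once, before the first
        __getitem__."""

    @abstractmethod
    def batch_level_data(self, batch):
        """Compute and store in self whatever needs a batch of data.
        Also called once, before the first __getitem__."""

    @abstractmethod
    def item_transform(self, item):
        """Transform a single item; applied inside Dataset.__getitem__."""


class PreprocessedDataset:
    """Toy Dataset showing how the hooks could be wired together."""

    def __init__(self, data, preprocessing):
        self.data = data
        self.preprocessing = preprocessing
        self._initialized = False

    def __getitem__(self, idx):
        if not self._initialized:
            # one-off hooks, run before the first item is served
            self.preprocessing.dataset_level_data(self.data)
            self.preprocessing.batch_level_data(self.data[:2])
            self._initialized = True
        return self.preprocessing.item_transform(self.data[idx])


class Center(AbstractPreprocessing):
    """Example subclass: mean-centering, with the mean computed
    at dataset level and stored in self."""

    def dataset_level_data(self, dataset):
        self.mean = sum(dataset) / len(dataset)

    def batch_level_data(self, batch):
        pass  # nothing batch-level needed for this example

    def item_transform(self, item):
        return item - self.mean
```

Storing the fitted quantities in `self` is what later makes them serializable via the Huggingface mixin pattern discussed above.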
Each of the methods shall be implemented, as it will be called automatically inside the `Dataset` classes: `__getitem__` will be transformed by `item_transform`. The data inside `item_transform` that are needed to perform the transformation will be stored in `self`. The methods `dataset_level_data` and `batch_level_data` will be called only once, before the first time that `__getitem__` is called.

Describe alternatives you've considered
Only doing point 3 above (without 1 and 2). However, I find that it is always possible to use only that approach, and it is much easier to implement and less bound to the generic pipeline.
Additional context