The issue with gtda.diagrams.preprocessing and sklearn.base.TransformerMixin is that they work well only when the whole np.ndarray or torch.Tensor can be loaded into memory, which is not the case here: I have many persistence diagrams that do not fit into memory at once. The sklearn paradigm of calling fit and transform on whole arrays or tensors therefore does not apply, since I have to perform the operations in batches.
Instead, I want to follow the PyTorch / Hugging Face paradigm of preprocessing the data as it is loaded through a PyTorch DataLoader. The main goal is a preprocessing module that can be applied to single persistence diagrams as well as to batches, so that it can be used either in the `__getitem__` or the `collate_fn` of a torch.utils.data.Dataset, similar to Hugging Face feature extractors (https://huggingface.co/docs/transformers/main_classes/feature_extractor). Furthermore, the fitted preprocessing parameters should be saveable and loadable, because the trained model depends on the preprocessing applied to the training data; at inference time it is crucial to load exactly the same preprocessing parameters and apply them to the input data as well.
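To make this concrete, here is a minimal sketch of the kind of interface I have in mind. Everything in it is a hypothetical assumption on my part: the class names `DiagramPreprocessor` and `DiagramDataset`, the `scale` parameter, and the simple normalization used as a stand-in transformation are illustrative only and are not existing giotto-tda or Hugging Face API.

```python
import json

import torch
from torch.utils.data import DataLoader, Dataset


class DiagramPreprocessor:
    """Hypothetical sketch: normalize birth/death coordinates of persistence diagrams.

    Parameters are fitted once (e.g. on the training set) and can be saved and
    reloaded so that inference applies exactly the same transformation.
    """

    def __init__(self, scale=1.0):
        self.scale = scale

    def fit(self, diagrams):
        # diagrams: iterable of (n_points, 3) tensors with columns [birth, death, dim]
        self.scale = max(float(d[:, :2].abs().max()) for d in diagrams)
        return self

    def __call__(self, diagram_or_batch):
        # Works on a single (n_points, 3) diagram or a (batch, n_points, 3) batch.
        out = diagram_or_batch.clone()
        out[..., :2] = out[..., :2] / self.scale
        return out

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"scale": self.scale}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))


class DiagramDataset(Dataset):
    """Applies the preprocessor lazily, one diagram at a time, in __getitem__."""

    def __init__(self, diagrams, preprocessor):
        self.diagrams = diagrams
        self.preprocessor = preprocessor

    def __len__(self):
        return len(self.diagrams)

    def __getitem__(self, idx):
        return self.preprocessor(self.diagrams[idx])


if __name__ == "__main__":
    diagrams = [torch.rand(16, 3) for _ in range(100)]  # toy diagrams
    prep = DiagramPreprocessor().fit(diagrams)          # fit on training data
    prep.save("preprocessor.json")                      # reload the same parameters at inference time
    loader = DataLoader(DiagramDataset(diagrams, prep), batch_size=8)
    batch = next(iter(loader))                          # shape (8, 16, 3)
```

The same preprocessor object could equally be applied inside a `collate_fn` to transform whole batches at once, and the save/load pair mirrors how Hugging Face feature extractors persist their configuration for inference.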
I hope this answers your question. If not, I can try to be more specific.
@raphaelreinauer can you please provide a quick description of this issue? In particular, what preprocessing do you need, and why are you not using, for example, the preprocessing tools for persistence diagrams available in giotto-tda?