Variable-length peaks in hdf5

pluskal-lab / DreaMS

DreaMS (Deep Representations Empowering the Annotation of Mass Spectra)

MIT License

Hi @tornikeo!

I wasn't aware of variable-length arrays when I was constructing the dataset. It would definitely be interesting to experiment with them! However, zero-padding has the advantage of being directly compatible with the torch DataLoader: every spectrum has the same shape, so the HDF5 file can be fed into a neural network without custom batching code. Additionally, HDF5 compresses the padding zeros effectively, so they add little to the file size on disk.
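To make the trade-off concrete, here is a minimal sketch of the two storage layouts using h5py. It is an illustration only, not the code used to build GeMS: the dataset names (`spectrum_padded`, `mz_vlen`), the padded length `MAX_PEAKS`, the peak counts, and the random data are all assumptions made for the example.

```python
import h5py
import numpy as np

MAX_PEAKS = 128            # hypothetical padded length, not taken from GeMS
lengths = [5, 17, 40, 9]   # hypothetical number of peaks per spectrum

rng = np.random.default_rng(0)
# Each toy spectrum is an (n_peaks, 2) array of (m/z, intensity) pairs.
spectra = [rng.random((n, 2), dtype=np.float32) for n in lengths]


def pad_spectrum(peaks, max_peaks=MAX_PEAKS):
    """Zero-pad one (n_peaks, 2) spectrum to a fixed (max_peaks, 2) shape."""
    padded = np.zeros((max_peaks, 2), dtype=np.float32)
    padded[: len(peaks)] = peaks
    return padded


padded = np.stack([pad_spectrum(s) for s in spectra])

# An in-memory HDF5 file (core driver, no backing store) keeps the demo off disk.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    # Layout 1: fixed-shape, zero-padded, gzip-compressed.
    dset = f.create_dataset(
        "spectrum_padded", data=padded, compression="gzip", chunks=True
    )
    f.flush()
    raw_bytes = padded.nbytes
    stored_bytes = dset.id.get_storage_size()
    # Any slice is already a regular array that can become a batch tensor.
    batch = dset[0:4]

    # Layout 2: variable-length rows, one 1-D float32 array per spectrum.
    vlen_dtype = h5py.vlen_dtype(np.dtype("float32"))
    mz = f.create_dataset("mz_vlen", shape=(len(spectra),), dtype=vlen_dtype)
    for i, s in enumerate(spectra):
        mz[i] = s[:, 0]
    # Rows come back with different lengths, so they cannot be stacked directly.
    ragged = [mz[i] for i in range(len(spectra))]

print("padded batch shape:", batch.shape)
print("raw bytes:", raw_bytes, "stored bytes:", stored_bytes)
print("ragged row lengths:", [len(r) for r in ragged])
```

With the padded layout, a slice of the dataset is a regular array, which is what PyTorch's default batching expects. With the variable-length layout, each row has its own length, so a DataLoader would need a custom `collate_fn` (for example, one that pads per batch).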

By the way, the GeMS dataset is now publicly available on HuggingFace Hub. The GeMS_A10.hdf5 file is a subset that we used for pre-training. I will upload more subsets this week.

Variable-length peaks in hdf5 #4