pluskal-lab / DreaMS

DreaMS (Deep Representations Empowering the Annotation of Mass Spectra)
https://dreams-docs.readthedocs.io
MIT License

Variable-length peaks in hdf5 #4

Open tornikeo opened 1 month ago

tornikeo commented 1 month ago

Is there any reason to avoid using variable-length arrays for peaks in hdf5, in favor of zero-padding?

I'm not aware of any performance or disk-space costs of variable-length arrays, and being able to choose the number of peaks after the initial DB generation could make the data more flexible to work with.
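For readers unfamiliar with the feature, here is a minimal sketch of variable-length peak storage using h5py's `vlen_dtype`. The dataset names `mzs` and `intensities`, the file name, and the example values are hypothetical and are not the layout DreaMS uses.

```python
import h5py
import numpy as np

# Hypothetical spectra with different peak counts (2 and 3 peaks).
mz_values = [
    np.array([100.1, 150.2], dtype=np.float32),
    np.array([80.0, 120.5, 300.3], dtype=np.float32),
]
intensity_values = [
    np.array([0.5, 1.0], dtype=np.float32),
    np.array([0.2, 0.7, 1.0], dtype=np.float32),
]

# The in-memory "core" driver keeps this sketch from writing anything to disk.
with h5py.File("vlen_demo.hdf5", "w", driver="core", backing_store=False) as f:
    # Each element of a vlen dataset is a 1D float32 array of arbitrary length.
    vlen_f32 = h5py.vlen_dtype(np.dtype("float32"))
    mzs = f.create_dataset("mzs", shape=(len(mz_values),), dtype=vlen_f32)
    intensities = f.create_dataset(
        "intensities", shape=(len(intensity_values),), dtype=vlen_f32
    )
    for i in range(len(mz_values)):
        mzs[i] = mz_values[i]
        intensities[i] = intensity_values[i]

    # Reading an element returns an array with that spectrum's own length.
    peak_counts = [len(mzs[i]) for i in range(len(mz_values))]

print(peak_counts)
```

With this layout the number of peaks per spectrum is decided at read time rather than when the file is written. One caveat worth checking before switching: as far as I know, HDF5 compression filters do not compress the variable-length data itself, so disk usage may differ from the padded layout.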

roman-bushuiev commented 1 month ago

Hi @tornikeo!

I wasn't aware of variable-length arrays when I was constructing the dataset. It would definitely be interesting to experiment with them! However, zero-padding has the advantage of being directly compatible with the torch DataLoader: every spectrum has the same shape, so the HDF5 file can be fed into a neural network without a custom collate step. Additionally, HDF5 compresses the padding zeros effectively, so the padding doesn't take up much extra disk space.
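To illustrate the point about zero-padding, here is a rough sketch of a padded, gzip-compressed HDF5 dataset read through a torch `DataLoader`. The `PaddedSpectra` class, the `spectrum` key, the `(2, MAX_PEAKS)` layout, and the sizes are assumptions made for the example, not the actual DreaMS code or GeMS file layout.

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

N_SPECTRA, MAX_PEAKS = 8, 16  # illustrative sizes only


class PaddedSpectra(Dataset):
    """Hypothetical wrapper exposing a fixed-shape HDF5 dataset to torch."""

    def __init__(self, h5_file, key="spectrum"):
        self.data = h5_file[key]

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Every item has shape (2, MAX_PEAKS), so default collation can stack them.
        return torch.from_numpy(self.data[idx])


with h5py.File("padded_demo.hdf5", "w", driver="core", backing_store=False) as f:
    # Row 0 holds m/z values, row 1 intensities; unused slots stay zero.
    padded = np.zeros((N_SPECTRA, 2, MAX_PEAKS), dtype=np.float32)
    padded[:, :, :3] = 1.0  # pretend each spectrum has 3 real peaks
    f.create_dataset("spectrum", data=padded, compression="gzip")

    loader = DataLoader(PaddedSpectra(f), batch_size=4)
    batch = next(iter(loader))
    batch_shape = tuple(batch.shape)
    # Real peaks can be recovered from the padding via nonzero intensities.
    n_peaks = (batch[:, 1, :] > 0).sum(dim=1)

print(batch_shape, n_peaks.tolist())
```

The padded layout needs no custom `collate_fn`, whereas variable-length items would have to be padded or packed at batch time. The cost is that the maximum number of peaks is fixed when the file is written.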

By the way, the GeMS dataset is now publicly available on HuggingFace Hub. The GeMS_A10.hdf5 file is a subset that we used for pre-training. I will upload more subsets this week.