Opened by tornikeo 1 month ago (status: Open)
Hi @tornikeo!
I wasn't aware of variable-length arrays when I was constructing the dataset. It would definitely be interesting to experiment with them! However, zero-padding has the advantage of being directly compatible with the PyTorch DataLoader: every spectrum has the same shape, so the HDF5 file can be fed into a neural network as-is. Additionally, HDF5 compresses the padding zeros effectively, so the padding doesn't add much to the file size.
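To illustrate the point about flexibility with zero-padding, here is a minimal NumPy sketch. It is not the actual GeMS layout: the separate `mz` and `intensity` arrays, the helper names `pad_peaks` and `top_k_peaks`, and the toy values are all assumptions for illustration. It also assumes real peaks always have intensity greater than zero, so that padding can be told apart from data.

```python
import numpy as np

def pad_peaks(spectra, max_peaks):
    """Stack variable-length 1-D arrays into one zero-padded 2-D array."""
    out = np.zeros((len(spectra), max_peaks), dtype=np.float32)
    for i, peaks in enumerate(spectra):
        n = min(len(peaks), max_peaks)
        out[i, :n] = peaks[:n]
    return out

def top_k_peaks(mz, intensity, k):
    """Keep the k most intense peaks per spectrum (result is still zero-padded)."""
    # Sort by descending intensity; padding zeros end up last.
    order = np.argsort(-intensity, axis=1, kind="stable")[:, :k]
    return (np.take_along_axis(mz, order, axis=1),
            np.take_along_axis(intensity, order, axis=1))

# Hypothetical toy spectra with 3, 1 and 4 peaks.
mzs = [np.array([100.0, 150.5, 200.2]),
       np.array([80.1]),
       np.array([50.0, 60.0, 70.0, 90.0])]
ints = [np.array([0.2, 1.0, 0.5]),
        np.array([1.0]),
        np.array([0.1, 0.9, 0.3, 1.0])]

mz = pad_peaks(mzs, max_peaks=5)
intensity = pad_peaks(ints, max_peaks=5)

# Recover the true peak count per spectrum from the padding.
n_real = (intensity > 0).sum(axis=1)

# Choose a smaller peak budget after the padded arrays were built.
mz_k, int_k = top_k_peaks(mz, intensity, k=2)
```

Under these assumptions, the number of peaks can still be reduced after the database is generated, although it cannot be increased beyond the padded length.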
By the way, the GeMS dataset is now publicly available on HuggingFace Hub. The GeMS_A10.hdf5 file is a subset that we used for pre-training. I will upload more subsets this week.
Is there any reason to avoid using variable-length arrays for peaks in HDF5, in favor of zero-padding?
I'm unaware of any performance or disk-space effects that might arise from using them, but having the ability to choose the number of peaks after the initial DB generation might make working with the data more flexible.
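For reference, the two layouts being compared can be sketched with h5py roughly as follows. The dataset names `mz_vlen` and `mz_padded`, the padding length, and the toy spectra are hypothetical and not taken from the GeMS files.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical toy spectra with different peak counts.
spectra = [np.array([100.0, 150.5, 200.2]),
           np.array([80.1]),
           np.array([50.0, 60.0, 70.0, 90.0])]
max_peaks = 8  # assumed padding length

path = os.path.join(tempfile.mkdtemp(), "peaks_demo.hdf5")

with h5py.File(path, "w") as f:
    # Variable-length layout: one ragged float64 array per spectrum.
    vlen_type = h5py.vlen_dtype(np.dtype("float64"))
    vlen = f.create_dataset("mz_vlen", (len(spectra),), dtype=vlen_type)
    for i, s in enumerate(spectra):
        vlen[i] = s

    # Zero-padded layout: fixed-shape 2-D array, gzip-compressed.
    padded = np.zeros((len(spectra), max_peaks))
    for i, s in enumerate(spectra):
        padded[i, :len(s)] = s
    f.create_dataset("mz_padded", data=padded, compression="gzip")

with h5py.File(path, "r") as f:
    ragged = [f["mz_vlen"][i] for i in range(len(spectra))]  # arrays of differing length
    dense = f["mz_padded"][:]                                # one (3, 8) array
```

The ragged rows come back with different lengths, so batching them for a neural network would likely need a custom `collate_fn`, whereas the padded array can be stacked directly.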