RaulPPelaez closed this pull request 10 months ago.
@AntonioMirarchi please review!
The Custom dataset class is incredibly inefficient. It reloads the whole file every time get()
is called to retrieve a single sample. Some intelligent caching would help. But a much better choice is to use the HDF5 dataset class. It is far more efficient.
In fact, possibly we should just make Custom create a temporary HDF5 file on startup and then load from it.
I really like this idea, Peter, thanks! I do not like Custom at all either, but it offers some convenience/simplicity that makes people choose it, so I think it is worth improving. I implemented it so that the user can instruct Custom to use HDF5 under the hood; let's see how it goes.
h5py is slower than mmap arrays. Maybe I am missing the narrative, but why are we doing something different from what we are already doing in the other dataloaders?
On Mon, Oct 16, 2023 at 11:05 AM, Antonio Mirarchi commented on this pull request:
In torchmdnet/datasets/custom.py https://github.com/torchmd/torchmd-net/pull/235#discussion_r1360353666:
    with h5py.File(hdf5_dataset, "w") as f:
        for i in range(len(files["pos"])):
            # Create a group for each file
            coord_data = np.load(files["pos"][i])
            embed_data = np.load(files["z"][i]).astype(int)
            group = f.create_group(str(i))
            num_samples = coord_data.shape[0]
            group["pos"] = coord_data
            group["types"] = np.tile(embed_data, (num_samples, 1))
            if "y" in files:
                energy_data = np.load(files["y"][i])
                group["energy"] = energy_data
            if "neg_dy" in files:
                force_data = np.load(files["neg_dy"][i])
                group["forces"] = force_data
It's just the "proper way"; in theory you should be able to move across paths in the HDF5 file (not the case here). For example, if you use create_dataset you can retrieve the pos dataset with:

    f = h5py.File(hdf5_dataset, "r")
    f["0/pos"]

While if you use the "dictionary way" you get this error: KeyError: "Unable to open object (object 'pos' doesn't exist)"
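For readers less familiar with h5py, here is a minimal, self-contained sketch of the explicit create_dataset form referred to above (the file name and array shape are made up for illustration):

    import h5py
    import numpy as np

    # Write: one numbered group per input file, with an explicit dataset inside it.
    with h5py.File("example.h5", "w") as f:
        group = f.create_group("0")
        group.create_dataset("pos", data=np.zeros((8, 5, 3), dtype=np.float32))

    # Read: the dataset can be addressed by its full path within the file.
    with h5py.File("example.h5", "r") as f:
        pos = f["0/pos"][:]
        print(pos.shape)  # (8, 5, 3)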
This PR exists because Custom was calling np.load() at each call of get(). Even with mmap mode this was really slowing down training (it is I/O bound in the end...). I changed it so that:
1. If the dataset is small enough, it is loaded entirely into RAM.
2. Otherwise, references to the mmap arrays are stored instead of reloading the file each time.
Alternatively, I added an option to transform the Custom files to HDF5, which seems to be just a little bit slower than mmap. I went ahead and also implemented the same load-to-RAM functionality in HDF5.
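To make the two code paths concrete, here is a simplified sketch of the idea (not the actual torchmd-net implementation; the class and argument names are hypothetical, and the threshold simply mirrors the 1 GB default described in the PR description below):

    import numpy as np
    import torch

    class PreloadedOrMmapArray:
        """Simplified sketch: load a .npy file fully into RAM if it is small
        enough, otherwise keep one memory-mapped handle and slice it lazily."""

        def __init__(self, path, max_ram_bytes=1 << 30):  # ~1 GB threshold
            arr = np.load(path, mmap_mode="r")
            if arr.nbytes <= max_ram_bytes:
                # Small dataset: copy everything into a regular tensor once.
                self.data = torch.from_numpy(np.ascontiguousarray(arr))
                self.mmap = None
            else:
                # Large dataset: keep the mmap reference around instead of
                # re-opening the file on every get() call.
                self.data = None
                self.mmap = arr

        def get(self, idx):
            if self.data is not None:
                return self.data[idx]
            # Copy only the requested sample out of the memory map.
            return torch.as_tensor(np.array(self.mmap[idx]))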
This is ready. @stefdoerr could you please review?
I think @raimis should review this since he worked on mmaps before. I am not too qualified, so I just commented on style and minor bugs
I would like to merge this one before the next release. @raimis, could you take a look? Thanks!
is this ready to be shipped?
Yes!
we should merge these
While replicating results from the torchmd-protein-thermodynamics repository, I experienced sluggish training speed and low GPU usage (sitting at 0% and briefly spiking to 100% at each iteration) using the following configuration file:
The referenced files take approximately 300 MB. Playing around with num_workers and the batch size did not help. Upon investigation, the issue turned out to be the Custom dataset's I/O-bound get method, which reads from disk every time it is invoked, causing the low GPU usage and slow training.
To resolve this, I implemented a preloading feature that loads the complete dataset into system memory if its size is below a user-configurable threshold, 1 GB by default. The data is stored as PyTorch tensors, which keeps it compatible with multi-threaded data loaders (num_workers). Notably, this approach does not inflate RAM usage when increasing the number of workers.
This optimization led to a roughly 20x speedup in training time for this specific setup.
On top of that, I tweaked the DataLoader options a bit.
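The thread does not list which options were changed; purely as an illustration, these are the torch.utils.data.DataLoader knobs that usually matter for an I/O-bound setup like this (all values below are assumptions, not the PR's actual settings):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset; in practice this would be the Custom/HDF5 dataset instance.
    dataset = TensorDataset(torch.randn(1024, 3))

    loader = DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,
        num_workers=4,            # parallel sample loading
        pin_memory=True,          # faster host-to-GPU transfers
        persistent_workers=True,  # keep workers alive between epochs
        prefetch_factor=4,        # batches prefetched in advance by each worker
    )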