RaulPPelaez closed this pull request 10 months ago.
@AntonioMirarchi please review!
The Custom dataset class is incredibly inefficient. It reloads the whole file every time get()
is called to retrieve a single sample. Some intelligent caching would help. But a much better choice is to use the HDF5 dataset class. It is far more efficient.
In fact, possibly we should just make Custom create a temporary HDF5 file on startup and then load from it.
I really like this idea, Peter, thanks! I do not like Custom at all either, but it offers some convenience/simplicity that makes people choose it, so I think it is worth improving. I implemented it so that the user can instruct Custom to use HDF5 under the hood; let's see how it goes.
h5py is slower than mmap arrays. Maybe I am missing the narrative, but why are we doing something different from what we are already doing in the other dataloaders?
On Mon, Oct 16, 2023 at 11:05 AM, Antonio Mirarchi commented on this pull request:
In torchmdnet/datasets/custom.py https://github.com/torchmd/torchmd-net/pull/235#discussion_r1360353666:
    with h5py.File(hdf5_dataset, "w") as f:
        for i in range(len(files["pos"])):
            # Create a group for each file
            coord_data = np.load(files["pos"][i])
            embed_data = np.load(files["z"][i]).astype(int)
            group = f.create_group(str(i))
            num_samples = coord_data.shape[0]
            group["pos"] = coord_data
            group["types"] = np.tile(embed_data, (num_samples, 1))
            if "y" in files:
                energy_data = np.load(files["y"][i])
                group["energy"] = energy_data
            if "neg_dy" in files:
                force_data = np.load(files["neg_dy"][i])
                group["forces"] = force_data
It's just the "proper way"; in theory you should be able to move across paths in the HDF5 file (not the case here). For example, if you use create_dataset you can retrieve the pos dataset with:

    f = h5py.File(hdf5_dataset, "r")
    f["0/pos"]

While if you use the "dictionary way" you get this error: KeyError: "Unable to open object (object 'pos' doesn't exist)"
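For readers less familiar with h5py, here is a minimal, self-contained sketch of the explicit create_dataset form referred to above (the file name and array shape are made up for illustration):

    import h5py
    import numpy as np

    # Write: one numbered group per input file, with an explicit dataset inside it.
    with h5py.File("example.h5", "w") as f:
        group = f.create_group("0")
        group.create_dataset("pos", data=np.zeros((8, 5, 3), dtype=np.float32))

    # Read: the dataset can be addressed by its full path within the file.
    with h5py.File("example.h5", "r") as f:
        pos = f["0/pos"][:]
        print(pos.shape)  # (8, 5, 3)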
This PR exists because Custom was calling np.load() at each call of get(). Even with mmap mode this was really slowing down training (it is I/O bound in the end...). I changed it so that:
1. If the dataset is small enough, it is loaded entirely into RAM.
2. Otherwise, references to the mmap arrays are stored instead of reloading the file each time.
Alternatively, I added an option to transform the Custom files to HDF5, which seems to be just a little bit slower than mmap. I went ahead and also implemented the same load-to-RAM functionality in HDF5.
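To make the two code paths concrete, here is a simplified sketch of the idea (not the actual torchmd-net implementation; the class and argument names are hypothetical, and the threshold simply mirrors the 1 GB default described in the PR description below):

    import numpy as np
    import torch

    class PreloadedOrMmapArray:
        """Simplified sketch: load a .npy file fully into RAM if it is small
        enough, otherwise keep one memory-mapped handle and slice it lazily."""

        def __init__(self, path, max_ram_bytes=1 << 30):  # ~1 GB threshold
            arr = np.load(path, mmap_mode="r")
            if arr.nbytes <= max_ram_bytes:
                # Small dataset: copy everything into a regular tensor once.
                self.data = torch.from_numpy(np.ascontiguousarray(arr))
                self.mmap = None
            else:
                # Large dataset: keep the mmap reference around instead of
                # re-opening the file on every get() call.
                self.data = None
                self.mmap = arr

        def get(self, idx):
            if self.data is not None:
                return self.data[idx]
            # Copy only the requested sample out of the memory map.
            return torch.as_tensor(np.array(self.mmap[idx]))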
This is ready. @stefdoerr could you please review?
I think @raimis should review this since he worked on mmaps before. I am not too qualified, so I just commented on style and minor bugs
I would like to merge this one before the next release. @raimis, could you take a look? Thanks!
is this ready to be shipped?
Yes!
we should merge these
While replicating results from the torchmd-protein-thermodynamics repository, I experienced sluggish training speed and low GPU usage (sitting at 0% and briefly spiking to 100% at each iteration) using the following configuration file:
The referenced files take approximately 300 MB. Playing around with num_workers and the batch size did not help. Upon investigation, the issue turned out to be the Custom dataset's I/O-bound get method, which reads from disk every time it is invoked, causing the low GPU usage and slow training.
To resolve this, I implemented a preloading feature that loads the complete dataset into system memory if its size is below a user-configurable threshold, 1 GB by default. The data is stored as PyTorch tensors, which keeps it compatible with multi-threaded data loaders (num_workers). Notably, this approach does not inflate RAM usage when increasing the number of workers.
This optimization led to a roughly 20x speedup in training time for this specific setup.
On top of that, I tweaked the DataLoader options a bit.
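The thread does not list which options were changed; purely as an illustration, these are the torch.utils.data.DataLoader knobs that usually matter for an I/O-bound setup like this (all values below are assumptions, not the PR's actual settings):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset; in practice this would be the Custom/HDF5 dataset instance.
    dataset = TensorDataset(torch.randn(1024, 3))

    loader = DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,
        num_workers=4,            # parallel sample loading
        pin_memory=True,          # faster host-to-GPU transfers
        persistent_workers=True,  # keep workers alive between epochs
        prefetch_factor=4,        # batches prefetched in advance by each worker
    )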