namiyousef / multi-task-learning

Repository for multi task learning

Data loading is slow, needs improvement #14

Closed · namiyousef closed this issue 2 years ago

namiyousef commented 2 years ago

When using the GPU, the loading of data from .h5 files is really slow, taking up almost 90% of the time per batch.

This would ideally be improved.

Some links added for reference:

namiyousef commented 2 years ago

Loading has now been improved (the speed-up is linear with respect to batch_size).

The problem with the previous loading method was that the .h5 files were being queried one sample at a time: the __getitem__(self, index) method in OxfordPetDataset queried the .h5 files for a single index.

Because we were using DataLoader(dataset, batch_size=32), the __getitem__ method was being called for every index in the dataset, so we were querying the .h5 file 2210 times. This is problematic, because the .h5 read bottleneck comes from the number of queries rather than the amount of data read per query.

The new solution makes effective use of the .h5 file structure. Instead of querying 2210 times, we query math.ceil(len(dataset)/batch_size) times: in the __getitem__ method, the index is now a list of (up to) batch_size indices, and a single query fetches the whole batch.
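For concreteness, here is a minimal sketch of what a batched __getitem__ could look like (the file path and the images dataset key are hypothetical placeholders, and the real OxfordPetDataset also returns labels/segmentation targets):

```python
import h5py
import torch
from torch.utils.data import Dataset


class H5BatchDataset(Dataset):
    """Sketch: __getitem__ receives a list of indices and reads the whole
    batch from the .h5 file in a single query."""

    def __init__(self, h5_path, data_key="images"):
        # the path and dataset key are hypothetical placeholders
        self.h5_path = h5_path
        self.data_key = data_key
        with h5py.File(h5_path, "r") as f:
            self.length = len(f[data_key])

    def __len__(self):
        return self.length

    def __getitem__(self, indices):
        # `indices` is a list produced by a BatchSampler, so the file is
        # queried once per batch instead of once per sample; h5py fancy
        # indexing requires the indices to be in increasing order
        with h5py.File(self.h5_path, "r") as f:
            batch = f[self.data_key][indices]
        return torch.from_numpy(batch)
```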

An intuitive way to think about why this is faster: every query to the .h5 file carries a roughly fixed overhead, so reading a whole contiguous batch in one query pays that overhead once per batch rather than once per sample.

This is achieved by using a custom DataSampling class, together with PyTorch's BatchSampler.
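A rough sketch of the wiring, assuming the standard PyTorch pattern of passing a BatchSampler as the sampler and disabling automatic batching with batch_size=None (the repo's actual DataSampling class may differ in detail):

```python
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

# hypothetical path; H5BatchDataset is the sketch from above
dataset = H5BatchDataset("oxford_pet.h5")
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=32, drop_last=False)

# batch_size=None disables automatic batching, so each element yielded by the
# BatchSampler (a list of 32 indices) is passed straight to __getitem__
loader = DataLoader(dataset, sampler=batch_sampler, batch_size=None)
```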

An important caveat of this method is that only weak shuffling is possible. Say you have the data [1,2,3,4,5,6,7,8]: a strong shuffle would completely shuffle the elements before batching.

A weak shuffle first batches the data and then shuffles the batches; no element-wise shuffling happens. So a weak shuffle would look like this: [1,2,3,4,5,6,7,8] --> [[1,2], [3,4], [5,6], [7,8]] --> [[3,4], [1,2], [7,8], [5,6]]
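One way to implement a weak shuffle is a sampler that permutes batch ids rather than sample ids. The sketch below is an assumption about how such a sampler could look, not the repo's exact DataSampling code; full batches are shuffled as units and any trailing partial batch stays at the end, so that after wrapping in BatchSampler each batch is still a contiguous, increasing slice of the .h5 file:

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Sampler


class WeakShuffleSampler(Sampler):
    """Hypothetical weak-shuffling sampler: shuffles the order of contiguous
    batches, never the samples inside a batch."""

    def __init__(self, dataset, batch_size):
        self.batch_size = batch_size
        self.dataset_length = len(dataset)
        self.n_full_batches = self.dataset_length // batch_size

    def __len__(self):
        return self.dataset_length

    def __iter__(self):
        # permute the ids of the full batches, not the individual samples
        for batch_id in torch.randperm(self.n_full_batches).tolist():
            start = batch_id * self.batch_size
            yield from range(start, start + self.batch_size)
        # emit the trailing partial batch (if any) last, unshuffled
        yield from range(self.n_full_batches * self.batch_size, self.dataset_length)


# reusing the H5BatchDataset sketch from above; __getitem__ still receives
# lists of indices because the weak sampler is wrapped in a BatchSampler
dataset = H5BatchDataset("oxford_pet.h5")
weak_loader = DataLoader(
    dataset,
    sampler=BatchSampler(WeakShuffleSampler(dataset, batch_size=32), batch_size=32, drop_last=False),
    batch_size=None,
)
```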