Closed: namiyousef closed this 2 years ago
The loading has now been improved (the speedup is linear with respect to `batch_size`).
The problem with the previous method of loading was that the .h5 files were being indexed on every call. The `__getitem__(self, index)` method in the `OxfordPetDataset` would query the .h5 files for a single given index. Because we were using `DataLoader(dataset, batch_size=32)`, the `__getitem__` method was called once for every index in the dataset, so we were querying the .h5 file 2210 times. This is problematic because the .h5 read bottleneck comes from the number of times you query the file.
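The slow per-index pattern looks roughly like this. This is a sketch, not the actual `OxfordPetDataset` code; the file name `images.h5` and the dataset key `"images"` are assumptions for illustration:

```python
import h5py
import torch
from torch.utils.data import Dataset

class SlowH5Dataset(Dataset):
    """Sketch of the old pattern: one .h5 query per sample."""

    def __init__(self, path="images.h5"):
        self.path = path
        with h5py.File(path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # One .h5 read per single index, i.e. len(dataset) reads
        # per epoch -- this is the bottleneck described above.
        with h5py.File(self.path, "r") as f:
            return torch.from_numpy(f["images"][index])
```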
The new solution makes effective use of the .h5 file structure. Instead of querying 2210 times, we query `math.ceil(len(dataset)/batch_size)` times. In the `__getitem__` method, the index is now a list of `batch_size` indices, so a single query fetches the whole batch.
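A minimal sketch of the batched pattern (again with the hypothetical `images.h5` / `"images"` names): `__getitem__` receives a list of consecutive indices and issues one contiguous slice read per batch:

```python
import h5py
import torch
from torch.utils.data import Dataset

class BatchedH5Dataset(Dataset):
    """Sketch: __getitem__ takes a *list* of indices and issues one
    contiguous .h5 read per batch instead of one read per sample."""

    def __init__(self, path="images.h5"):
        self.path = path
        with h5py.File(path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, indices):
        # `indices` is the list produced by the batch sampler; the ids
        # within a batch are consecutive, so one slice read suffices.
        start, stop = indices[0], indices[-1] + 1
        with h5py.File(self.path, "r") as f:
            return torch.from_numpy(f["images"][start:stop])
```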
An intuitive way to think of why this is faster is as follows: most of the cost of an .h5 query is in locating the start index (say `batch_size` is 32). Because we have already located the start index, querying `[start_index, start_index + batch_size]` adds no significant time, since no extra search is occurring. This is achieved by using a custom `DataSampling` class, as well as `BatchSampler`.
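One way to wire this up (a sketch; the issue's actual `DataSampling` class may differ, and `WeakShuffleSampler`/`fast_loader` are illustrative names) is a sampler that emits indices batch-by-batch with the batch order shuffled, wrapped in `BatchSampler` so the dataset receives the whole index list. Passing `batch_size=None` disables the `DataLoader`'s automatic batching, so each element it yields is `dataset[list_of_indices]`:

```python
import math
import torch
from torch.utils.data import BatchSampler, DataLoader, Sampler

class WeakShuffleSampler(Sampler):
    """Yield indices batch-by-batch: the batch order is shuffled, but
    indices inside each batch stay consecutive (weak shuffling)."""

    def __init__(self, dataset, batch_size):
        self.n = len(dataset)
        self.batch_size = batch_size

    def __len__(self):
        return self.n

    def __iter__(self):
        full = self.n // self.batch_size
        for b in torch.randperm(full).tolist():
            yield from range(b * self.batch_size, (b + 1) * self.batch_size)
        # The partial batch always comes last, so BatchSampler's
        # sequential grouping stays aligned with the consecutive runs.
        yield from range(full * self.batch_size, self.n)

def fast_loader(dataset, batch_size=32, drop_last=False):
    # batch_size=None disables automatic batching: BatchSampler yields
    # a list of indices, and the dataset is indexed with that list.
    return DataLoader(
        dataset,
        batch_size=None,
        sampler=BatchSampler(
            WeakShuffleSampler(dataset, batch_size),
            batch_size=batch_size,
            drop_last=drop_last,
        ),
    )
```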
An important caveat of this method is that only weak shuffling is possible. Suppose you have some data `[1,2,3,4,5,6,7,8]`. A strong shuffle completely shuffles the data before batching. A weak shuffle first batches the data and then shuffles the batches; no element-wise shuffling occurs. So a weak shuffle would look as follows:
`[1,2,3,4,5,6,7,8] --> [[1,2], [3,4], [5,6], [7,8]] --> [[3,4], [1,2], [7,8], [5,6]]`
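The transformation above can be demonstrated in a few lines of plain Python (`random.shuffle` stands in for whatever RNG the sampler uses):

```python
import random

def weak_shuffle(data, batch_size):
    """Batch first, then shuffle only the batch order."""
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    random.shuffle(batches)
    return batches

print(weak_shuffle([1, 2, 3, 4, 5, 6, 7, 8], 2))
# e.g. [[3, 4], [1, 2], [7, 8], [5, 6]] -- elements never leave their batch
```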
When using the GPU, loading the data from the .h5 files remains really slow, taking up almost 90% of the time per batch. This would ideally be improved.
Some links added for reference: