Closed: namiyousef closed this 2 years ago
The loading has now been improved (the speedup is linear with respect to `batch_size`).
The problem with the previous method of loading was that the .h5 files were being indexed on every call. The `__getitem__(self, index)` method in the `OxfordPetDataset` would query the .h5 files for a single given index. Because we were using `DataLoader(dataset, batch_size=32)`, the `__getitem__` method was called once for every index in the dataset, so we were querying the .h5 file 2210 times. This is problematic because the .h5 read bottleneck comes from the number of times you query the file.
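The slow per-index pattern looks roughly like this. This is a sketch, not the actual `OxfordPetDataset` code; the file name `images.h5` and the dataset key `"images"` are assumptions for illustration:

```python
import h5py
import torch
from torch.utils.data import Dataset

class SlowH5Dataset(Dataset):
    """Sketch of the old pattern: one .h5 query per sample."""

    def __init__(self, path="images.h5"):
        self.path = path
        with h5py.File(path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # One .h5 read per single index, i.e. len(dataset) reads
        # per epoch -- this is the bottleneck described above.
        with h5py.File(self.path, "r") as f:
            return torch.from_numpy(f["images"][index])
```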
The new solution makes effective use of the .h5 file structure. Instead of querying 2210 times, we query `math.ceil(len(dataset)/batch_size)` times. In the `__getitem__` method, the index is now a list of `batch_size` indices, so a single query fetches the whole batch.
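A minimal sketch of the batched pattern (again with the hypothetical `images.h5` / `"images"` names): `__getitem__` receives a list of consecutive indices and issues one contiguous slice read per batch:

```python
import h5py
import torch
from torch.utils.data import Dataset

class BatchedH5Dataset(Dataset):
    """Sketch: __getitem__ takes a *list* of indices and issues one
    contiguous .h5 read per batch instead of one read per sample."""

    def __init__(self, path="images.h5"):
        self.path = path
        with h5py.File(path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, indices):
        # `indices` is the list produced by the batch sampler; the ids
        # within a batch are consecutive, so one slice read suffices.
        start, stop = indices[0], indices[-1] + 1
        with h5py.File(self.path, "r") as f:
            return torch.from_numpy(f["images"][start:stop])
```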
An intuitive way to think of why this is faster is as follows: most of the cost of an .h5 query is in locating the start index (say `batch_size` is 32). Because we have already located the start index, querying `[start_index, start_index + batch_size]` adds no significant time, since no extra search is occurring. This is achieved by using a custom `DataSampling` class, as well as `BatchSampler`.
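One way to wire this up (a sketch; the issue's actual `DataSampling` class may differ, and `WeakShuffleSampler`/`fast_loader` are illustrative names) is a sampler that emits indices batch-by-batch with the batch order shuffled, wrapped in `BatchSampler` so the dataset receives the whole index list. Passing `batch_size=None` disables the `DataLoader`'s automatic batching, so each element it yields is `dataset[list_of_indices]`:

```python
import math
import torch
from torch.utils.data import BatchSampler, DataLoader, Sampler

class WeakShuffleSampler(Sampler):
    """Yield indices batch-by-batch: the batch order is shuffled, but
    indices inside each batch stay consecutive (weak shuffling)."""

    def __init__(self, dataset, batch_size):
        self.n = len(dataset)
        self.batch_size = batch_size

    def __len__(self):
        return self.n

    def __iter__(self):
        full = self.n // self.batch_size
        for b in torch.randperm(full).tolist():
            yield from range(b * self.batch_size, (b + 1) * self.batch_size)
        # The partial batch always comes last, so BatchSampler's
        # sequential grouping stays aligned with the consecutive runs.
        yield from range(full * self.batch_size, self.n)

def fast_loader(dataset, batch_size=32, drop_last=False):
    # batch_size=None disables automatic batching: BatchSampler yields
    # a list of indices, and the dataset is indexed with that list.
    return DataLoader(
        dataset,
        batch_size=None,
        sampler=BatchSampler(
            WeakShuffleSampler(dataset, batch_size),
            batch_size=batch_size,
            drop_last=drop_last,
        ),
    )
```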
An important caveat of this method is that only weak shuffling is possible. Suppose you have some data `[1,2,3,4,5,6,7,8]`. A strong shuffle completely shuffles the data before batching. A weak shuffle first batches the data and then shuffles the batches; no element-wise shuffling occurs. So a weak shuffle would look as follows:
`[1,2,3,4,5,6,7,8] --> [[1,2], [3,4], [5,6], [7,8]] --> [[3,4], [1,2], [7,8], [5,6]]`
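The transformation above can be demonstrated in a few lines of plain Python (`random.shuffle` stands in for whatever RNG the sampler uses):

```python
import random

def weak_shuffle(data, batch_size):
    """Batch first, then shuffle only the batch order."""
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    random.shuffle(batches)
    return batches

print(weak_shuffle([1, 2, 3, 4, 5, 6, 7, 8], 2))
# e.g. [[3, 4], [1, 2], [7, 8], [5, 6]] -- elements never leave their batch
```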
When using the GPU, loading the data from the .h5 files remains really slow, taking up almost 90% of the time per batch. This would ideally be improved.
Some links added for reference: