mesnico / RelationNetworks-CLEVR

A PyTorch implementation of "A simple neural network module for relational reasoning", working on the CLEVR dataset
MIT License

About the running time and GPU memory usage #7

Closed LinkToPast1990 closed 5 years ago

LinkToPast1990 commented 5 years ago

Hi @mesnico, could you share which GPU you used and how long it takes to train this network?

mesnico commented 5 years ago

Hi @LMdeLiangMi, I used 2 Tesla K40 GPUs, for a total of 48 GB of VRAM. It took about 30-40 minutes per epoch. I trained for about 350 epochs before reaching good convergence for the from-pixels version, so in the end my training took about 10 days.
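As a rough sanity check of the figures above, the per-epoch time and epoch count can be converted into days. This is only a back-of-the-envelope sketch; the helper name `training_days` is made up for illustration and is not part of the repository.

```python
MINUTES_PER_DAY = 24 * 60


def training_days(epochs, minutes_per_epoch):
    """Convert an epoch count and a per-epoch duration into total days."""
    return epochs * minutes_per_epoch / MINUTES_PER_DAY


# Figures quoted in the comment above: about 350 epochs at 30-40 minutes each.
low = training_days(350, 30)
high = training_days(350, 40)
print(f"Estimated training time: {low:.1f} to {high:.1f} days")
```

The estimate lands between roughly 7 and 10 days, which is consistent with the reported total of about 10 days.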

LinkToPast1990 commented 5 years ago

I found that the data loader is very slow because it does the image processing on the CPU. I am trying to write a new loader based on NVIDIA DALI.

mesnico commented 5 years ago

@LMdeLiangMi I had never heard of DALI, it seems interesting! However, check your CPU utilization. If it is low, I can tell you that during my experiments the disk was very often the bottleneck. Consider moving the CLEVR dataset onto a solid-state drive, if you haven't already. You should observe higher utilization of both CPUs and GPUs, together with an overall training speedup.
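One way to check whether the input pipeline (disk or CPU preprocessing) is the bottleneck is to iterate the data loader on its own, without running the model, and see how many batches per second it delivers. The following is a minimal sketch under that assumption; `measure_throughput` and `dummy_loader` are hypothetical names, and the dummy list only stands in for the real training loader.

```python
import time


def measure_throughput(loader, max_batches=50):
    """Iterate the loader alone and return the observed batches per second."""
    start = time.perf_counter()
    count = 0
    for _ in loader:
        count += 1
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Stand-in iterable; replace it with the real training data loader.
dummy_loader = [list(range(1000)) for _ in range(100)]
rate = measure_throughput(dummy_loader, max_batches=20)
print(f"{rate:.1f} batches/s")
```

If the loader alone is not much faster than a full training step, the GPUs are probably waiting on data, and a faster disk, keeping the dataset in memory, or a different loader is likely to help.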

LinkToPast1990 commented 5 years ago

@mesnico I loaded the dataset into memory and used DALI, so now it is okay. By the way, could you tell me why 1 is subtracted from the label in utils.py? label = (label - 1).squeeze(1)

mesnico commented 5 years ago

@LMdeLiangMi I'm glad you solved the problem.

> By the way, could you tell me why 1 is subtracted from the label in utils.py?

As you can see in the function build_dictionaries(), I used one-based indexing when constructing the dictionaries, both for the questions and for the answers. This is because index 0 is usually reserved for padding (padding is not needed for the answers; I did it for consistency with the questions dictionary). However, when preparing the data for the network, I need to shift all the answer indexes back by one; otherwise there would be a useless output neuron corresponding to the dummy index 0.
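The effect of the shift can be illustrated with a small, self-contained sketch. It does not reproduce the repository's build_dictionaries(); `build_answer_dict` is a simplified, hypothetical stand-in, and plain lists are used instead of tensors.

```python
def build_answer_dict(answers):
    """Assign one-based indices in first-seen order; index 0 stays reserved."""
    answer_to_idx = {}
    for answer in answers:
        if answer not in answer_to_idx:
            answer_to_idx[answer] = len(answer_to_idx) + 1
    return answer_to_idx


answers = ["yes", "no", "cube", "yes"]
answer_to_idx = build_answer_dict(answers)

# Labels as stored in the dictionary (one-based, 0 never used for answers).
one_based = [answer_to_idx[a] for a in answers]

# Shift back by one so labels span 0..num_classes-1 with no dummy class.
zero_based = [idx - 1 for idx in one_based]
num_classes = len(answer_to_idx)

print(one_based, zero_based, num_classes)
```

Without the shift, the output layer would need num_classes + 1 neurons to cover the largest label, and the neuron at index 0 would never correspond to a real answer.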

LinkToPast1990 commented 5 years ago

I see. Thanks.