mahmoodlab / CLAM

Data-efficient and weakly supervised computational pathology on whole slide images - Nature Biomedical Engineering
http://clam.mahmoodlab.org
GNU General Public License v3.0

Why isn't batch size an argument for main.py? #9

Closed tmabraham closed 4 years ago

tmabraham commented 4 years ago

Going through the code, I see that the batch size isn't an argument to main.py. I later found out that it is hard-coded to 1, which seems like an odd choice. Why can't the batch size be higher? Is there something wrong with using larger batch sizes, or is the challenge collating slide bags with a variable number of tiles?

Similarly, why is num_workers hard-coded to 4? The optimal value will depend on the workstation.
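
To make the suggestion concrete, this is roughly what I had in mind (just a sketch; the flag names below are my own, not existing main.py arguments):

```python
# Hypothetical sketch of exposing these settings as CLI flags in main.py.
# The flag names are my own suggestion, not CLAM's actual arguments.
import argparse

parser = argparse.ArgumentParser(description='CLAM training (sketch)')
parser.add_argument('--batch_size', type=int, default=1,
                    help='bags per training batch (currently hard-coded to 1)')
parser.add_argument('--num_workers', type=int, default=4,
                    help='DataLoader worker processes (currently hard-coded to 4)')
args = parser.parse_args()

# The values would then be passed through to the DataLoader, e.g.
# DataLoader(dataset, batch_size=args.batch_size, num_workers=args.num_workers)
```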

fedshyvana commented 4 years ago

Yes, the batch size is 1 here because there isn't really a natural way to collate slide bags with a variable number of tiles. It is probably possible if you first merge multiple slides into a single bag and keep track of the number of tiles in each bag so that you can unflatten the bags later, but you would also have to find a workaround for attention pooling, among other things (a rough sketch of what I mean is at the end of this comment). Another issue is the large variation in bag sizes between slides: for my datasets, bag sizes can be in the 1,000s but go up to the 100,000s, so with batch size > 1 there is no telling when a batch might explode during training, since bags are sampled randomly. Memory is one issue, but I think this could also create a bottleneck if some batches are much slower than others because the GPU cannot process them quickly enough.

Similarly, feel free to play around with num_workers; I didn't tune it in my case either. I looked at online threads for a rule of thumb, but most people just say "it depends on your dataset, GPU, CPU, kind of data processing pipeline, etc., and more workers won't necessarily be more efficient", so I left it at 4 since people typically seemed to use between 2 and 8.
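
To make the merge-and-unflatten idea concrete, here is a rough sketch of the kind of collate function I mean (not something that exists in the repo; it assumes pre-extracted feature bags and integer labels):

```python
import torch

def collate_concat_bags(batch):
    """Hypothetical collate_fn: merge several variable-size bags into one
    flat tensor and remember the per-bag tile counts so the bags can be
    unflattened later. Assumes each item is (features [n_tiles, dim],
    integer label) with pre-extracted features."""
    feats = torch.cat([features for features, _ in batch], dim=0)   # [sum_n, dim]
    lengths = [features.shape[0] for features, _ in batch]
    labels = torch.tensor([label for _, label in batch])
    return feats, lengths, labels

# After the shared layers, the flat output can be split back into bags:
#   per_bag = torch.split(flat_output, lengths, dim=0)
# but attention pooling would still have to be applied bag by bag (or masked).
```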

tmabraham commented 4 years ago

@fedshyvana These are some good points. They do bring up the question, though: how might the speed of the models be improved? I have a fairly large dataset, and it takes 10 minutes per epoch, so running it for 200 epochs takes a really long time. I have access to several TITAN RTXs, so I was wondering whether increasing the batch size might help, but maybe that is not an option? Right now, nvidia-smi shows very little GPU usage. Do you know how to improve the performance of the code during training?

fedshyvana commented 4 years ago

Haha, may I ask what the size of your dataset is? When I trained on a dataset of nearly 20,000 slides for a different project, I remember training took a little over a day, which I thought was acceptable. Note that 200 epochs is the maximum I set in the code by default. If you use --early_stopping, it will monitor the validation loss each epoch, stop training early if the loss does not improve for 20 consecutive epochs (currently hard-coded in core_utils.py), and save the best model. Depending on your task, I don't think it will usually take more than 80 epochs to finish training. I have also been making and testing some optimizations to the code during my revision that I can probably share once they are polished.
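
For reference, the early-stopping behavior is roughly the following (a simplified sketch of what I described; the actual class in core_utils.py may differ in its details):

```python
import torch

class EarlyStopping:
    """Simplified sketch: stop once validation loss has not improved for
    `patience` consecutive epochs, checkpointing the best model along the
    way. The real class in core_utils.py may differ in its details."""

    def __init__(self, patience=20, ckpt_path='best_model.pt'):
        self.patience = patience
        self.ckpt_path = ckpt_path
        self.counter = 0
        self.best_loss = float('inf')
        self.early_stop = False

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
            torch.save(model.state_dict(), self.ckpt_path)  # keep the best model
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True

# In the training loop:
#   stopper(val_loss, model)
#   if stopper.early_stop: break
```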

tmabraham commented 4 years ago

@fedshyvana I have a dataset of ~10,000 slides. I think I was just disappointed that the code isn't even using the GPUs very much, because I would expect that with more powerful GPUs it would be faster. Maybe the code is IO-limited? This is even after increasing num_workers.

fedshyvana commented 4 years ago

IO definitely plays a role if you are certain the rest of your hardware is capable enough, which it sounds like it is. I noticed that when training on Google Cloud, loading data from an NVMe SSD instead of an HDD makes the code run 2-3 times faster, so make sure you use an SSD. Another thing to note is that GPU memory usage doesn't necessarily tell you how hard the GPU is working. My understanding is that if you run the same code on a 2080 Ti versus a five-year-old K80, the K80 actually has more memory, and of course the same tensor occupies the same amount of memory on both GPUs, but the 2080 Ti will run much faster due to factors like faster and more numerous CUDA cores. But I agree, there is probably room to speed this up on more powerful hardware.
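
If you want to check whether you are actually IO-bound, a rough way is to time data loading separately from the GPU work, something like the sketch below (the loader/model/criterion/optimizer arguments are placeholders for whatever you are using):

```python
import time
import torch

def profile_epoch(loader, model, criterion, optimizer, device='cuda'):
    """Rough check for an IO bottleneck: compare time spent waiting on the
    DataLoader against time spent in forward/backward on the GPU.
    Sketch only; pass in whatever loader/model/criterion/optimizer you use."""
    data_time, compute_time = 0.0, 0.0
    end = time.time()
    for features, label in loader:
        data_time += time.time() - end               # time waiting on IO / workers

        features, label = features.to(device), label.to(device)
        start = time.time()
        loss = criterion(model(features), label)     # forward
        optimizer.zero_grad()
        loss.backward()                              # backward
        optimizer.step()
        if device == 'cuda':
            torch.cuda.synchronize()                 # make GPU timing meaningful
        compute_time += time.time() - start

        end = time.time()
    print(f'data loading: {data_time:.1f}s | GPU compute: {compute_time:.1f}s')
```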

tmabraham commented 4 years ago

Do you have any intuition or speculation as to what can be improved in the code to speed it up?

fedshyvana commented 4 years ago

> @fedshyvana I have a dataset of ~10,000 slides. I think I was just disappointed that the code isn't even using the GPUs very much, because I would expect that with more powerful GPUs it would be faster. Maybe the code is IO-limited? This is even after increasing num_workers.

The codebase is certainly not meant to be optimized for every hardware configuration. I am not aware of any publicly available pipeline that can train on large-scale histology datasets like these faster than this one currently does. After some research, it appears that the TITAN RTX is only marginally faster (~8%) than a 2080 Ti: https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/

> @fedshyvana Do you have any intuition or speculation as to what can be improved in the code to speed it up?

Given that the TITAN RTX has a large VRAM capacity, it might benefit from an implementation that uses a larger batch size; however, besides the other potential issues mentioned above, IO might also become a bottleneck, so it might not pan out.
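
Just to sketch what such an implementation could look like (purely hypothetical, not something in the repo): bags could be padded to a common length and the attention scores masked so that padded tiles are ignored. This assumes pre-extracted features and integer labels:

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

def collate_pad_bags(batch):
    """Hypothetical collate_fn: pad variable-size feature bags to a common
    length and return a boolean mask marking the real (non-padded) tiles.
    Assumes each item is (features [n_tiles, dim], integer label)."""
    feats = [features for features, _ in batch]
    labels = torch.tensor([label for _, label in batch])
    lengths = torch.tensor([f.shape[0] for f in feats])
    padded = pad_sequence(feats, batch_first=True)                     # [B, max_n, dim]
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]   # [B, max_n]
    return padded, mask, labels

def masked_attention_pool(feats, scores, mask):
    """Attention-pool each bag while ignoring padded tiles.
    feats: [B, max_n, dim], scores: [B, max_n], mask: [B, max_n] (bool)."""
    scores = scores.masked_fill(~mask, float('-inf'))  # padded tiles get zero weight
    weights = F.softmax(scores, dim=1)                 # [B, max_n]
    return torch.einsum('bn,bnd->bd', weights, feats)  # [B, dim]
```

That said, padding every bag up to the largest one in the batch would waste a lot of memory given how uneven the bag sizes are here, which is another reason I have not pursued it.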

tmabraham commented 4 years ago

@fedshyvana Thank you for your response. In the future, I would be interested in working to improve the performance of the code. Let me know if this is something you think would be interesting and valuable to work on :)

fedshyvana commented 4 years ago

> @fedshyvana Thank you for your response. In the future, I would be interested in working to improve the performance of the code. Let me know if this is something you think would be interesting and valuable to work on :)

Haha, I am also continually on the lookout for ways to optimize the pipeline during my revision and as I apply it to other projects. Maybe it is worth holding onto some ideas for now; for more concrete plans or interesting collaboration ideas, feel free to reach out to me or my PI privately.

tmabraham commented 4 years ago

@fedshyvana Yes, our lab is interested in potential collaboration opportunities, and I am in contact with your PI via email, but I do not have your email address. Please either share it here or send me an email at tmabraham@ucdavis.edu... I would love to contact you about potential ideas and questions!