idptools / parrot

Python package for protein sequence-based bidirectional recurrent neural networks. Generalizable to a variety of protein bioinformatics applications.
MIT License

Multi-GPU support in parrot #11

Closed jlotthammer closed 8 months ago

jlotthammer commented 2 years ago

Starting a conversation about multi-GPU support for parrot. It seems we've sort of done this in the past, but it was never formally implemented in parrot. I've created a branch named parrot-parallel to address this. There are a number of ways we could approach it; I'm going to suggest the easier of two options, but I suppose we should discuss below.

The two main candidates are PyTorch's DataParallel and DistributedDataParallel. Both parallelize training, but there are some differences between them. DataParallel is the simpler of the two, and personally I think it's the route we should go, since inventing work for ourselves probably isn't that wise. In contrast, DistributedDataParallel is a larger engineering effort, but it is the recommended way to do multi-GPU training and has wider support - e.g., as the name suggests it also supports distributed (multi-node) training, and it's faster than DataParallel.
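For a sense of why DistributedDataParallel is the bigger lift, here's a rough, untested sketch of the pieces it would need (one process per GPU, a process group, and a DistributedSampler). The function and argument names here are illustrative only, not anything on the parrot-parallel branch:

```python
# Illustrative DDP sketch only - not code from the parrot-parallel branch.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def ddp_train(rank, world_size, dataset, make_model):
    # One process per GPU; assumes MASTER_ADDR/MASTER_PORT are set in the environment.
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = make_model().to(rank)
    model = DDP(model, device_ids=[rank])

    # DistributedSampler shards the dataset so each process sees a unique subset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # ... the usual training loop over `loader` would go here ...

    dist.destroy_process_group()
```

This would typically be launched with torchrun or torch.multiprocessing.spawn, which is a big part of the extra engineering effort compared with simply wrapping the model in nn.DataParallel.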

There is a caveat we should be cognizant of here, specifically regarding DataParallel with RNNs. I'd imagine we'd have to make this modification since parrot is an RNN, but to be honest I'm not 100% sure whether it applies to all RNNs or if there are conditions where it doesn't, because I'd thought that @ryanemenecker had done some parallel training before. Regardless, it'd be useful if you (@degriffith) could take a glance at some point when you get a chance.
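For context, the caveat in the PyTorch docs is that DataParallel splits each batch across GPUs, so a replica whose sub-batch has a shorter maximum sequence length can produce outputs whose time dimension doesn't match when the results are gathered. The documented workaround is to pass a fixed total_length to pad_packed_sequence inside forward(). A minimal, hedged sketch - the toy module below is illustrative and not parrot's actual BRNN_MtO class:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class PackedBiLSTM(nn.Module):
    """Toy module showing the DataParallel-friendly pack/pad pattern."""

    def __init__(self, input_size, hidden_size, total_length):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            batch_first=True, bidirectional=True)
        self.total_length = total_length  # max length over the whole dataset

    def forward(self, padded_input, lengths):
        packed = pack_padded_sequence(padded_input, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        packed_out, _ = self.lstm(packed)
        # total_length forces every DataParallel replica to emit the same
        # time dimension, so the gathered outputs line up.
        output, _ = pad_packed_sequence(packed_out, batch_first=True,
                                        total_length=self.total_length)
        return output
```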

Effectively, what I'm proposing involves very few code changes - I think I've done it (albeit untested) on the parrot-parallel branch (if we want to go down this route we can submit a PR from there).

e.g.,

```python
parser.add_argument('--gpu-ids', nargs='+', dest='gpu_ids', type=int, default=None,
                    help='List of GPU IDs to train the network')

gpu_ids = args.gpu_ids

### 1, 2, skip a few lines of code ###

if dtype == 'sequence':
    if gpu_ids:
        # Use a many-to-one architecture
        brnn_network = brnn_architecture.BRNN_MtO(input_size, hidden_size,
                                                  num_layers, num_classes, device)
        brnn_network = nn.DataParallel(brnn_network, device_ids=gpu_ids)  # user-specified list of GPU IDs
        brnn_network.to(device)
    else:
        # Use a many-to-one architecture
        brnn_network = brnn_architecture.BRNN_MtO(input_size, hidden_size,
                                                  num_layers, num_classes, device)
        brnn_network = nn.DataParallel(brnn_network)  # defaults to all available GPUs
        brnn_network.to(device)
```

Let me know if anyone has thoughts! There are probably other ways we could do this, but given that parrot-train is the de facto way of training, I figure the simplest is probably best.

degriffith commented 2 years ago

In our previous work training on a multi-GPU lab machine we used DataParallel and that seemed to work fairly well. I agree that we'll have to look out for some pitfalls since, as you point out, RNNs are inherently a bit finicky with parallelization. In the current implementation sequences are manually padded to the same length, but we'll have to make sure that the max length is derived from the complete dataset and not just the individual batch (see the sketch below).
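To make that padding point concrete, here is a rough sketch of a collate function that pads every batch to a dataset-wide maximum length rather than the per-batch maximum; the helper name and the (sequence tensor, target) item layout are hypothetical, not parrot's actual data pipeline:

```python
import torch


def make_collate_fn(global_max_len, pad_value=0.0):
    """Pad every batch to the same dataset-wide length (hypothetical helper)."""

    def collate(batch):
        # batch: list of (seq_tensor of shape [L, D], target) pairs
        feats, targets = zip(*batch)
        n_features = feats[0].shape[1]
        padded = torch.full((len(feats), global_max_len, n_features), pad_value)
        lengths = torch.tensor([f.shape[0] for f in feats])
        for i, f in enumerate(feats):
            padded[i, :f.shape[0]] = f
        return padded, lengths, targets

    return collate


# global_max_len is computed once over the full dataset, not per batch, e.g.:
# global_max_len = max(len(seq) for seq in all_sequences)
```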

In terms of implementation, I think it would be better if we didn't require the user to manually specify the IDs of the GPUs they wanted to use, as this might not be the easiest thing to figure out for a naive user. Instead could we have the command-line argument be something like "--n-gpus" where they just provide a number?

Alternatively we could create a basic helper function/script that returns the IDs/number of GPUs available for training. I'm envisioning a command like parrot-gpus that writes the available GPU IDs to the console. Not sure how much work this would be to implement, but something to consider and discuss.
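That helper could probably stay very small; here's a hypothetical sketch of what a parrot-gpus console script might look like (nothing in it exists in parrot today):

```python
import torch


def print_available_gpus():
    """Hypothetical parrot-gpus entry point: list the visible CUDA devices."""
    if not torch.cuda.is_available():
        print('No CUDA-capable GPUs detected.')
        return
    for gpu_id in range(torch.cuda.device_count()):
        print(f'GPU {gpu_id}: {torch.cuda.get_device_name(gpu_id)}')


if __name__ == '__main__':
    print_available_gpus()
```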

jlotthammer commented 2 years ago

> In our previous work training on a multi-GPU lab machine we used DataParallel and that seemed to work fairly well. I agree that we'll have to look out for some pitfalls since, as you point out, RNNs are inherently a bit finicky with parallelization. In the current implementation sequences are manually padded to the same length, but we'll have to make sure that the max length is derived from the complete dataset and not just the individual batch.

I haven't looked into why, but in the background over the last few days I've retrained a few different versions of metapredict-v2 using DataParallel, and it definitely throws some warnings about very inefficient memory utilization:

```
NN module weights are not part of single contiguous chunk of memory.
This means they need to be compacted at every call, possibly greatly increasing memory usage.
To compact weights again call flatten_parameters().
(Triggered internally at  /opt/conda/conda-bld/pytorch_1656352657443/work/aten/src/ATen/native/cudnn/RNN.cpp:968.)
  result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
```
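That warning comes from cuDNN: every time DataParallel replicates the model, the RNN weights in each replica are no longer laid out in one contiguous chunk of memory. The usual workaround (as the message suggests) is to call flatten_parameters() on the RNN at the top of forward(); a minimal sketch, assuming the LSTM lives in an attribute called self.lstm:

```python
import torch.nn as nn


class FlattenedLSTM(nn.Module):
    """Toy module showing the flatten_parameters() workaround."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            batch_first=True, bidirectional=True)

    def forward(self, x):
        # Re-compact the LSTM weights into one contiguous memory block so
        # cuDNN doesn't have to do it (and warn) on every DataParallel call.
        self.lstm.flatten_parameters()
        out, _ = self.lstm(x)
        return out
```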

> In terms of implementation, I think it would be better if we didn't require the user to manually specify the IDs of the GPUs they wanted to use, as this might not be the easiest thing to figure out for a naive user. Instead could we have the command-line argument be something like "--n-gpus" where they just provide a number?

Yeah, so at present it is configured by default to use all available GPUs, since the if gpu_ids check evaluates to false by default - so from an ease-of-use standpoint, the user wouldn't have to specify this unless they wanted fine-grained control over precisely which GPUs to use.

I agree that --n-gpus is the more elegant solution and it was the first thing I tried, but I ran into a hypothetical obstacle. I was constructing a list of GPU indices from the --num-gpus flag in a simple way - i.e., gpu_ids = range(num_gpus) - but I realized it reduced flexibility. For instance, consider someone with 4 GPUs where 2 of them are being used for another task (rendering, gaming, a different training run, being used by someone else on a cluster, mining bitcoin, whatever): with my num_gpus implementation I don't think resources would be allocated wisely. That is, we'd always be specifying device IDs 0 through n-1, which may not be advantageous or what's desired - e.g., if the user had 2 GTX 1080s [0, 1] and 2 RTX 3090s [2, 3], I would've always ended up assigning the two weaker GPUs.

> Alternatively we could create a basic helper function/script that returns the IDs/number of GPUs available for training. I'm envisioning a command like parrot-gpus that writes the available GPU IDs to the console. Not sure how much work this would be to implement, but something to consider and discuss.

To your point though, my quick-and-dirty first pass isn't the most elegant, and this seems better. There's probably a Pythonic way (maybe even in torch?) to access not only how many GPUs you have, but also which GPUs are available. Then we could use that in combination with --num-gpus?
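For what it's worth, torch can report the device count and names directly (as in the parrot-gpus sketch above), and recent versions also expose per-device free memory via torch.cuda.mem_get_info, which could serve as a crude "is this GPU busy?" heuristic. A hedged sketch - the threshold and helper name are arbitrary illustrations:

```python
import torch


def guess_free_gpu_ids(min_free_fraction=0.9):
    """Return IDs of visible GPUs whose memory is mostly unused (heuristic)."""
    free_ids = []
    for gpu_id in range(torch.cuda.device_count()):
        free_bytes, total_bytes = torch.cuda.mem_get_info(gpu_id)
        if free_bytes / total_bytes >= min_free_fraction:
            free_ids.append(gpu_id)
    return free_ids


# e.g., combined with a --num-gpus flag:
# gpu_ids = guess_free_gpu_ids()[:num_gpus]
```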

jlotthammer commented 8 months ago

Closing this issue, as this will technically be handled by parrot-lightning.