tristandeleu / pytorch-maml

An Implementation of Model-Agnostic Meta-Learning in PyTorch with Torchmeta
MIT License

train.py doesn't use GPU #14

Open agilebean opened 3 years ago

agilebean commented 3 years ago

I am running train.py successfully on my local machine (Macbook Pro 16). Yet in Google Colab, it seems to take an endless time to start the first epoch (only the empty progress bar is shown).

I verified that CUDA is available:

import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('Device:', torch.device('cuda:0'))

Torch 1.5.1 CUDA 10.2
Device: cuda:0

The script starts as:

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 1 \
--step-size 0.2 \
--batch-size 2 \
--num-batches 8 \
--num-epochs 50 \
--num-workers 2 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

and gives:

2020-12-28 14:36:17.261821: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
DEBUG:root:Creating folder `/content/sync/output/2020-12-28_143618`
INFO:root:Saving configuration file in `/content/sync/output/2020-12-28_143618/config.json`
Training:  25% 2/8 [05:55<24:10, 241.79s/it, accuracy=0.1867, loss=35.6492]

That is 242s, or about 4 minutes, per iteration, whereas the same configuration and identical code on my Macbook Pro without a GPU takes only about 2.4s per iteration, a factor of 100:

INFO:root:Saving configuration file in `/Users/chaehan/Google Drive/04 Publishing/18 Metalearning >ICAIIC/pytorch-maml/output/2020-12-28_144912/config.json`
Training:  25%|██▌       | 2/8 [00:05<00:14,  2.43s/it, accuracy=0.1867, loss=38.5514]

Why is this the case?

A hint is that the GPU is not used, as Colab shows a popup window after some minutes saying: "Warning: you are connected to a GPU runtime, but not utilizing the GPU. Change to a standard runtime". Another difference is the CPU: 16 cores on the Macbook vs. 2 in Google Colab. This doesn't account for the factor of 100 between them but might be a hint.

I am convinced that this code must be super fast when running on an NVIDIA P100. So I would be very grateful for any hints!

tristandeleu commented 3 years ago

The script checks if torch.cuda.is_available() is True (in addition to --use-cuda), and falls back to CPU otherwise. It possibly means that Colab doesn't forward to the script that you have CUDA available: it is available in your notebook (since you can call torch.version.cuda and torch.device in the notebook, I imagine), but isn't available in the script.

I don't know much about how Colab with external scripts works. But something you can try, to validate this, is to create a small script like

import torch

print(f'CUDA available: {torch.cuda.is_available()}')

and call it with the !python magic command, to ensure that it has access to GPU.
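
The fallback described above can be sketched as follows. This is a minimal illustration, not the code from train.py: `select_device` and `cuda_is_available` are hypothetical helper names, and the import guard is there only so that the snippet also runs where torch is not installed.

```python
import importlib.util


def select_device(use_cuda, cuda_available):
    """Return 'cuda' only when it is both requested and available, else 'cpu'."""
    return 'cuda' if (use_cuda and cuda_available) else 'cpu'


def cuda_is_available():
    """Ask torch whether CUDA is usable; report False if torch is missing."""
    if importlib.util.find_spec('torch') is None:
        return False
    import torch
    return torch.cuda.is_available()


# --use-cuda was passed, but the device still depends on what torch can see.
device = select_device(use_cuda=True, cuda_available=cuda_is_available())
print('Device in use:', device)
```

Run with !python on a runtime where torch cannot see the GPU, this reports cpu even though CUDA was requested. Note that torch.device('cuda:0') only constructs a device object; it does not check that a GPU is usable.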

agilebean commented 3 years ago

You were right! But very strange indeed!

I didn't execute it as an external script, as Google Colab works with Jupyter notebooks in interactive mode only. Running it in a notebook cell instead, with

import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('Device:', torch.device('cuda:0'))
print(f'CUDA available: {torch.cuda.is_available()}')

I got

Torch 1.5.1 CUDA 10.2
Device: cuda:0
CUDA available: False

It still doesn't solve my problem, but you found the root cause. It has nothing to do with your code, but rather with torch not seeing CUDA. It is strange, as the CUDA version and CUDA device are both reported.

Will keep you updated if I find a solution.

agilebean commented 3 years ago

Update: The problem was that torch 1.5.1 came built with CUDA 10.2, yet the NVIDIA driver on Google Colab currently supports CUDA 10.1. For anyone who wants to run the code in Google Colab: you must install torch 1.6 built for CUDA 10.1, so that the CUDA version of torch matches the one supported by the Colab driver, with pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Verify the correct CUDA version which pytorch can see by: print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
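
The mismatch can be expressed as a simple version comparison. As a rule of thumb for the CUDA 10.x versions discussed here, a torch build compiled against a given CUDA version needs a driver that supports at least that version. A minimal sketch, assuming the driver's supported CUDA version is read by hand (for example from the header of nvidia-smi); `parse_version` and `build_is_supported` are hypothetical helpers, not part of torch.

```python
def parse_version(version):
    """Turn a string such as '10.2' into a comparable tuple (10, 2)."""
    return tuple(int(part) for part in version.split('.'))


def build_is_supported(torch_cuda, driver_cuda):
    """True if the driver supports the CUDA version torch was built with."""
    return parse_version(torch_cuda) <= parse_version(driver_cuda)


# Values from this thread: torch 1.5.1 was built with CUDA 10.2,
# while the Colab driver supported CUDA 10.1 at the time.
print(build_is_supported('10.2', '10.1'))  # False
print(build_is_supported('10.1', '10.1'))  # True
```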

@tristandeleu Now that I have train.py running with CUDA, I still don't see much of a speedup compared to my Macbook, even with an NVIDIA V100. Even with batch-size 2 and num-batches 2, training takes 128s (Macbook: 10s).

Do you have any idea why this is still so slow? Are there any time benchmarks you can share? I could share the Colab notebook if you want.

tristandeleu commented 3 years ago

I don't have any formal benchmark for how fast this should be unfortunately. But looking at some logs, I was getting one epoch done in 92s on a Titan Xp (this includes 100 batches of training, and 250 batches of validation per epoch -- among these 92s, validation was taking about 35s) for MiniImagenet 5-way 5-shot with the settings from the paper (--num-steps 5 --step-size 0.01 --batch-size 4 --num-batches 100 --hidden-size 32) and 8 workers for data loading.

Hope this helps!
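
For comparing against other runs, the figures above can be broken down per batch. This is only arithmetic on the numbers quoted in this comment (the 35s for validation is approximate), not a measured benchmark.

```python
epoch_seconds = 92        # one full epoch on a Titan Xp
validation_seconds = 35   # approximate share of the epoch spent on validation
train_batches = 100
validation_batches = 250

# Whatever is not validation is attributed to training.
training_seconds = epoch_seconds - validation_seconds
seconds_per_train_batch = training_seconds / train_batches
seconds_per_validation_batch = validation_seconds / validation_batches

print(f'training: {training_seconds}s total, {seconds_per_train_batch:.2f}s per batch')
print(f'validation: {seconds_per_validation_batch:.2f}s per batch')
```

That is roughly 0.57s per training batch and 0.14s per validation batch, with 4 tasks per batch.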

agilebean commented 3 years ago

This definitely helps as reference. Will do more tests and report back.

agilebean commented 3 years ago

It seems that the problem is solved, although I cannot reproduce the reason. All I did was restart Google Colab and install a few extensions. It seems that the availability of the GPU on Colab fluctuates, as processing speed can vary between 1.2 and 8.2 it/s within one session!

The benchmark for your run:

# reproduce Deleu paper

!python train.py /content/sync/data \
--dataset miniimagenet \
--num-ways 5 \
--num-shots 5 \
--num-steps 5 \
--step-size 0.1 \
--batch-size 4 \
--num-batches 100 \
--num-epochs 1 \
--hidden-size 32 \
--num-workers 8 \
--output-folder /content/sync/output \
--use-cuda \
--verbose

ran smoothly in 80s per epoch using an NVIDIA V100 with

Only one detail was missing to reproduce your experiment - Can you tell me how to specify your configuration of 250 batches of validation per epoch?

agilebean commented 3 years ago

Good news! I finally found the root cause of the huge time delay: my script read from and wrote to folders which were synced to Google Drive. "Synced" actually meant "being synced", and that caused the bottleneck. When I changed the folders to ones directly on the Google Colab VM, the time per epoch went down from 240s to 0.2s in the previous test! And as mentioned before, running the benchmark from your paper took about 82 seconds per epoch. To be more precise about this, I would still be grateful if you could tell me how to configure the 250 batches of validation per epoch!
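
A rough way to check whether a folder is an I/O bottleneck is to time a small write and read in it. The sketch below is an assumption-laden probe, not part of train.py: `time_write_read` is a hypothetical helper, a single 1 MB file will not capture every effect of a synced folder, and a temporary directory stands in for VM-local storage.

```python
import os
import tempfile
import time


def time_write_read(directory, size_bytes=1_000_000):
    """Write and re-read one file in `directory`; return the elapsed seconds."""
    payload = os.urandom(size_bytes)
    path = os.path.join(directory, 'io_probe.bin')
    start = time.perf_counter()
    with open(path, 'wb') as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the data down to the underlying storage
    with open(path, 'rb') as f:
        data = f.read()
    elapsed = time.perf_counter() - start
    os.remove(path)
    assert data == payload  # sanity check: what was read is what was written
    return elapsed


# On Colab, call this once with the synced folder (e.g. /content/sync/output)
# and once with a folder on the VM's own disk, then compare the two timings.
with tempfile.TemporaryDirectory() as local_dir:
    elapsed_seconds = time_write_read(local_dir)
    print(f'local: {elapsed_seconds:.4f}s')
```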

tristandeleu commented 3 years ago

That is good news! I/O can be a really big factor.

The 250 batches of validation per epoch is something I have in an internal version of the code: it comes from the fact that I am using a fixed subset of 1000 tasks from the meta-validation split for the evaluation at each epoch. The code in this repo uses num_batches * batch_size random tasks from the meta-validation split at each epoch (they are different at every epoch). Using a fixed subset of tasks for evaluation is a better option, much closer to the way we do validation in standard supervised learning.
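
The difference between the two evaluation schemes, and where the number 250 comes from, can be illustrated in plain Python without Torchmeta. This is a simplified sketch: a task is represented as a tuple of class indices, `sample_tasks` is a hypothetical helper, and the 16 classes assumed for the miniImageNet meta-validation split should be checked against the dataset actually used.

```python
import random

NUM_VAL_CLASSES = 16  # assumption: classes in the miniImageNet meta-validation split
NUM_WAYS = 5
BATCH_SIZE = 4


def sample_tasks(num_tasks, rng):
    """Sample `num_tasks` distinct tasks, each a sorted tuple of class indices."""
    tasks = set()
    while len(tasks) < num_tasks:  # keep drawing until enough unique tasks exist
        tasks.add(tuple(sorted(rng.sample(range(NUM_VAL_CLASSES), NUM_WAYS))))
    return sorted(tasks)


# Fixed subset: sampled once with a fixed seed and reused at every epoch.
fixed_tasks = sample_tasks(1000, random.Random(0))
num_val_batches = len(fixed_tasks) // BATCH_SIZE
print(num_val_batches)  # 250

# Random tasks: a fresh draw of num_batches * batch_size tasks per epoch,
# so each epoch is evaluated on a different set of tasks.
rng = random.Random(1)
epoch_1 = sample_tasks(100 * BATCH_SIZE, rng)
epoch_2 = sample_tasks(100 * BATCH_SIZE, rng)
print(len(epoch_1), len(epoch_2))  # 400 400
```

With a fixed subset, validation accuracy is comparable across epochs because every epoch is scored on the same 1000 tasks.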

I won't be able to push the corresponding code, because it has a number of internal dependencies unfortunately. However I can give you some steps to reproduce it yourself if you want:

 - Create the subset of indices for a specific dataset; I'm taking MiniImagenet 5-way 5-shot as an example. I have a utility function to do that (which is not perfect, because it could return fewer tasks than requested, but that did the trick for me):
```python
import random
from torchmeta.datasets.helpers import miniimagenet

def create_indices(dataset, num_tasks):
    # A set drops duplicate draws, so fewer than num_tasks tasks may be returned
    indices = set()
    for _ in range(num_tasks):
        indices.add(tuple(random.sample(range(len(dataset.dataset)), dataset.num_classes_per_task)))
    indices = list(list(x) for x in indices)
    return indices

dataset = miniimagenet('data', shots=5, ways=5, meta_val=True)
indices = create_indices(dataset, 1000)  # Sample 1000 tasks
```
 - Save the indices in a separate file to freeze this subset of indices and reload it later:
```python
import json

with open('path/to/val_indices/miniimagenet_5way_5shot.json', 'w') as f:
    json.dump(indices, f)
```
 - In get_benchmark_by_name, use the Subset class from PyTorch to only take a subset from meta_val_dataset:
```python
import json
from torch.utils.data import Subset

# Load the indices (e.g. using json)
with open('path/to/val_indices/miniimagenet_5way_5shot.json', 'r') as f:
    meta_val_indices = json.load(f)

meta_val_dataset = Subset(meta_val_dataset, meta_val_indices)
```

This is missing a lot of the logic (e.g. how to fetch the correct json file for a specific dataset, shots and ways arguments of get_benchmark_by_name), but I hope this helps!

agilebean commented 3 years ago

This is super helpful, thanks a lot. Coming from psychology, it will be challenging to implement this function flawlessly but I will try. Thanks again!