agilebean opened 3 years ago
The script checks if `torch.cuda.is_available()` is True (in addition to `--use-cuda`), and defaults back to CPU otherwise. It possibly means that Colab doesn't forward to the script that you have CUDA available: it is available in your notebook (since you can do `torch.version.cuda` and `torch.device` in the notebook, I imagine), but isn't available in the script.
I don't know much about how Colab works with external scripts. But something you can try, to validate this, is to create a small script like

```python
import torch
print(f'Torch available: {torch.cuda.is_available()}')
```

and call it with the `!python` magic command, to ensure that it has access to the GPU.
You were right! But very strange indeed!
I didn't execute it as an external script, since Google Colab works with Jupyter notebooks in interactive mode only. As I need it in a notebook cell, I ran
```python
import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('Device:', torch.device('cuda:0'))
print(f'CUDA available: {torch.cuda.is_available()}')
```
and got:

```
Torch 1.5.1 CUDA 10.2
Device: cuda:0
CUDA available: False
```
It still doesn't solve my problem, but you found the root cause. It has nothing to do with your code, but rather with torch making CUDA available. It is strange, as the CUDA version and the CUDA device are found.
Will keep you updated if I find a solution.
Update:
The problem was that torch version 1.5.1 came with CUDA 10.2, yet the NVIDIA driver on Google Colab currently provides CUDA 10.1.
For anyone who wants to run the code in Google Colab: you must install the torch 1.6 build for CUDA 10.1, so that torch's CUDA version matches the CUDA driver installed on Google Colab:

```
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
```
Verify the CUDA version that PyTorch sees with:

```python
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
```
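To compare against the system side, you can also check which CUDA version the Colab VM itself provides (standard NVIDIA tools, run from a notebook cell; `nvcc` is only present if the CUDA toolkit is installed):

```python
!nvidia-smi       # driver version and the CUDA version it supports
!nvcc --version   # CUDA toolkit version, if installed on the VM
```

The driver's CUDA version must be at least as new as the one torch was built with, which is why the `+cu101` build above fixes the mismatch.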
@tristandeleu Now that I have train.py running with CUDA, I still don't see much of a speedup compared to my Macbook, even with an NVIDIA V100. Even with batch-size 2 and num-batches 2, training takes 128s (Macbook: 10s).
Do you have any idea why this is still so slow? Are there any time benchmarks you can share? I could share the Colab notebook if you want.
I don't have any formal benchmark for how fast this should be, unfortunately. But looking at some logs, I was getting one epoch done in 92s on a Titan Xp (this includes 100 batches of training and 250 batches of validation per epoch; among these 92s, validation was taking about 35s) for MiniImagenet 5-way 5-shot with the settings from the paper (`--num-steps 5 --step-size 0.01 --batch-size 4 --num-batches 100 --hidden-size 32`) and 8 workers for data loading.
Hope this helps!
This definitely helps as a reference. Will do more tests and report back.
It seems that the problem is solved, although I cannot reproduce the reason. All I did was restart Google Colab and install a few extensions. It seems that the availability of the GPU on Colab fluctuates, as the throughput varies between 1.2 and 8.2 it/s within one session!
The benchmark for your run:

```
!python train.py /content/sync/data \
    --dataset miniimagenet \
    --num-ways 5 \
    --num-shots 5 \
    --num-steps 5 \
    --step-size 0.1 \
    --batch-size 4 \
    --num-batches 100 \
    --num-epochs 1 \
    --hidden-size 32 \
    --num-workers 8 \
    --output-folder /content/sync/output \
    --use-cuda \
    --verbose
```
ran smoothly in 80s per epoch using an NVIDIA V100.
Only one detail was missing to reproduce your experiment: can you tell me how to specify your configuration of 250 batches of validation per epoch?
Good news! I finally found the root cause of the huge time delay: my script read and wrote to folders which were synced to Google Drive. "Synced" actually meant "being synced", and that caused the bottleneck. When I changed the folders to ones directly on the Google Colab VM, the time per epoch went down from 240s to 0.2s in the previous test! And as mentioned before, running the benchmark from your paper took about 82 seconds per epoch. To be more precise about this, I would still be grateful if you could tell me how to configure the 250 batches of validation per epoch!
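For anyone hitting the same bottleneck, a minimal sketch of the workaround (all paths here are hypothetical examples): copy the data from Google Drive to the local VM disk once, train against local folders, and only sync the results back at the end:

```python
# Mount Google Drive once, then keep all training I/O on the local VM disk.
# Paths below are hypothetical examples.
from google.colab import drive
drive.mount('/content/drive')

!cp -r /content/drive/MyDrive/data /content/data      # one-time copy for fast local reads
!mkdir -p /content/output                             # write results locally, too

# ... run train.py with /content/data and /content/output ...

!cp -r /content/output /content/drive/MyDrive/output  # single bulk copy back to Drive
```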
That is good news! I/O can be a really big factor.
The 250 batches of validation per epoch is something I have in an internal version of the code: it comes from the fact that I am using a fixed subset of 1000 tasks from the meta-validation split for the evaluation at each epoch (with a batch size of 4, that makes 250 batches). The code in this repo uses `num_batches * batch_size` random tasks from the meta-validation split at each epoch (they are different at every epoch). Using a fixed subset of tasks for evaluation is a better option, much closer to the way we do validation in standard supervised learning.
I won't be able to push the corresponding code, because it has a number of internal dependencies unfortunately. However I can give you some steps to reproduce it yourself if you want:

- Create the subset of indices for a specific dataset; I'm taking MiniImagenet 5-way 5-shot as an example. I have a utility function to do that (which is not perfect, because it could return fewer tasks than requested, but that did the trick for me):

```python
import random
from torchmeta.datasets.helpers import miniimagenet

def create_indices(dataset, num_tasks):
    indices = set()
    for _ in range(num_tasks):
        indices.add(tuple(random.sample(range(len(dataset.dataset)),
                                        dataset.num_classes_per_task)))
    # The set deduplicates tasks, so this may return fewer than num_tasks
    indices = list(list(x) for x in indices)
    return indices

dataset = miniimagenet('data', shots=5, ways=5, meta_val=True)
indices = create_indices(dataset, 1000)  # Sample 1000 tasks
```
- Save the indices in a separate file to freeze this subset of indices and reload it later:

```python
import json

with open('path/to/val_indices/miniimagenet_5way_5shot.json', 'w') as f:
    json.dump(indices, f)
```
- In `get_benchmark_by_name`, use the `Subset` class from PyTorch to only take a subset of `meta_val_dataset`:

```python
import json
from torch.utils.data import Subset

# Load the frozen indices (e.g. using json)
with open('path/to/val_indices/miniimagenet_5way_5shot.json', 'r') as f:
    meta_val_indices = json.load(f)

meta_val_dataset = Subset(meta_val_dataset, meta_val_indices)
```
This is missing a lot of the logic (e.g. how to fetch the correct json file for specific `dataset`, `shots` and `ways` arguments of `get_benchmark_by_name`), but I hope this helps!
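A hypothetical sketch of that missing glue logic (the helper `load_val_indices` and the file-naming convention `{dataset}_{ways}way_{shots}shot.json` are assumptions for illustration, not part of the repo):

```python
import json
import os

def load_val_indices(root, dataset, ways, shots):
    # Hypothetical naming convention: '{dataset}_{ways}way_{shots}shot.json'
    path = os.path.join(root, f'{dataset}_{ways}way_{shots}shot.json')
    with open(path, 'r') as f:
        return json.load(f)

# Inside get_benchmark_by_name (sketch):
# meta_val_indices = load_val_indices('path/to/val_indices', name, num_ways, num_shots)
# meta_val_dataset = Subset(meta_val_dataset, meta_val_indices)
```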
This is super helpful, thanks a lot. Coming from psychology, it will be challenging to implement this function flawlessly, but I will try. Thanks again!
I am running train.py successfully on my local machine (Macbook Pro 16). Yet in Google Colab, it seems to take an endless time to start the first epoch (the empty progress bar is shown).
I verified that CUDA is available with a check along these lines (the same snippet quoted earlier in the thread):
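```python
# Reconstructed from the snippet quoted earlier in this thread
import torch
print('Torch', torch.__version__, 'CUDA', torch.version.cuda)
print('Device:', torch.device('cuda:0'))
```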
The script starts as:
and gives:
So this makes 242s, or 4 min, per iteration, whereas the same configuration and identical code on my Macbook Pro without GPU takes only about 2.4s per iteration, a factor of 100.
Why is this the case?
A hint is that the GPU is not used, as Colab shows a popup window after some minutes saying:
Warning: you are connected to a GPU runtime, but not utilizing the GPU. Change to a standard runtime
One difference is 16 CPU cores on the Macbook vs. 2 in Google Colab. This doesn't account for the factor of 100 between them, but might be a hint. I am convinced that this code must be super fast when running on an NVIDIA P100, so I would be very grateful for any hints!