Training CNN on Colab runs in duplicate and crashes

the-full-stack / fsdl-text-recognizer-2021-labs

Complete deep learning project developed in Full Stack Deep Learning, Spring 2021

https://bit.ly/berkeleyfsdl

MIT License

452 stars, 281 forks

Training CNN on Colab runs in duplicate and crashes #16

Closed by GavinR1 3 years ago

GavinR1 commented 3 years ago

I'm running Lab 2 through Colab and going through 01-look-at-emnist.ipynb as well.

When I get to the "Train a CNN model" step with:

    import pytorch_lightning as pl
    from text_recognizer.models import CNN
    from text_recognizer.lit_models import BaseLitModel

    model = CNN(data_config=data.config())
    lit_model = BaseLitModel(model=model)
    trainer = pl.Trainer(gpus=1, max_epochs=5)
    trainer.fit(lit_model, datamodule=data)

The output is printed six times over, the training epochs run extremely slowly, and Chrome becomes unresponsive and crashes after a few minutes. It almost reminds me of running a non-MPI program with multiple MPI ranks, but I don't think Colab would be set up to do this.

Full output from the cell before crashing:

GPU available: True, used: True
GPU available: True, used: True
GPU available: True, used: True
GPU available: True, used: True
GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type     | Params
0 | model     | CNN      | 1.7 M
1 | train_acc | Accuracy | 0
2 | val_acc   | Accuracy | 0
3 | test_acc  | Accuracy | 0

1.7 M     Trainable params
0         Non-trainable params
1.7 M     Total params

  | Name      | Type     | Params
0 | model     | CNN      | 1.7 M
1 | train_acc | Accuracy | 0
2 | val_acc   | Accuracy | 0
3 | test_acc  | Accuracy | 0

1.7 M     Trainable params
0         Non-trainable params
1.7 M     Total params

  | Name      | Type     | Params
0 | model     | CNN      | 1.7 M
1 | train_acc | Accuracy | 0
2 | val_acc   | Accuracy | 0
3 | test_acc  | Accuracy | 0

1.7 M     Trainable params
0         Non-trainable params
1.7 M     Total params

  | Name      | Type     | Params
0 | model     | CNN      | 1.7 M
1 | train_acc | Accuracy | 0
2 | val_acc   | Accuracy | 0
3 | test_acc  | Accuracy | 0

1.7 M     Trainable params
0         Non-trainable params
1.7 M     Total params

  | Name      | Type     | Params
0 | model     | CNN      | 1.7 M
1 | train_acc | Accuracy | 0
2 | val_acc   | Accuracy | 0
3 | test_acc  | Accuracy | 0

1.7 M     Trainable params
0         Non-trainable params
1.7 M     Total params

  | Name      | Type     | Params
0 | model     | CNN      | 1.7 M
1 | train_acc | Accuracy | 0
2 | val_acc   | Accuracy | 0
3 | test_acc  | Accuracy | 0

1.7 M     Trainable params
0         Non-trainable params
1.7 M     Total params

GavinR1 commented 3 years ago

I'm closing this because I don't think it's a code issue; something strange seems to be going on with my Colab session. I loaded everything into a fresh notebook, and this time the output was duplicated twice instead of six times.

I fixed the crashing by adding progress_bar_refresh_rate=50 to the pl.Trainer(...) call.
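A rough sketch of why a larger refresh rate may help, assuming the crash comes from the browser being flooded with progress-bar redraws (an assumption, not something confirmed in this thread). The helper `count_progress_redraws` is hypothetical and only counts redraws; it does not call PyTorch Lightning, and the batch count is made up for illustration.

```python
def count_progress_redraws(num_batches: int, refresh_rate: int) -> int:
    """Count redraws for one epoch if the bar redraws every `refresh_rate` batches.

    Hypothetical helper for illustration only. It assumes one extra redraw
    when the last batch is not a multiple of `refresh_rate`, and that a
    rate of 0 (or less) means the bar is disabled.
    """
    if refresh_rate <= 0:
        return 0
    redraws = num_batches // refresh_rate
    if num_batches % refresh_rate != 0:
        redraws += 1  # final partial interval still triggers a redraw
    return redraws


if __name__ == "__main__":
    batches_per_epoch = 1000  # made-up number, not measured from the lab
    for rate in (1, 50):
        print(f"refresh_rate={rate}: {count_progress_redraws(batches_per_epoch, rate)} redraws")

# In the notebook, the change described above would look like:
# trainer = pl.Trainer(gpus=1, max_epochs=5, progress_bar_refresh_rate=50)
```

If the output really is being emitted several times over, each redraw is multiplied by the number of duplicates, so cutting the redraw count matters even more.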

Also, to run this notebook in Colab I needed to edit line 25 of /usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py to read from torchtext.legacy.data import Batch rather than from torchtext.data import Batch.
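For code you control, a version-tolerant import is one alternative to editing an installed file. The sketch below uses a hypothetical helper, `import_first_available`, that tries module paths in order and returns the first match. The two torchtext paths are the ones mentioned above; whether either provides `Batch` depends on the installed torchtext version, so that call is left commented out.

```python
import importlib
from typing import Any, Sequence


def import_first_available(module_paths: Sequence[str], attr: str) -> Any:
    """Return `attr` from the first module in `module_paths` that provides it.

    Hypothetical helper for illustration; raises ImportError if no
    candidate module can be imported or none of them defines `attr`.
    """
    errors = []
    for path in module_paths:
        try:
            module = importlib.import_module(path)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{path}: {exc}")
    raise ImportError(f"could not import {attr!r}; tried: " + "; ".join(errors))


# Usage in the context of this issue (commented out because it needs torchtext):
# Batch = import_first_available(["torchtext.legacy.data", "torchtext.data"], "Batch")
```

This does not patch PyTorch Lightning itself; it only shows the fallback pattern for your own imports.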