rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Colab Error #237

Open · ther3zz opened 1 year ago

ther3zz commented 1 year ago

Hi, I'm trying to run training on Google Colab and am running into the following error at the training step. Does anyone have an idea what's going on here? It seems like the CUDA version in use is not compatible with PyTorch.

```
DEBUG:piper_train:Namespace(dataset_dir='/content/drive/MyDrive/colab/piper2/sjmodelv3', checkpoint_epochs=5, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=10000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=1000, accelerator='gpu', strategy=None, sync_batchnorm=False, precision=32, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/content/pretrained.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=32, validation_split=0.01, num_test_examples=2, max_phoneme_ids=None, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234)
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:52: LightningDeprecationWarning: Setting `Trainer(resume_from_checkpoint=)` is deprecated in v1.5 and will be removed in v1.7. Please pass `Trainer.fit(ckpt_path=)` directly instead.
  rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
DEBUG:piper_train:Checkpoints will be saved every 5 epoch(s)
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp0poytl5h
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmp0poytl5h/_remote_module_non_sriptable.py
2023-10-13 00:21:49.005851: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-13 00:21:49.059804: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
DEBUG:tensorflow:Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2023-10-13 00:21:50.129053: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:jaxlib.mlir._mlir_libs:Initializing MLIR with module: _site_initialize_0
DEBUG:jaxlib.mlir._mlir_libs:Registering dialects from initializer <module 'jaxlib.mlir._mlir_libs._site_initialize_0' from '/usr/local/lib/python3.10/dist-packages/jaxlib/mlir/_mlir_libs/_site_initialize_0.so'>
DEBUG:jax._src.xla_bridge:No jax_plugins namespace packages available
DEBUG:jax._src.path:etils.epath found. Using etils.epath for file I/O.
INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
DEBUG:vits.dataset:Loading dataset: /content/drive/MyDrive/colab/piper2/sjmodelv3/dataset.jsonl
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:737: LightningDeprecationWarning: `trainer.resume_from_checkpoint` is deprecated in v1.5 and will be removed in v2.0. Specify the fit checkpoint path with `trainer.fit(ckpt_path=)` instead.
  ckpt_path = ckpt_path or self.resume_from_checkpoint
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:145: UserWarning: NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Restoring states from the checkpoint path at /content/pretrained.ckpt
DEBUG:fsspec.local:open file: /content/pretrained.ckpt
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:1658: UserWarning: Be aware that when using `ckpt_path`, callbacks used to create the checkpoint need to be provided during `Trainer` instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 10, 'train_time_interval': None}"].
  rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
DEBUG:fsspec.local:open file: /content/drive/MyDrive/colab/piper2/sjmodelv3/lightning_logs/version_1/hparams.yaml
Restored all states from the checkpoint file at /content/pretrained.ckpt
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/piper/src/python/piper_train/__main__.py", line 152, in <module>
    main()
  File "/content/piper/src/python/piper_train/__main__.py", line 129, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 700, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 654, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 741, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _run_train
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1335, in _run_sanity_check
    val_loop._reload_evaluation_dataloaders()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 237, in _reload_evaluation_dataloaders
    self.trainer.reset_val_dataloader()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1923, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 378, in _reset_eval_dataloader
    len(dataloader) if has_len_all_ranks(dataloader, self.trainer.strategy, module) else float("inf")
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py", line 140, in has_len_all_ranks
    if total_length == 0:
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
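The decisive lines are the sm_80 `UserWarning` and the final `RuntimeError`: the preinstalled PyTorch wheel was built without kernels for the A100's compute capability. A quick way to confirm this from a Colab cell (a minimal sketch using standard torch APIs; the printed values will vary by runtime):

```python
# Minimal sketch: check whether the installed torch wheel includes
# kernels for the attached GPU's compute capability.
import torch

print(torch.__version__)                    # installed wheel version
print(torch.cuda.get_device_name(0))        # e.g. NVIDIA A100-SXM4-40GB
print(torch.cuda.get_device_capability(0))  # (8, 0) means sm_80
print(torch.cuda.get_arch_list())           # sm_* targets the wheel was built for
```

If the capability reported by `get_device_capability` is missing from `get_arch_list`, any CUDA kernel launch will fail with the "no kernel image is available" error above.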

rmcpantoja commented 1 year ago

Hi, this is a good point:

> NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.

I can try changing to PyTorch 1.13, but I don't know if there are breaking changes. I'll let you know here, and if so you can try training again using my fork (#177), which contains the updated notebooks.
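In the meantime, a manual workaround is to swap in a torch 1.13.1 wheel built with sm_80 support. Something like the following Colab cell should do it (the `+cu117` build is an assumption; pick whichever CUDA build matches the runtime):

```python
# Colab cell: the leading "!" runs a shell command in the runtime.
# The cu117 build of torch 1.13.1 includes sm_80 kernels, which the A100 needs.
!pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
```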

ther3zz commented 1 year ago

> Hi, this is a good point:
>
> > NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
>
> I can try changing to PyTorch 1.13, but I don't know if there are breaking changes. I'll let you know here, and if so you can try training again using my fork (#177), which contains the updated notebooks.

Thank you so much!

rmcpantoja commented 11 months ago

Hi @ther3zz, the notebook has been updated and now uses torch==1.13.1. It works very well in free Colab. I hope it works on your A100 GPU 😀
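A quick way to verify the fix on the upgraded runtime (a sanity-check sketch; any small CUDA op would do):

```python
# Sanity check: a tiny CUDA op fails fast with "no kernel image is
# available" if the wheel still lacks kernels for the attached GPU.
import torch

assert torch.cuda.is_available()
x = torch.randn(4, 4, device="cuda")
print((x @ x).sum().item())  # succeeds only if sm_80 kernels are present
print(torch.__version__, torch.cuda.get_device_name(0))
```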