vkinakh / mv-mr

Official implementation of the paper "MV-MR: multi-views and multi-representations for self-supervised learning and knowledge distillation"

Error when training the model #1

Open · Kyle-fang opened this issue 9 months ago

Kyle-fang commented 9 months ago

python main_self_supervised.py --config configs\stl10_self_supervised.yaml

E:\Anaconda\envs\mv_mr\lib\site-packages\torchvision\models\_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
E:\Anaconda\envs\mv_mr\lib\site-packages\torchvision\models\_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=None.
  warnings.warn(msg)
E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\connectors\accelerator_connector.py:898: UserWarning: You are running on single node with no parallelization, so distributed has no effect.
  rank_zero_warn("You are running on single node with no parallelization, so distributed has no effect.")
E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\connectors\accelerator_connector.py:658: UserWarning: You passed Trainer(accelerator='cpu', precision=16) but native AMP is not supported on CPU. Using precision='bf16' instead.
  rank_zero_warn(
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:291: LightningDeprecationWarning: Base Callback.on_train_batch_end hook signature has changed in v1.5. The dataloader_idx argument will be removed in v1.7.
  rank_zero_deprecation(
Files already downloaded and verified
Files already downloaded and verified
Missing logger folder: lightning_logs\2023-09-19-15-52-00_stl10_self_supervised_scale_z
Files already downloaded and verified
Files already downloaded and verified
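(Aside: the first two warnings only concern the torchvision change from the pretrained argument to the weights argument and are unrelated to the crash. For reference, a hedged sketch of the new-style call, using ResNet-50 purely as an illustration, not as the repo's actual backbone construction:)

```python
import torchvision.models as models

# Old style, which triggers the UserWarning above:
#   backbone = models.resnet50(pretrained=False)
# New style since torchvision 0.13: pass a weights enum, or None for random init.
backbone = models.resnet50(weights=None)
# backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)  # pretrained weights
```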

  | Name              | Type                | Params
0 | _encoder          | ResnetMultiProj     | 174 M
1 | _loss_dc          | DistanceCorrelation | 0
2 | _scatnet          | Scattering2D        | 0
3 | _hog              | HOGLayer            | 0
4 | _identity         | Identity            | 0
5 | online_finetuner  | Linear              | 20.5 K

174 M     Trainable params
0         Non-trainable params
174 M     Total params
698.194   Total estimated model params size (MB)

Validation sanity check: 0it [00:00, ?it/s]
Files already downloaded and verified
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "F:\Fangweijie\mv-mr-main\main_self_supervised.py", line 92, in <module>
    main(args)
  File "F:\Fangweijie\mv-mr-main\main_self_supervised.py", line 74, in main
    trainer.fit(module, ckpt_path=path_checkpoint)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1199, in _run
    self._dispatch()
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1289, in run_stage
    return self._run_train()
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1311, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1375, in _run_sanity_check
    self._evaluation_loop.run()
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py", line 217, in _evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 239, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 219, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "F:\Fangweijie\mv-mr-main\src\model\self_supervised_module.py", line 277, in validation_step
    return self.step(batch, batch_idx, stage='val')
  File "F:\Fangweijie\mv-mr-main\src\model\self_supervised_module.py", line 260, in step
    loss_dc = self._step_dc(im_orig, z_scaled, representation)
  File "F:\Fangweijie\mv-mr-main\src\model\self_supervised_module.py", line 194, in _step_dc
    im_orig_hog = self._hog(im_orig_r)
  File "E:\Anaconda\envs\mv_mr\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\Fangweijie\mv-mr-main\src\model\hog.py", line 39, in forward
    out.scatter_(1, phase_int.floor().long() % self.nbins, norm)
RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype

Kyle-fang commented 9 months ago

Here is my configuration file:

batch_size: 64              # batch size for training
epochs: 100                 # number of epochs to train for
warmup_epochs: 10           # number of warmup epochs (without learning rate decay)
log_every: 100              # frequency of logging (steps)
eval_every: 1               # frequency of evaluating on val set (epochs)
n_workers: 8                # number of workers for dataloader
fp16: True                  # whether to use fp16 precision
accumulate_grad_batches: 1  # number of accumulation steps
normalize_z: False          # whether to normalize z in encoder

fine_tune_from:             # path to pre-trained model to fine-tune from

std_margin: 1               # margin for the std values in the loss

optimizer: adam             # optimizer to use. Options: adam, adamw
wd: 1e-6                    # weight decay
lr: 1e-4                    # learning rate
scheduler: warmup_cosine    # type of scheduler to use. Options: cosine, multistep, warmup_cosine

dataset:                    # dataset parameters
  name: stl10               # dataset name
  size: 96                  # image size
  n_classes: 10             # number of classes
  path:                     # path to dataset (leave empty)
  aug_policy: custom        # augmentation policy. Choices: autoaugment, randaugment, custom

scatnet:                    # scatnet parameters
  J: 2                      # number of scales
  shape:                    # shape of the input image

encoder_type: resnet        # choices: resnet, deit
encoder:                    # encoder parameters
  out_dim: 8192-8192-8192   # number of neurons in the projection layer
  small_kernel: True        # whether to use small kernels. Small kernels are used for STL10 dataset

hog:                        # histogram of oriented gradients parameters
  nbins: 24                 # number of bins
  pool: 8                   # pooling size

comment: stl10_self_supervised_scale_z
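For reference, a minimal Python sketch for reading this YAML back and checking the values that matter for this issue; the config path is taken from the command above, and this is not the repo's own loader:

```python
import yaml

# Load the training config and inspect the settings relevant to this issue.
with open("configs/stl10_self_supervised.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["fp16"])                                     # True -> mixed precision requested
print(cfg["batch_size"], cfg["epochs"])                # 64 100
print(cfg["dataset"]["name"], cfg["dataset"]["size"])  # stl10 96
```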

Kyle-fang commented 9 months ago

The dataset has also been set up!

Untitled (attached screenshot)

vkinakh commented 9 months ago

You are using 16-bit precision while training on the CPU. I do not think that is possible. Try changing the precision to 32.
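A hedged sketch of that change, assuming the fp16 flag in configs/stl10_self_supervised.yaml is what feeds the Lightning Trainer precision argument (set fp16: False there, or force full precision in code):

```python
from pytorch_lightning import Trainer

# Hypothetical override: full 32-bit precision, which CPU training supports reliably.
# Equivalent to setting fp16: False in configs/stl10_self_supervised.yaml, assuming that
# flag is what the training script passes to the Trainer.
trainer = Trainer(accelerator="cpu", precision=32)
```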