rmaphoh / RETFound_MAE

RETFound - A foundation model for retinal image

About an issue after running at the terminal #15

Closed DavinciWu closed 7 months ago

DavinciWu commented 8 months ago

When I run the code with this command:

python -m torch.distributed.launch --nproc_per_node=1 --master_port=48798 main_finetune.py --batch_size 16 --world_size 1 --model vit_large_patch16 --epochs 50 --blr 5e-3 --layer_decay 0.65 --weight_decay 0.05 --drop_path 0.2 --nb_classes 5 --data_path ./Task1/ --task ./finetune_IDRiD/ --finetune ./RETFound_cfp_weights.pth --input_size 224

it first prints a lot of information, including the parameters and the model architecture. After that, it raises an error:

raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/wuwentao/anaconda3/envs/retfound/bin/python', '-u', 'main_finetune.py', '--local_rank=0', '--batch_size', '16', '--world_size', '1', '--model', 'vit_large_patch16', '--epochs', '50', '--blr', '5e-3', '--layer_decay', '0.65', '--weight_decay', '0.05', '--drop_path', '0.2', '--nb_classes', '5', '--data_path', './Task1/', '--task', './finetune_IDRiD/', '--finetune', './RETFound_cfp_weights.pth', '--input_size', '224']' returned non-zero exit status 1.

DavinciWu commented 8 months ago

I've been able to solve the above problem, but here's another question: why does the output _metrics_test.csv file contain only one line of data, unlike _metrics_val.csv, which contains one line per epoch? And what does the metric_logger.loss in it represent? Looking forward to your answer, thanks!

KevinZ4 commented 8 months ago

I got the same error:

raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
command[......] return non-zero exit status 1.

Could you please tell me how to solve this problem?
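For context on this error: subprocess.CalledProcessError only reports that the launched script exited with a non-zero status; the underlying cause is whatever the child process printed before exiting. The thread does not say what the cause was in this case. The following is a minimal Python sketch of that behaviour, not the RETFound code: child_code is a hypothetical stand-in for main_finetune.py, and run_child is an illustrative helper.

```python
import subprocess
import sys

# Hypothetical child program that fails; it stands in for main_finetune.py.
child_code = "raise RuntimeError('the real cause is reported here')"


def run_child():
    """Run the child and return (exit status, child's stderr text)."""
    try:
        subprocess.run(
            [sys.executable, "-c", child_code],
            check=True,           # raise CalledProcessError on non-zero exit
            capture_output=True,  # keep the child's own output for inspection
            text=True,
        )
    except subprocess.CalledProcessError as err:
        # The exception carries only the command and the exit status;
        # the actual traceback lives in the child's stderr.
        return err.returncode, err.stderr
    return 0, ""


returncode, child_stderr = run_child()
print("exit status:", returncode)
print(child_stderr)
```

In practice this means scrolling up in the terminal output to the traceback printed by main_finetune.py itself, above the CalledProcessError lines.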

rmaphoh commented 8 months ago

I've been able to solve the above problem, but here's another question: why does the output _metrics_test.csv file contain only one line of data, unlike _metrics_val.csv, which contains one line per epoch? And what does the metric_logger.loss in it represent? Looking forward to your answer, thanks!

The _metrics_val.csv file records the validation performance for each epoch during training, while _metrics_test.csv contains the test performance of the best checkpoint only. The metric_logger.loss value is the loss on the validation set (in _metrics_val.csv) or on the test set (in _metrics_test.csv).
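The difference in row counts can be illustrated with a small Python sketch. This is not the code in main_finetune.py: the function collect_metric_rows, the loss values, and the choice of "lowest validation loss" as the criterion for the best checkpoint are all assumptions made for illustration.

```python
def collect_metric_rows(val_losses, test_losses):
    """Return (val_rows, test_rows) for a hypothetical training run.

    val_losses[i]  : validation loss after epoch i
    test_losses[i] : test loss the checkpoint from epoch i would give
    """
    val_rows = []
    best_epoch, best_val = None, float("inf")
    for epoch, val_loss in enumerate(val_losses):
        # One validation row is logged every epoch.
        val_rows.append({"epoch": epoch, "loss": val_loss})
        # Assumed criterion: keep the checkpoint with the lowest val loss.
        if val_loss < best_val:
            best_epoch, best_val = epoch, val_loss
    # The test set is evaluated once, with the best checkpoint only.
    test_rows = [{"epoch": best_epoch, "loss": test_losses[best_epoch]}]
    return val_rows, test_rows


# Made-up numbers for a three-epoch run.
val_rows, test_rows = collect_metric_rows([0.9, 0.6, 0.7], [0.95, 0.65, 0.72])
print(len(val_rows), "validation rows;", len(test_rows), "test row")
```

With three epochs, the sketch produces three validation rows and a single test row, which mirrors the one-line _metrics_test.csv described above.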