Hi @Karol-G, thanks for using GaNDLF.
It would be great if you could comment out this line:
#parallel_compute_command: 'qsub -b y -l gpu -l h_vmem=32G -cwd -o ${outputDir}/\$JOB_ID.stdout -e ${outputDir}/\$JOB_ID.stderr `pwd`/sge_wrapper _correct_location_of_virtual_environment_/venv/bin/python'
so that it doesn't attempt to do parallel compute. Similarly, I would also recommend disabling folded validation as:
nested_training:
{
testing: 1, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled
validation: -5 # this controls the validation data splits for model training
}
so that it runs only once.
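Putting both suggestions together, the relevant part of samples/config_classification.yaml would look roughly like this (a sketch; every other key stays as it is):
# scheduler submission disabled by commenting out the line below
# parallel_compute_command: 'qsub ...'   # i.e. the full line shown above
nested_training:
{
  testing: 1, # '1' disables the testing data splits for final model evaluation
  validation: -5 # controls the validation data splits for model training
}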
I commented out parallel_compute_command, and I commented out testing and validation instead of nested_training (otherwise I get the error The parameter 'nested_training' needs to be defined).
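In other words, that part of my model.yaml now looks roughly like this:
# parallel_compute_command: 'qsub ...'   # commented out
nested_training:
{
  # testing: 1,
  # validation: -5
}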
But now I get the following error:
Using default folds for testing split: -5
Using default folds for validation split: -5
Using previously saved parameter file ./experiments/2d_classification/output_dir/parameters.pkl
Traceback (most recent call last):
File "gandlf_run", line 75, in <module>
main()
File "gandlf_run", line 70, in main
TrainingManager(dataframe=data_full, headers = headers, outputDir=model_path, parameters=parameters, device=device, reset_prev = reset_prev)
File "/content/GaNDLF-refactor/GANDLF/training_manager.py", line 146, in TrainingManager
device=device, params=parameters, testing_data=testingData)
File "/content/GaNDLF-refactor/GANDLF/training_loop.py", line 322, in training_loop
metrics = params["metrics"]
KeyError: 'metrics'
It seems a metric is missing. How do I define one for a classification task?
Hi @Karol-G,
You can add a key metrics in the parameter file with the value ['mse'] for classification. I will update the testing config to make this clearer.
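For example, somewhere at the top level of the parameter file (a sketch; exact placement should not matter):
metrics: ['mse'] # metric(s) to compute; 'mse' for this classification example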
Cheers, Sarthak
Hmm, I still get the same error with:
metrics:
- mse
or:
metrics: ['mse']
Ah, I think the previously saved parameters.pkl from your earlier run is being picked up instead of your updated config. You need to try with this:
python gandlf_run -config ./experiments/2d_classification/model.yaml -data ./experiments/2d_classification/train.csv -output ./experiments/2d_classification/output_dir/ -train 1 -device cuda \
-reset_prev True # this will remove all writes to disk (such as training/validation data and parameters) from previous run
Ah yes that fixed it, thanks!
Hi Sarthak,
when I try to train on the toy dataset with the samples/config_classification.yaml, I get the error /bin/sh: 1: qsub: not found. I believe this originates from parallel_compute_command in the config. I am using the newest pull of GaNDLF-refactor and am running on Linux. The train command:
python gandlf_run -config ./experiments/2d_classification/model.yaml -data ./experiments/2d_classification/train.csv -output ./experiments/2d_classification/output_dir/ -train 1 -device cuda
Full error log:
This is the model.yaml (which is the samples/config_classification.yaml):
Best,
Karol