mlcommons / GaNDLF

A generalizable application framework for segmentation, regression, and classification using PyTorch
https://gandlf.org
Apache License 2.0

/bin/sh: 1: qsub: not found #27

Closed Karol-G closed 3 years ago

Karol-G commented 3 years ago

Hi Sarthak,

When I try to train on the toy dataset with samples/config_classification.yaml, I get the error /bin/sh: 1: qsub: not found. I believe this originates from 'parallel_compute_command' in the config. I am using the newest pull of the GaNDLF-refactor branch on Linux.
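Just to confirm the environment on my side: qsub (the SGE submission tool) is not installed on this machine at all, so I assume the command assembled from parallel_compute_command is handed to the shell and fails exactly like this. A minimal check, independent of GaNDLF and using only the Python standard library:

# check_qsub.py -- confirm that the SGE 'qsub' binary is not on the PATH
# (standard library only; not GaNDLF code)
import shutil
import subprocess

print(shutil.which("qsub"))  # prints None when SGE is not installed

# handing a qsub-prefixed command to the shell reproduces the message from the log:
subprocess.run("qsub --help", shell=True)  # -> /bin/sh: 1: qsub: not found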

The train command:

python gandlf_run -config ./experiments/2d_classification/model.yaml -data ./experiments/2d_classification/train.csv -output ./experiments/2d_classification/output_dir/ -train 1 -device cuda

Full error log:

Submitting job for testing split 0 and validation split 0
/bin/sh: 1: qsub: not found
Submitting job for testing split 0 and validation split 1
/bin/sh: 1: qsub: not found
Submitting job for testing split 0 and validation split 2
/bin/sh: 1: qsub: not found
Submitting job for testing split 0 and validation split 3
/bin/sh: 1: qsub: not found
Submitting job for testing split 0 and validation split 4
/bin/sh: 1: qsub: not found
Submitting job for testing split 1 and validation split 0
/bin/sh: 1: qsub: not found
Submitting job for testing split 1 and validation split 1
/bin/sh: 1: qsub: not found
Submitting job for testing split 1 and validation split 2
/bin/sh: 1: qsub: not found
Submitting job for testing split 1 and validation split 3
/bin/sh: 1: qsub: not found
Submitting job for testing split 1 and validation split 4
/bin/sh: 1: qsub: not found
Submitting job for testing split 2 and validation split 0
/bin/sh: 1: qsub: not found
Submitting job for testing split 2 and validation split 1
/bin/sh: 1: qsub: not found
Submitting job for testing split 2 and validation split 2
/bin/sh: 1: qsub: not found
Submitting job for testing split 2 and validation split 3
/bin/sh: 1: qsub: not found
Submitting job for testing split 2 and validation split 4
/bin/sh: 1: qsub: not found
Submitting job for testing split 3 and validation split 0
/bin/sh: 1: qsub: not found
Submitting job for testing split 3 and validation split 1
/bin/sh: 1: qsub: not found
Submitting job for testing split 3 and validation split 2
/bin/sh: 1: qsub: not found
Submitting job for testing split 3 and validation split 3
/bin/sh: 1: qsub: not found
Submitting job for testing split 3 and validation split 4
/bin/sh: 1: qsub: not found
Submitting job for testing split 4 and validation split 0
/bin/sh: 1: qsub: not found
Submitting job for testing split 4 and validation split 1
/bin/sh: 1: qsub: not found
Submitting job for testing split 4 and validation split 2
/bin/sh: 1: qsub: not found
Submitting job for testing split 4 and validation split 3
/bin/sh: 1: qsub: not found
Submitting job for testing split 4 and validation split 4
/bin/sh: 1: qsub: not found

This is the model.yaml (which is the samples/config_classification.yaml):

# affix version
version:
  {
    minimum: 0.0.8,
    maximum: 0.0.8 # this should NOT be made a variable, but should be tested after every tag is created
  }
# Choose the model parameters here
model:
  {
    dimension: 3, # the dimension of the model and dataset: defines dimensionality of computations
    base_filters: 30, # Set base filters: number of filters present in the initial module of the U-Net convolution; for IncU-Net, keep this divisible by 4
    architecture: vgg16, # options: unet, resunet, fcn, uinc, vgg, densenet
    batch_norm: True, # this is only used for vgg
    final_layer: None, # can be either sigmoid, softmax or none (none == regression)
    amp: False, # Set if you want to use Automatic Mixed Precision for your operations or not - options: True, False
    n_channels: 3, # set the input channels - useful when reading RGB or images that have vectored pixel types
  }
# this is to enable or disable lazy loading - setting to true reads all data once during data loading, resulting in improvements
# in I/O at the expense of memory consumption
in_memory: False
# this will save the generated masks for validation and testing data for qualitative analysis
save_masks: False
# Set the Modality : rad for radiology, path for histopathology
modality: rad
# Patch size during training - 2D patch for breast images since third dimension is not patched 
patch_size: [64,64,64]
# uniform: UniformSampler or label: LabelSampler
patch_sampler: uniform
# Number of epochs
num_epochs: 100
# Set the patience - measured in number of epochs after which, if the performance metric does not improve, exit the training loop - defaults to the number of epochs
patience: 50
# Set the batch size
batch_size: 1
# Set the initial learning rate
learning_rate: 0.001
# Learning rate scheduler - options: triangle, triangle_modified, exp, reduce-on-lr, step, more to come soon - default hyperparameters can be changed thru code
scheduler: triangle
# Set which loss function you want to use - options : 'dc' - for dice only, 'dcce' - for sum of dice and CE and you can guess the next (only lower-case please)
# options: dc (dice only), dc_log (-log of dice), ce (), dcce (sum of dice and ce), mse () ...
# mse is the MSE defined by torch and can define a variable 'reduction'; see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss
# use mse_torch for regression/classification problems and dice for segmentation
loss_function: mse
# this parameter weights the loss to handle imbalanced losses better
weighted_loss: True 
#loss_function:
#  {
#    'mse':{
#      'reduction': 'mean' # see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss for all options
#    }
#  }
# Which optimizer do you want to use - adam/sgd
opt: adam
# this parameter controls the nested training process
# performs randomized k-fold cross-validation
# split is performed using sklearn's KFold method
# for single fold run, use '-' before the fold number
nested_training:
  {
    testing: 5, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled
    validation: 5 # this controls the validation data splits for model training
  }
## pre-processing
# this constructs an order of transformations, which is applied to all images in the data loader
# order: resize --> threshold/clip --> resample --> normalize
# 'threshold': performs intensity thresholding; i.e., if x[i] < min: x[i] = 0; and if x[i] > max: x[i] = 0
# 'clip': performs intensity clipping; i.e., if x[i] < min: x[i] = min; and if x[i] > max: x[i] = max
# 'threshold'/'clip': if either min/max is not defined, it is taken as the minimum/maximum of the image, respectively
# 'normalize': performs z-score normalization: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.ZNormalization
# 'normalize_nonZero': perform z-score normalize but with mean and std-dev calculated on only non-zero pixels
# 'normalize_nonZero_masked': perform z-score normalize but with mean and std-dev calculated on only non-zero pixels with the stats applied on non-zero pixels
# 'crop_external_zero_planes': crops all external zero-valued planes from the input tensor to reduce the image search space
# 'resample: resolution: X,Y,Z': resample the voxel resolution: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.Resample
# 'resample: resolution: X': resample the voxel resolution in an isotropic manner: https://torchio.readthedocs.io/transforms/preprocessing.html?highlight=ToCanonical#torchio.transforms.Resample
# resize the image(s) and mask (this should be greater than or equal to patch_size); resize is done ONLY when resample is not defined
data_preprocessing:
  {
    'normalize',
    # 'normalize_nonZero', # this performs z-score normalization only on non-zero pixels
    'resample':{
      'resolution': [1,2,3]
    },
    #'resize': [128,128], # this is generally not recommended, as it changes image properties in unexpected ways
    'crop_external_zero_planes', # this will crop all zero-valued planes across all axes
  }
# various data augmentation techniques
# options: affine, elastic, downsample, motion, ghosting, bias, blur, gaussianNoise, swap
# keep/edit as needed
# all transforms: https://torchio.readthedocs.io/transforms/transforms.html?highlight=transforms
# 'kspace': one of motion, ghosting or spiking is picked (randomly) for augmentation
# 'probability' subkey adds the probability of the particular augmentation getting added during training (this is always 1 for normalize and resampling)
data_augmentation: 
  {
    default_probability: 0.5,
    'affine',
    'elastic',
    'kspace':{
      'probability': 1
    },
    'bias',
    'blur': {
      'std': [0, 1] # default std-dev range, for details, see https://torchio.readthedocs.io/transforms/augmentation.html?highlight=randomblur#torchio.transforms.RandomBlur
    },
    'noise': { # for details, see https://torchio.readthedocs.io/transforms/augmentation.html?highlight=randomblur#torchio.transforms.RandomNoise
      'mean': 0, # default mean
      'std': [0, 1] # default std-dev range
    },
    'anisotropic':{
      'axis': [0,1],
      'downsampling': [2,2.5]
    },
  }
# parallel training on HPC - here goes the command to prepend to send to a high performance computing
# cluster for parallel computing during multi-fold training
# not used for single fold training
# this gets passed before the training_loop, so ensure enough memory is provided along with other parameters
# that your HPC would expect
# ${outputDir} will be changed to the outputDir you pass in CLI + '/${fold_number}'
# ensure that the correct location of the virtual environment is getting invoked, otherwise it would pick up the system python, which might not have all dependencies
parallel_compute_command: 'qsub -b y -l gpu -l h_vmem=32G -cwd -o ${outputDir}/\$JOB_ID.stdout -e ${outputDir}/\$JOB_ID.stderr `pwd`/sge_wrapper _correct_location_of_virtual_environment_/venv/bin/python'
## queue configuration - https://torchio.readthedocs.io/data/patch_training.html?#queue
# this determines the maximum number of patches that can be stored in the queue. Using a large number means that the queue needs to be filled less often, but more CPU memory is needed to store the patches
q_max_length: 40
# this determines the number of patches to extract from each volume. A small number of patches ensures a large variability in the queue, but training will be slower
q_samples_per_volume: 5
# this determines the number of subprocesses to use for data loading; '0' means the main process is used
q_num_workers: 2 # scale this according to available CPU resources
# used for debugging
q_verbose: False

Best Karol

Geeks-Sid commented 3 years ago

Hi @Karol-G, thanks for using GaNDLF.

It would be great if you could comment out this line:

#parallel_compute_command: 'qsub -b y -l gpu -l h_vmem=32G -cwd -o ${outputDir}/\$JOB_ID.stdout -e ${outputDir}/\$JOB_ID.stderr `pwd`/sge_wrapper _correct_location_of_virtual_environment_/venv/bin/python'

so that it doesn't attempt to submit parallel compute jobs. Similarly, I would also recommend disabling folded validation as follows:

nested_training:
  {
    testing: 1, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled
    validation: -5 # this controls the validation data splits for model training
  }

This way, training will run just once (a single fold).
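If it helps, here is a quick sanity check for the edited model.yaml before launching training again (just a sketch; it assumes PyYAML is available and is not GaNDLF-specific code):

# sanity_check_config.py -- quick standalone check of the edited model.yaml
# (illustrative sketch only; not part of GaNDLF)
import yaml

with open("./experiments/2d_classification/model.yaml") as f:
    cfg = yaml.safe_load(f)

# parallel_compute_command should be absent (or empty) so nothing is sent to qsub
print("parallel_compute_command:", cfg.get("parallel_compute_command", "<not set>"))

# testing: 1 disables the testing splits, validation: -5 runs a single fold
print("nested_training:", cfg.get("nested_training"))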

Karol-G commented 3 years ago

I commented out parallel_compute_command, and I commented out testing and validation instead of the whole nested_training block (otherwise I get the error The parameter 'nested_training' needs to be defined). But now I get the following error:

Using default folds for testing split:  -5
Using default folds for validation split:  -5
Using previously saved parameter file ./experiments/2d_classification/output_dir/parameters.pkl
Traceback (most recent call last):
  File "gandlf_run", line 75, in <module>
    main()
  File "gandlf_run", line 70, in main
    TrainingManager(dataframe=data_full, headers = headers, outputDir=model_path, parameters=parameters, device=device, reset_prev = reset_prev)
  File "/content/GaNDLF-refactor/GANDLF/training_manager.py", line 146, in TrainingManager
    device=device, params=parameters, testing_data=testingData)
  File "/content/GaNDLF-refactor/GANDLF/training_loop.py", line 322, in training_loop
    metrics = params["metrics"]
KeyError: 'metrics'

Seems a metric is missing. How do I define it for a classification task?

sarthakpati commented 3 years ago

Hi @Karol-G,

You can add a metrics key to the parameter file with the value ['mse'] for classification. I will update the testing config to make this clearer.

Cheers, Sarthak

Karol-G commented 3 years ago

Hmm, I still get the same error with:

metrics:
  - mse

or: metrics: ['mse']

sarthakpati commented 3 years ago

Ah, I think you need to try this:

python gandlf_run -config ./experiments/2d_classification/model.yaml -data ./experiments/2d_classification/train.csv -output ./experiments/2d_classification/output_dir/ -train 1 -device cuda \
-reset_prev True # this will remove all writes to disk (such as training/validation data and parameters) from previous run

Karol-G commented 3 years ago

Ah yes that fixed it, thanks!