openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

mhcflurry-class1-select-allele-specific-models gets stuck at 60% #140

Closed haoyangz closed 5 years ago

haoyangz commented 5 years ago

I am trying to fit my own model, and the model selection step (command below) gets stuck at about 60% every time. While it is stuck, the mhcflurry-class processes use very little CPU and no GPU at all.

When training on a small dataset (10k entries), the stall is tolerable (20 min), and the run eventually moves on slowly and finishes. On a 40k-entry dataset, however, it has been stuck for more than 4 hours (still waiting as of now). Any idea what's going on?

My command:

time CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES mhcflurry-class1-select-allele-specific-models \
    --data valid.csv \
    --models-dir models \
    --out-models-dir selected-models \
    --scoring combined:mse,consensus \
    --consensus-num-peptides-per-length 10000 \
    --combined-min-models 8 \
    --combined-max-models 16 \
    --num-jobs $(expr $PROCESSORS \* 2) \
    --gpus $N_GPU \
    --max-workers-per-gpu 2 \
    --max-tasks-per-worker 5

The stdout when it gets stuck:

 66%|######5   | 82/125 [47:59<02:08,  3.00s/it]{'allele': 'HLA-A*80:01',
 'num_models': 16,
 'selected': <mhcflurry.class1_affinity_predictor.Class1AffinityPredictor object at 0x7f3e56fdbc50>,
 'selector_score_plan': 'mse (95 points)(|95.000|), consensus (80000 points)(|10.000|)',
 'unselected_accuracy_score_percentile': 100.0,
 'unselected_combined_score_terms': '[92.36553659928902]',
 'unselected_score': 92.36553659928902,
 'unselected_score_AUC@15000': 0.9385171790235082,
 'unselected_score_AUC@500': 0.9710843373493976,
 'unselected_score_AUC@5000': 0.9451923076923077,
 'unselected_score_MSE': 0.027731193691694633,
 'unselected_score_pearsonr': 0.6747567816729287,
 'unselected_score_plan': 'mse (95 points)(|95.000|)',
 'unselected_score_scrambled_mean': 87.29999222882374}
 66%|######6   | 83/125 [48:44<11:00, 15.72s/it]

timodonnell commented 5 years ago

I've also encountered something like this, but unfortunately I'm not sure what the cause is. Do you still hit it if you set --num-jobs 1?