tttianhao / CLEAN

CLEAN: a contrastive learning model for high-quality functional prediction of proteins

on single sequence query: ValueError: Only one class present in y_true. ROC AUC score is not defined in that case. #36

Closed emilyvansyoc closed 1 year ago

emilyvansyoc commented 1 year ago

Hello,

We are attempting to determine the lowest P value range for a single protein sequence using the conda CLEAN install; i.e., the input CSV contains one sequence with an EC number and identifier. When we run this through infer_pvalue with default parameters, it produces results but raises the following error/warning and does not print the model fit statistics (recall, precision, etc.):

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
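
For reference, our call looks roughly like this (a sketch; the argument names are inferred from src/CLEAN/infer.py, and the data names are placeholders for our files, so the exact signature may differ):

```python
from CLEAN.infer import infer_pvalue

# 'split100' and 'single_query' are placeholders for our training split
# and our one-sequence query CSV in ./data/, not CLEAN defaults.
infer_pvalue('split100', 'single_query', report_metrics=True)
```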

It seems that the AUC cannot be calculated on only one input sequence. Does this affect the P value cutoff or model interpretation? Does the infer_pvalue function depend on multiple queries in an input file? If multiple sequences are required, how should we interpret the predictions for the one query sequence of interest?

Thanks so much for your help.

canallee commented 1 year ago

The ROC AUC score is not defined when you only have one ground-truth label; you can comment out that line and print only the other metrics. This shouldn't affect the P value cutoff or model interpretation, since the P value is computed against the training set rather than the queries.
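
This is just sklearn's behavior: with a single class in y_true there is no second class to rank against, so the score is undefined. A minimal reproduction:

```python
from sklearn.metrics import roc_auc_score

# Every ground-truth label belongs to the same class, so there is
# nothing to rank against and sklearn raises the same error:
roc_auc_score([1, 1, 1], [0.9, 0.8, 0.7])
# ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
```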

emilyvansyoc commented 1 year ago

Ok, thank you for your response! Could you clarify which line to comment out? I have tried the following, and neither works:

In src/CLEAN/infer.py:

```python
if report_metrics:
    pred_label = get_pred_labels(out_filename, pred_type='_pvalue')
    pred_probs = get_pred_probs(out_filename, pred_type='_pvalue')
    true_label, all_label = get_true_labels('./data/' + test_data)
    pre, rec, f1, roc, acc = get_eval_metrics(
        pred_label, pred_probs, true_label, all_label)
    print(f'############ EC calling results using random '
          f'chosen {nk_random}k samples ############')
    print('-' * 75)
    print(f'>>> total samples: {len(true_label)} | total ec: {len(all_label)} \n'
          f'>>> precision: {pre:.3} | recall: {rec:.3}'
          f'| F1: {f1:.3} |') #AUC: {roc:.3} ')
    print('-' * 75)
```

In src/CLEAN/evaluate.py:

```python
pre = precision_score(true_m, pred_m, average='weighted', zero_division=0)
rec = recall_score(true_m, pred_m, average='weighted')
f1 = f1_score(true_m, pred_m, average='weighted')
roc = roc_auc_score(true_m, pred_m_auc, average='weighted')
acc = accuracy_score(true_m, pred_m)
return pre, rec, f1#, roc, acc
```
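
Edit: comparing the two files again, I think I see the problem: roc_auc_score is still being called in evaluate.py before the modified return, so the ValueError is raised at the call itself rather than at the print or return. Guarding that call seems to work for us (a sketch, not an official fix):

```python
# Sketch: skip ROC AUC when it is undefined (single-class y_true),
# keeping get_eval_metrics' return signature unchanged.
try:
    roc = roc_auc_score(true_m, pred_m_auc, average='weighted')
except ValueError:
    roc = float('nan')  # undefined with a single ground-truth class
```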