question on mic data

Hi, I have a similar question regarding the notebook "mic_classifier_training_prodecure.ipynb".

x_train = np.concatenate([mic_x_train, negatives_x_train])
y_train = np.concatenate([mic_y_train, np.zeros(len(negatives_x_train))])

At these lines, why do we combine the negative dataset (which I assume was retrieved from UniProt) with the inactive MIC dataset? The combined dataset will become highly imbalanced. Also, those (assumed) negative sequences from UniProt are easier to predict as negative than the inactive MIC sequences.
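To make the imbalance concern concrete, here is a minimal sketch of the same concatenation on made-up arrays. The sizes (100 MIC samples, 400 negatives), the feature width of 25, and the helper `label_balance` are hypothetical and chosen only for illustration; they are not taken from the notebook or the real data.

```python
import numpy as np

def label_balance(y):
    """Return (n_negative, n_positive, positive_fraction) for binary labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    return n_neg, n_pos, n_pos / len(y)

# Hypothetical data, for illustration only (not the real dataset).
rng = np.random.default_rng(0)
mic_x_train = rng.random((100, 25))
mic_y_train = np.array([1] * 60 + [0] * 40)  # 60 active, 40 inactive
negatives_x_train = rng.random((400, 25))    # assumed UniProt negatives

# Same concatenation as in the notebook lines quoted above.
x_train = np.concatenate([mic_x_train, negatives_x_train])
y_train = np.concatenate([mic_y_train, np.zeros(len(negatives_x_train))])

before = label_balance(mic_y_train)  # balance of the MIC data alone
after = label_balance(y_train)       # balance after adding the negatives
print("MIC only:", before)
print("combined:", after)
```

With these made-up sizes, the positive fraction drops sharply once the extra negatives are appended, which is the kind of shift I am asking about.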

Thanks for your time. Sincerely, Zhenjiao

szczurek-lab / hydramp

question on mic data #5