szczurek-lab / hydramp

HydrAMP: a deep generative model for antimicrobial peptide discovery
https://hydramp.mimuw.edu.pl/
MIT License

question on mic data #5

Open chq1155 opened 1 year ago

chq1155 commented 1 year ago

Hi, in the notebook mic_classifier_training_prodecure.ipynb, there are about 3000 samples in mic_x_train and about 10000 in negatives_x_train.

But why does the training output say 'Train on 20457 samples, validate on 1312 samples'?
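One way to narrow this down is to print the array sizes right before the model is fitted. The sketch below uses placeholder arrays whose row counts are the approximate numbers quoted above, not the notebook's real data. The feature width of 25 and the helper `report_sizes` are assumptions made for illustration only.

```python
import numpy as np

# Placeholder arrays standing in for the notebook's variables.
# Row counts are the approximate figures from this question;
# the feature width (25) is an arbitrary assumption.
mic_x_train = np.zeros((3000, 25))
negatives_x_train = np.zeros((10000, 25))


def report_sizes(**arrays):
    """Hypothetical helper: print and return the number of rows per array."""
    sizes = {name: len(arr) for name, arr in arrays.items()}
    for name, n_rows in sizes.items():
        print(f"{name}: {n_rows}")
    return sizes


# Same concatenation pattern as in the notebook.
x_train = np.concatenate([mic_x_train, negatives_x_train])

sizes = report_sizes(
    mic_x_train=mic_x_train,
    negatives_x_train=negatives_x_train,
    x_train=x_train,
)
```

If the size printed for `x_train` in the real notebook does not match the number in the training log, the arrays are presumably being modified somewhere between the concatenation and the fit call.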

Thank you for your time

dzjxzyd commented 11 months ago

Hi, I have a similar question regarding the notebook "mic_classifier_training_prodecure.ipynb".

x_train = np.concatenate([mic_x_train, negatives_x_train])
y_train = np.concatenate([mic_y_train, np.zeros(len(negatives_x_train))])

At this line, why do we combine the negative dataset (which I assume was retrieved from UniProt) with the inactive sequences from the MIC dataset? The resulting dataset becomes highly imbalanced, and the (assumed) negative sequences from UniProt are easier to predict as negative.
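To make the imbalance concern concrete, here is a minimal sketch that counts the labels after the concatenation and derives "balanced" class weights with the common heuristic n_samples / (n_classes * count). The label arrays are placeholders: the 1500/1500 split between active and inactive MIC sequences is an assumption, not a figure from the notebook, and the sketch assumes mic_y_train holds binary 0/1 labels.

```python
import numpy as np

# Placeholder labels; the 1500/1500 active/inactive split is an assumption.
mic_y_train = np.concatenate([np.ones(1500), np.zeros(1500)])
n_negatives = 10000  # approximate size of negatives_x_train

# Same labelling pattern as the line quoted above: all extra negatives get 0.
y_train = np.concatenate([mic_y_train, np.zeros(n_negatives)])

# Count samples per class (index 0 = negative/inactive, 1 = active).
counts = np.bincount(y_train.astype(int), minlength=2)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
n_samples, n_classes = len(y_train), len(counts)
class_weight = {
    label: n_samples / (n_classes * count)
    for label, count in enumerate(counts)
}

print("class counts:", counts.tolist())
print("class weights:", class_weight)
```

A dictionary like this could be passed through the `class_weight` argument of Keras `fit`. Whether that is appropriate here depends on how the authors intended the UniProt negatives to be used.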

Thanks for your time. Sincerely, Zhenjiao