Labelled training data used in pMTnet

tianshilu / pMTnet

Deep Learning the T Cell Receptor Binding Specificity of Neoantigen

GNU General Public License v2.0

Labelled training data used in pMTnet #6

Open madnessfish opened 2 years ago

madnessfish commented 2 years ago

Thank you for a great tool! I am still pretty new to this field.

I would like to learn more about how pMTnet was trained. I am not sure whether I missed the training data in the repository. Could you please provide the training data used in pMTnet with positive and negative labels (e.g. positive/TCR_output.csv, negative/TCR_output.csv, training_positive.csv)? Thank you so much for all your efforts!

tianshilu commented 2 years ago

Hi,

Thanks for your interest. You can find the positive and negative training data through the links below:
https://drive.google.com/file/d/1_pf6xIK2dRql_zZ5A1BzoGvWIYVvEaBp/view?usp=sharing
https://drive.google.com/file/d/1KLlH-CBS4ep6UAEeh4Zv9Eghk1hZUWj7/view?usp=sharing

Thanks!

ddd9898 commented 2 years ago

Hi @tianshilu, I have a similar request. I'd like to run some comparisons against the method you proposed. Could you provide the testing data used in pMTnet with both positive and negative labels? Thank you!

tianshilu commented 2 years ago

Hi @Miles-DDD,

Please find the testing data with labels through the links below:
https://drive.google.com/file/d/1iddT16YEbEh5LYULokEMoey53RPiVsXt/view?usp=sharing
https://github.com/tianshilu/pMTnet/blob/master/test/input/test_input.csv

Thanks!

madnessfish commented 2 years ago

Hi @tianshilu, thank you for providing all the information!

I am curious how the negative sets were generated (is there a script?), because I found that 1912 entries appear in both the positive and negative training sets, using the command below. I am not sure whether I have made a mistake here.
comm -12 <(sort -u neg_training.csv) <(sort -u pos_training.csv) | wc -l
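For readers without a shell that supports process substitution, the same overlap check can be sketched in Python. This is only an illustration of what the `comm -12` command above computes (distinct lines present in both files); the helper name `shared_rows` is hypothetical and not part of pMTnet.

```python
def shared_rows(lines_a, lines_b):
    """Return the set of distinct lines present in both inputs.

    Mirrors `comm -12 <(sort -u a) <(sort -u b)`: whole lines are
    compared exactly, and duplicates within a file count once.
    """
    rows_a = {line.rstrip("\n") for line in lines_a}
    rows_b = {line.rstrip("\n") for line in lines_b}
    return rows_a & rows_b


# Example usage (file names taken from the command in this thread):
# with open("neg_training.csv") as neg, open("pos_training.csv") as pos:
#     print(len(shared_rows(neg, pos)))
```

Like the shell version, this compares entire lines, so an identical header row in both files would be counted as one shared entry.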

Also, I would like to know how these labeled training/test data relate to the training_data.csv and testing_data.csv files under pMTnet/data.

tianshilu commented 2 years ago

Hi @madnessfish,

Thanks for your interest in our study! For each TCR-pMHC pair, 10 negative pairs are generated by randomly sampling 10 TCRs from the other TCRs. As a result, a very small proportion of the negative pairs coincide with positive pairs by chance. We didn't remove the overlapping pairs from the negative dataset because they help reduce overfitting.

The negative datasets are generated from training_data.csv and testing_data.csv as described above. Hope this helps!

Tianshi
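The sampling procedure described in the reply above might look roughly like the sketch below. It is not the authors' script, which is not shown in this thread. It assumes the positive data can be read as (TCR, pMHC) tuples and that "the other TCRs" means every distinct TCR in the dataset except the one in the current pair. The function name, the tuple layout, and the `seed` parameter are all hypothetical.

```python
import random


def sample_negatives(positive_pairs, n_neg=10, seed=0):
    """Build negative (tcr, pmhc) pairs by mismatching TCRs at random.

    positive_pairs: list of (tcr, pmhc) tuples (assumed layout).
    For each positive pair, up to n_neg TCRs are drawn without
    replacement from all other distinct TCRs and paired with the
    same pMHC.
    """
    rng = random.Random(seed)
    tcr_pool = sorted({tcr for tcr, _ in positive_pairs})
    negatives = []
    for tcr, pmhc in positive_pairs:
        # Exclude only the TCR of the current pair (assumption).
        candidates = [t for t in tcr_pool if t != tcr]
        k = min(n_neg, len(candidates))
        negatives.extend((t, pmhc) for t in rng.sample(candidates, k))
    return negatives
```

Because a sampled TCR may itself be a known binder of the same pMHC in another row, some generated negatives can coincide with positive pairs, which is consistent with the chance overlap mentioned in the reply.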