Interpreting Sequence_Inference output

CurlsForScience commented 3 years ago

Hello,

I would like to train a supervised model with known antigen specificity, then use that model to classify new TCR sequences as potentially targeting certain antigens. I have followed along with the tutorials, but am still unclear on the best way to do this. I believe the closest is the "8 - VAE Inference.ipynb" tutorial, but using a supervised model rather than the unsupervised one. However, I am unclear on how to interpret the output from Sequence_Inference. I am using the example data Mouse Antigens for the model and Rudqvist for the new dataset. The resulting "features" object is 23856x9, which I believe corresponds to the individual TCR sequences (23856) and 9 different antigens, with the entries being scores for how well each TCR sequence fits that antigen.

1) Does a higher or lower score mean the TCR sequence fits better with the given antigen?

I tried to assess this myself by looking at the features of the supervised model; however, this object has 224 columns. I was expecting it to have 9, corresponding to the different antigens.

2) What do the columns of the features object from the supervised model correspond to?

3) Would you suggest this method of classification, or something more akin to this tutorial "3 - Supervised Sequence Regression.ipynb"?

Thank you for your help!

sidhomj commented 3 years ago

Does a higher or lower score mean the TCR sequence fits better with the given antigen?

Correct. When you run the supervised model inference engine, you will generate a probability of each sequence for each class. As you noticed, in the case of the example data, these correspond to the 9 trained antigens.

What do the columns of the features object from the supervised model correspond to?

These are lower-level features in the network (after the convolutional/fully connected layers). I wouldn't use these when one has the actual probability of each sequence for each class. I would only use these lower-level features from the VAE, because with the VAE they are the best representation of each sequence.

Would you suggest this method of classification, or something more akin to this tutorial "3 - Supervised Sequence Regression.ipynb"?

I would recommend training a supervised classification model if your data has labels for each sequence, then using that trained model to run inference on new data with the Sequence_Inference functionality of DeepTCR to obtain probabilities of the new sequences for each class the original model was trained on. Notably, the order of the columns (i.e. the 9 antigens) can be found under obj.lb.classes_. This array will tell you the order of the labels.
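As a rough illustration of reading such an output, the sketch below pairs each column of a probability matrix with a label array and reports the highest-probability class per sequence. It does not call DeepTCR: `probs`, `classes`, the label names, and the helper `top_predictions` are all made up for the example. The only assumptions are that the inference output is a sequences-by-classes array of probabilities and that the label array is ordered the same way as its columns, as obj.lb.classes_ is described above.

```python
import numpy as np

# Hypothetical stand-ins: in practice `probs` would be the array returned by
# inference and `classes` would be obj.lb.classes_ from the trained model.
classes = np.array(["antigen_A", "antigen_B", "antigen_C"])  # made-up labels
probs = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.45, 0.25],
])


def top_predictions(probs, classes):
    """Return (label, probability) for the highest-probability class per row."""
    probs = np.asarray(probs)
    if probs.shape[1] != len(classes):
        raise ValueError("number of columns must match number of class labels")
    best = probs.argmax(axis=1)  # column index of the largest value in each row
    return [(str(classes[j]), float(probs[i, j])) for i, j in enumerate(best)]


predictions = top_predictions(probs, classes)
for label, p in predictions:
    print(f"{label}\t{p:.2f}")
```

Under this reading, a larger value in a column means the model assigns more probability to that class for that sequence.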

I hope this is helpful.

CurlsForScience commented 3 years ago

Thank you so much. That is very helpful. A couple of additional questions.

1) Regarding my first question: does a higher score indicate a better fit, or does a lower score indicate a better fit (like a p-value)?

2) The reason I asked about the number of columns in the features object of the supervised model is that I am not able to construct the UMAP following the tutorial "8 - VAE Inference.ipynb". When running the following line, I get the following error.

features_new = umap_obj.transform(features)
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 3 while Y.shape[1] == 224

I constructed umap_obj as described in the tutorial using DTCR_SS.features, but the dimensions of DTCR_SS.features are 2325x224 whereas the features object is 199x3. Should I use a different slot from my supervised model to build the original UMAP?
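The shapes reported here show the mismatch directly: the reducer was fitted on 224-column arrays and is then asked to transform a 3-column one. The following is a minimal sketch of a check that surfaces this before calling transform. The helper name `check_same_feature_space` is hypothetical, the zero-filled arrays only reproduce the shapes quoted in this comment, and nothing here calls UMAP or DeepTCR.

```python
import numpy as np


def check_same_feature_space(train_features, new_features):
    """Raise a readable error if the two arrays differ in column count."""
    n_train = np.asarray(train_features).shape[1]
    n_new = np.asarray(new_features).shape[1]
    if n_train != n_new:
        raise ValueError(
            f"fitted on {n_train} columns but asked to transform {n_new}; "
            "both arrays must come from the same model output"
        )
    return True


train_features = np.zeros((2325, 224))  # shape reported for DTCR_SS.features
new_features = np.zeros((199, 3))       # shape reported for the inference output

try:
    check_same_feature_space(train_features, new_features)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
print(mismatch)
```

Whichever array is used to fit the reducer, the array passed to transform has to have the same number of columns, so both need to come from the same kind of model output.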

3) I am trying the motif identification module to get motifs for the groups identified from the phenograph clustering. I am unclear how to change the classes slot to reflect the phenograph clusters in order to use Motif_Identification. How should I incorporate this information?

Thank you so much for your help!

sidhomj / DeepTCR

Interpreting Sequence_Inference output #46