sidhomj / DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
https://sidhomj.github.io/DeepTCR/
MIT License
110 stars 40 forks source link

Sequence not featurized #51

Closed fbenedett closed 2 years ago

fbenedett commented 2 years ago

Dear all,

I am trying to analyze a dataset but, for unknown reasons, some of the sequences are not considered.

My code is the follow

%%capture
import sys
import pandas as pd
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

a_target="MA0"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)

#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
DTCRU.Cluster(clustering_method='phenograph', sample=500)
DFs = DTCRU.Cluster_DFs

r_df=pd.DataFrame()

for i in range(0, len(DFs)):
    tdf=DFs[i]
    tdf["cluster_index"]=i
    r_df=r_df.append(tdf)

fn="result_clustering_MA0_clean.txt" 
r_df.to_csv(fn)

In the directory "data_deep_tcr/MA0 I have two subfolders. Each of these subfolders contains one TSV file with the following format:

aminoAcid   counts  v_beta  j_beta
CASTHLDPPGEQYFG 571795  hTRBV28 hTRBJ02-7
CASSPLGASGEQFFG 317906  hTRBV28 hTRBJ02-1
CASGGGEQFFG 104692  hTRBV12-3   hTRBJ02-1
CANEGASENTEAFFG 86447   hTRBV06-1   hTRBJ01-1
CASSFFPFNEQFFG  74908   hTRBV12-3   hTRBJ02-1

For example, I pass this sequence (with the v_beta and j_beta and the counts) CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1 but it is not clustered.

Whereas, the same sequence with a different count, v_beta and j_beta is clustered: CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-17

Any idea why is this happening?

sidhomj commented 2 years ago

So the latent representations that are learned are a joint representation of sequence + V/D/J gene usage if you supply this information. So the latent representations of those two sequences will actually be different because they use different V and J-genes. If you don't want to use V/D/J gene information, do not provide it to the model. The clustering algorithms provided in DeepTCR are "off the shelf" algorithms that I have not personally developed. I cannot speak to how they cluster data. One can always extract the learned features from the DeepTCR object after fitting the autoencoder and apply your own clustering algorithms in python. Learned features can be found after training under DTCR.features.

fbenedett commented 2 years ago

I made a mistake when I was explaining the problem. The sequences are not clustered because they are not even featurized.

In the following case I provide a single file where I checked that there are no double entries

%%capture
import sys
import numpy as np
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U
import pandas as pd

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

a_target="MA0_alltogether"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)

Here I check the number of unique sequences

f=open("data_deep_tcr/MA0_alltogether/all_together.tsv", "r")
data=f.readlines()
f.close()
len(data)
27768
#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
len(DTCRU.features)
21910

About 6000 sequences are not featurized. And this happens also if I try to load the sequences in other ways.

sidhomj commented 2 years ago

DeepTCR removes certain sequences. In general, if the sequence is greater than the max_length parameter defined when instatiating the object (default is 40) or the sequences uses non-IUPAC letters, it is removed. You can find this logic below.

def Process_Seq(df,col):
    #Drop null values
    df = df.dropna(subset=[col])

    #strip any white space and remove non-IUPAC characters
    df[col] = df[col].str.strip()
    df = df[~df[col].str.contains(r'[^A-Z]')]
    iupac_c = set((list(IUPAC.IUPACProtein.letters)))
    all_c = set(''.join(list(df[col])))
    searchfor = list(all_c.difference(iupac_c))
    if len(searchfor) != 0:
        df = df[~df[col].str.contains('|'.join(searchfor))]
    return df
fbenedett commented 2 years ago

But something else is going on here... For example, I pass this sequence (with the v_beta and j_beta and the counts) CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1 but it is not featurized.

Whereas, the same sequence with a different count, v_beta and j_beta is featurized: CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-1

It can't be the sequence. it should be something regarding the count or the v_beta or j_beta. But, if I understand correctly, v_beta and j_beta are just labels.

fbenedett commented 2 years ago

Please, reopen the issue. The fact that the same sequence with different v_beta j_beta is not featurize is still a problem.

sidhomj commented 2 years ago

I think I know what your issue is. If you pass the parameter (aggregate_by_aa = True), DeepTCR collapses all TCR's with the same amino acid sequence. In this case, the only one of the v/d/j labels is taken in this collapse. Try setting the parameter to aggregate_by_aa = False and it should keep both those sequences separate in the featurization. Let me know if this works.

fbenedett commented 2 years ago

Thank you. That was the problem.