pogorely / ALICE

Detecting TCR involved in immune responses from single RepSeq datasets
GNU General Public License v3.0
25 stars 13 forks source link

BUG: clones below read threshold have their neighbors undercounted by 1 #20

Open zacmon opened 11 months ago

zacmon commented 11 months ago

Hi Misha,

There's a bug in the following lines of code for the filter_data_thres_df function which counts the neighbors (specifically this function here):

tmp <- stringdistmatrix(df$CDR3.amino.acid.sequence,df$CDR3.amino.acid.sequence,method="hamming")
df$D=apply(tmp,MARGIN = 1,function(x)sum(x[df$Read.count>nei_read_thres]<=1,na.rm = T))-1

Clones with reads less than nei_read_thres have their neighbors counted; however, since they are already excluded in the bool (df$Read.count>nei_read_thres), the subtraction by 1 undercounts their neighbors. I'm not super familiar with R, but I imagine filling the diagonal of the Hamming distance matrix with some arbitrarily high number would resolve this problem instead of subtracting by 1.

tmp <- stringdistmatrix(df$CDR3.amino.acid.sequence,df$CDR3.amino.acid.sequence,method="hamming")
diag(tmp) <- rep(999, nrow(tmp))
df$D=apply(tmp,MARGIN = 1,function(x)sum(x[df$Read.count>nei_read_thres]<=1,na.rm = T))

Let me know your thoughts.

Thanks and take good care, Zach