sidhomj / DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
https://sidhomj.github.io/DeepTCR/
MIT License

Question about incomplete data #35

Closed emm1R closed 3 years ago

emm1R commented 3 years ago

Hi again. How does DeepTCR deal with columns that have some missing values? For example, what happens if some TCRBs are missing the D gene?

sidhomj commented 3 years ago

It represents missing information as precisely that: "unknown" is provided as the input to the model.

emm1R commented 3 years ago

I noticed that when aggregate_by_aa is True, the code has agg_dict[col] = 'first'. If the first TCR has missing values, wouldn't it be better to check whether there is a TCR with values in all columns?
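A minimal sketch of that idea in pandas, not DeepTCR's actual code. The function name `aggregate_prefer_complete`, the column names (`beta`, `v_beta`, `d_beta`, `j_beta`, `counts`), and the list of placeholder strings are all assumptions made for illustration. Rows are ordered so the most complete one comes first within each amino-acid sequence, and the per-column `'first'` aggregation (which skips nulls in pandas) then fills remaining gaps from later rows.

```python
import pandas as pd

# Assumed placeholder strings for "missing"; adjust to your platform's output.
MISSING_TOKENS = ["", "unknown", "unresolved"]


def aggregate_prefer_complete(df, key="beta", count_col="counts"):
    """Collapse rows sharing the same sequence, preferring complete gene calls."""
    df = df.copy()
    gene_cols = [c for c in df.columns if c not in (key, count_col)]

    # Turn placeholder strings into real nulls so pandas can skip them.
    df[gene_cols] = df[gene_cols].mask(df[gene_cols].isin(MISSING_TOKENS))

    # Put the rows with the fewest missing fields first (stable sort keeps ties in order).
    df["_n_missing"] = df[gene_cols].isna().sum(axis=1)
    df = df.sort_values("_n_missing", kind="stable")

    # 'first' takes the first non-null value per column; counts are summed.
    agg_dict = {c: "first" for c in gene_cols}
    agg_dict[count_col] = "sum"
    out = df.groupby(key, sort=False).agg(agg_dict).reset_index()

    # Anything still missing in every row goes back to a single placeholder.
    return out.fillna({c: "unknown" for c in gene_cols})


example = pd.DataFrame(
    {
        "beta": ["CASSL", "CASSL", "CASSQ"],
        "v_beta": ["TRBV1", "TRBV1", "TRBV2"],
        "d_beta": ["unknown", "TRBD1", "unknown"],
        "j_beta": ["TRBJ1", "TRBJ1", "TRBJ2"],
        "counts": [3, 2, 5],
    }
)
collapsed = aggregate_prefer_complete(example)
```

In this sketch a sequence keeps a D gene call whenever any of its rows has one, and only falls back to "unknown" when every row lacks it.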

emm1R commented 3 years ago

Also, is it normal for two TCRBs to be assigned to two different clusters even though the only difference between them is that one of them is missing the D gene information?

sidhomj commented 3 years ago

Thank you for the suggestion! We will definitely consider how to implement this in future versions. At this time, however, different sequencing platforms denote "unknown" differently, so we need to consider how to account for that. That being said, we do not expect a major difference in the results, nor do we expect two TCRs with identical sequences to fall very far apart in sequence space because of a different D segment. If you think your data set really requires this to be correct, you can always parse the data as you like and pass it to DeepTCR through the Load_Data method rather than the Get_Data method.
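For anyone parsing their own data first, here is a rough sketch of the normalization step, assuming the gene calls are available as a list or column of strings. The helper name `normalize_gene_calls` and the placeholder list are hypothetical and not part of DeepTCR; the list is not exhaustive and should be checked against what your sequencing platform actually emits. The resulting arrays are what you would then hand to Load_Data (see the DeepTCR documentation for its exact arguments).

```python
import pandas as pd

# Assumed, non-exhaustive spellings of "missing", compared case-insensitively.
PLATFORM_MISSING = ("", "na", "nan", "none", "unknown", "unresolved")


def normalize_gene_calls(values, placeholder="unknown"):
    """Map every missing-value spelling to one placeholder string."""
    s = pd.Series(values, dtype="object")
    # Real nulls, plus any string that matches a known placeholder after cleanup.
    is_missing = s.isna() | s.astype(str).str.strip().str.lower().isin(PLATFORM_MISSING)
    return s.where(~is_missing, placeholder).to_numpy(dtype=object)


d_beta = normalize_gene_calls(["TRBD1", "unresolved", None, " NA ", "TRBD2"])
```

With a single consistent placeholder, two otherwise identical TCRs differ in the D gene input only when one truly has a call and the other does not.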