sidhomj / DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
https://sidhomj.github.io/DeepTCR/
MIT License

The dataset used in the regression model #57

zguo235 closed this issue 2 years ago

zguo235 commented 2 years ago

Hello,

I checked the dataset used in the regression model. It seems that simply dropping duplicate TCRs does not reproduce it. Could you tell me where I can find the preprocessing details needed to obtain a dataset for the regression model?

Thanks!

sidhomj commented 2 years ago

Scripts to train the regression models can be found under ancillary_analysis/supervised/supervised_reg/ in the following files: mart1_train.py, flu_train.py, and ebv_train.py.

The CSV file at Data/10x_Data/Data_Regression.csv already contains no duplicate alpha/beta pairs.
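
For anyone who wants to confirm this on a local copy, a minimal pandas check might look like the sketch below. The column names `alpha` and `beta` and the helper `count_duplicate_pairs` are assumptions made for illustration; check the header of the CSV and adjust them.

```python
import pandas as pd


def count_duplicate_pairs(df, alpha_col="alpha", beta_col="beta"):
    """Count rows whose alpha/beta pair already appeared in an earlier row.

    The column names are hypothetical defaults, not taken from the repository.
    """
    # duplicated() marks every repeat of a pair after its first occurrence.
    return int(df.duplicated(subset=[alpha_col, beta_col]).sum())


# Tiny made-up table: the first two rows share the same alpha/beta pair.
example = pd.DataFrame(
    {
        "alpha": ["CAVRDF", "CAVRDF", "CAGSGF"],
        "beta": ["CASSLF", "CASSLF", "CATSDF"],
    }
)
n_repeats = count_duplicate_pairs(example)
```

On the real file, the same function would be called on `pd.read_csv("Data/10x_Data/Data_Regression.csv")`; a result of 0 would match the statement above.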

zguo235 commented 2 years ago

Thank you for your prompt response. I have an in-house dataset that I want to use to train the regression model. It is structured like the count matrix in the original 10x dataset, where each row holds the UMI counts for one cell. I checked the ancillary_analysis/supervised/supervised_reg/*_train.py files, but they contain no description of the data preprocessing. How should I clean my dataset to get a file like Data/10x_Data/Data_Regression.csv for training the regression model?

sidhomj commented 2 years ago

Unfortunately, I can't find the scripts I wrote to convert the 10x outputs to that CSV file at this time, but the conversion should be fairly simple with basic pandas functions.
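
Since the original conversion script is not available, the following is only a rough sketch of what such a pandas conversion could look like, not the author's actual preprocessing. It assumes a per-cell table with two CDR3 columns (hypothetically named `alpha` and `beta`) and one UMI count column per antigen. It also assumes that cells sharing the same alpha/beta pair should be collapsed by summing their counts; averaging or another aggregation may be closer to what was actually done.

```python
import pandas as pd


def build_regression_table(cells, alpha_col="alpha", beta_col="beta", count_cols=None):
    """Collapse a per-cell UMI table into one row per unique alpha/beta pair.

    Column names and the choice of summing counts are assumptions for
    illustration; they are not taken from the DeepTCR repository.
    """
    if count_cols is None:
        # Treat every column other than the two CDR3 columns as a UMI count.
        count_cols = [c for c in cells.columns if c not in (alpha_col, beta_col)]
    # Keep only cells for which both chains were recovered.
    paired = cells.dropna(subset=[alpha_col, beta_col])
    # Aggregate the counts of all cells that share the same alpha/beta pair.
    return paired.groupby([alpha_col, beta_col], as_index=False)[count_cols].sum()


# Made-up example: two cells share a pair, one cell lacks an alpha chain.
cells = pd.DataFrame(
    {
        "alpha": ["CAVRDF", "CAVRDF", "CAGSGF", None],
        "beta": ["CASSLF", "CASSLF", "CATSDF", "CASSLF"],
        "antigen_A": [1, 2, 5, 7],
        "antigen_B": [0, 3, 1, 1],
    }
)
table = build_regression_table(cells)
```

The result could then be written out with `table.to_csv("my_regression_data.csv", index=False)`, where the file name is a placeholder. Any further transformation of the counts before training would have to be chosen separately, since the thread does not describe one.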