sidhomj / DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
https://sidhomj.github.io/DeepTCR/
MIT License

Can't load own data using DTCR_SS.Get_Data #65

Closed by junyho486 2 years ago

junyho486 commented 2 years ago

I have TCRseq data that was annotated by IGB and preprocessed for DeepTCR as indicated in the tutorial (attachment: TRB.txt). I have 9 samples with many TCRs; here is an excerpt of the data for one sample:

cdr3_aa          v_call       d_call    j_call      Count
ASSARQDLQQY      TRBV2*01     TRBD1*01  TRBJ2-7*01  39890
ASKDRALLRAV      TRBV21-1*01  TRBD1*01  TRBJ2-7*01  32323
ASSFSATNTGELF    TRBV5-1*01   TRBD2*01  TRBJ2-2*01  26637
ASSPGEQNTGELF    TRBV7-8*01   TRBD2*01  TRBJ2-2*01  26258
ASSGAGTGGYNEQF   TRBV12-3*01  TRBD1*01  TRBJ2-1*01  16692
ASSFSGHTGELF     TRBV7-2*01   TRBD2*01  TRBJ2-2*01  13838
ASSVETGTEKY      TRBV7-9*01   TRBD1*01  TRBJ2-3*01  13831
PPVIWTATSST      TRBV24-1*01  TRBD1*01  TRBJ2-7*01  13819
ASSSGLAGAYEQY    TRBV7-2*02   TRBD2*01  TRBJ2-7*01  13216
ASSFGVSGANVLT    TRBV7-9*03   TRBD2*01  TRBJ2-6*01  11449
ASSGLAGGPGTGELF  TRBV9*01     TRBD2*02  TRBJ2-2*01  11292
ASSPLAGGVAQF     TRBV7-6*01   TRBD2*02  TRBJ2-1*01  11019
ASSSTGQGNSYEQY   TRBV28*01    TRBD1*01  TRBJ2-7*01  10466

If I run the tutorial for supervised sequence classification using the example data from the repository, loading data, clustering, etc. work perfectly, except for DTCR_SS.Train(), which throws:

AttributeError: 'DeepTCR_SS' object has no attribute 'test_pred'

DTCR_SS.Monte_Carlo_CrossVal, DTCR_SS.K_Fold_CrossVal, etc. work.

If I then replace the folders in Data/Murine_Antigens with my samples, DTCR_SS.Get_Data(), which usually takes just a moment to load the data, gets stuck (I stopped it after 40 min).

Restricting the input to TCRs with >= 1000 reads, which leaves tables of 50-80 rows, does not resolve the issue.

import sys
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_SS

# Instantiate training object
DTCR_SS = DeepTCR_SS('Tutorial')

# Load data from directories.
# The *_column arguments are zero-based column positions in each file.
DTCR_SS.Get_Data(directory='../../Data/TRB',Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=4,v_beta_column=1,j_beta_column=3)

Output:

Loading Data ...

Is there anything that could cause this kind of bug?

Attached you will find the data for one sample, restricted to TCR sequences with > 1000 reads (a .tsv file saved with a .txt extension).

Thank you in advance for your help!
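For reference, the read-count filter described above could be reproduced roughly like this. It is only a sketch: `filter_by_count` is a hypothetical helper, not part of DeepTCR, and it assumes a tab-separated file with a header row and a `Count` column as in the excerpt. The threshold of 1000 is the one mentioned in this thread.

```python
import csv


def filter_by_count(src, dst, count_column="Count", min_count=1000):
    """Copy rows of a tab-separated file whose count is >= min_count.

    Hypothetical helper (not part of DeepTCR). Assumes the first line of
    `src` is a header containing `count_column`.
    Returns the number of data rows written (header excluded).
    """
    kept = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(
            fout,
            fieldnames=reader.fieldnames,
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        for row in reader:
            # Keep only clonotypes with enough supporting reads.
            if int(row[count_column]) >= min_count:
                writer.writerow(row)
                kept += 1
    return kept
```

Calling `filter_by_count("sample.tsv", "sample_filtered.tsv")` (file names are placeholders) writes the reduced table and returns how many rows survived the filter.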

sidhomj commented 2 years ago

That bug should be fixed now in v2.1.17.

As for the Get_Data method, it takes files that are csv/tsv. Just change the extension of the file and it should work.
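If there are many files, the extension change can be scripted. The following is a minimal sketch, assuming the files are already tab-separated and only carry the wrong suffix. `rename_txt_to_tsv` is a hypothetical helper, not something DeepTCR provides.

```python
from pathlib import Path


def rename_txt_to_tsv(root):
    """Rename every *.txt file under `root` (recursively) to *.tsv.

    Hypothetical helper (not part of DeepTCR). Only the suffix changes;
    file contents are left untouched. Returns the list of new paths.
    """
    renamed = []
    # sorted() materialises the matches before any file is renamed.
    for path in sorted(Path(root).rglob("*.txt")):
        if not path.is_file():
            continue
        target = path.with_suffix(".tsv")
        if target.exists():
            continue  # skip rather than overwrite an existing file
        path.rename(target)
        renamed.append(target)
    return renamed
```

For the directory used in this issue, that would be `rename_txt_to_tsv("../../Data/TRB")`.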

junyho486 commented 2 years ago

Well, I did a clean install today, like so:

  1. conda create -n DEEPTCR python=3.8.0
  2. conda activate DEEPTCR
  3. pip3 install DeepTCR (also tried pip3 install git+https://github.com/sidhomj/DeepTCR.git, same bug)
  4. conda install ipykernel
  5. ipykernel install --user DEEPTCR

The files are in .tsv format; I only changed the extension to .txt to upload them to GitHub.

sidhomj commented 2 years ago

Sorry, are both bugs still there, or just the latter?

junyho486 commented 2 years ago

Thank you for the quick response!

I installed DeepTCR several times today trying to find the bug. Currently I am running the stable installation and see both bugs.
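With several installs in play, it can help to confirm which version is actually active in the environment before retesting. A small sketch using the standard library follows. `installed_version` and `at_least` are hypothetical helper names, and the comparison only handles purely numeric dotted versions such as 2.1.17.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def at_least(found, required):
    """Compare dotted numeric versions such as '2.1.17' component-wise.

    Assumption: both strings contain only integers separated by dots;
    suffixes like '.dev0' are not handled by this sketch.
    """
    def as_tuple(v):
        return tuple(int(part) for part in v.split("."))

    return as_tuple(found) >= as_tuple(required)
```

Run inside the conda environment, `installed_version("DeepTCR")` gives the version pip installed, and `at_least(found, "2.1.17")` tells whether it includes the release mentioned above.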

junyho486 commented 2 years ago

Edit: I reinstalled into a new env using pip3 install git+https://github.com/sidhomj/DeepTCR.git and the first bug seems to be resolved, but the second one persists.

sidhomj commented 2 years ago

Second bug fixed. It was an issue with the expected order of columns in the files. I fixed the loading function so the order does not matter anymore. Let me know if it works now!
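For anyone still on a version from before this fix, one way to double-check the column positions passed to Get_Data is to read them off the file header. This is only a sketch: `column_indices` is a hypothetical helper, not part of DeepTCR, and it assumes the first line of the file is a tab-separated header.

```python
import csv


def column_indices(path, delimiter="\t"):
    """Map each header name in the first line of a file to its zero-based index.

    Hypothetical helper (not part of DeepTCR). Assumes the file starts
    with a header row using the given delimiter.
    """
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter=delimiter))
    return {name: idx for idx, name in enumerate(header)}
```

For a file with the header shown in the original post, this gives cdr3_aa at 0, v_call at 1, d_call at 2, j_call at 3 and Count at 4, which matches the aa_column_beta=0, v_beta_column=1, j_beta_column=3 and count_column=4 arguments used in the call.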

junyho486 commented 2 years ago

Thank you so much! I was struggling with this one all day... Now both issues are resolved for the unsupervised and supervised models!

PS: Also, congrats on creating DeepTCR; it is a very impressive tool!