liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
282 stars 49 forks source link

confusions about reads count #48

Closed zhouxinseeu closed 3 years ago

zhouxinseeu commented 3 years ago

Hi,

I have analyzed a simple data by command: /SGRNJ03/randd/zhouxin/software/TRUST4/run-trust4 -t 10 -u ./NJU_Pt_2_2_3Nlib/r2.fq --barcode ./NJU_Pt_2_2_3Nlib/r1.fq --barcodeRange 0 23 + -f /SGRNJ03/randd/zhouxin/software/TRUST4/mouse/GRCm38_bcrtcr.fa --ref /SGRNJ03/randd/zhouxin/software/TRUST4/mouse/mouse_IMGT+C.fa -o nju_test --od ./nju_test I get results like:

$ ll nju_test/
total 4477464
-rwxr-xr-x. 1 zhouxin ssh.randd  159160893 Jun  8 22:00 nju_test_annot.fa
-rwxr-xr-x. 1 zhouxin ssh.randd  865050362 Jun  8 21:55 nju_test_assembled_reads.fa
-rwxr-xr-x. 1 zhouxin ssh.randd    4096927 Jun  8 22:00 nju_test_barcode_report.tsv
-rwxr-xr-x. 1 zhouxin ssh.randd   17091508 Jun  8 21:59 nju_test_cdr3.out
-rwxr-xr-x. 1 zhouxin ssh.randd  703247457 Jun  8 21:56 nju_test_final.out
-rwxr-xr-x. 1 zhouxin ssh.randd  703247457 Jun  8 21:55 nju_test_raw.out
-rwxr-xr-x. 1 zhouxin ssh.randd    2568020 Jun  8 22:00 nju_test_report.tsv
-rwxr-xr-x. 1 zhouxin ssh.randd  247015760 Jun  8 20:27 nju_test_toassemble_bc.fa
-rwxr-xr-x. 1 zhouxin ssh.randd 1299742745 Jun  8 20:27 nju_test_toassemble.fq 

Now, I want to calculate reads count of TRA and TRB. I have tried to sum reads count in nju_test_barcode_report.tsv and found it only accounts for 3% of assembled reads. From the results, the read count in nju_test_barcode_report.tsv is normalized, right? If I want to get the original read count of TRA and TRB, is there any good way?

Thank you for your time!

mourisl commented 3 years ago

Could you please show me how you sum the read count? Did you sum up the 7th subcolumn in the comma-splitter entries in the barcode_report file? Thank you.

zhouxinseeu commented 3 years ago

I extract read count from the 7th subcolumn in the chain1 and chain2. The code for extraction:

import pandas as pd

res = pd.read_csv('/SGRNJ03/randd/zhouxin/zx_test/TRUST4/nju_test/nju_test_barcode_report.tsv', sep='\t')

rows = res.shape[0]

TRB_Vs, TRB_Ds, TRB_Js, TRB_Cs, TRB_cdr3nts, TRB_cdr3aas, TRB_readcounts, TRB_fls = [], [], [], [], [], [], [], []

for i in range(rows):
    TRBs = res.loc[i, 'chain1']
    TRBs = TRBs.split(',')
    if len(TRBs) == 10:
        TRB_V, TRB_D, TRB_J, TRB_C, TRB_cdr3nt, TRB_cdr3aa, TRB_readcount, TRB_fl = TRBs[0], TRBs[1], TRBs[2], TRBs[3], TRBs[4], TRBs[5], TRBs[6], TRBs[-1]
        TRB_Vs.append(TRB_V)
        TRB_Ds.append(TRB_D)
        TRB_Js.append(TRB_J)
        TRB_Cs.append(TRB_C)
        TRB_cdr3nts.append(TRB_cdr3nt)
        TRB_cdr3aas.append(TRB_cdr3aa)
        TRB_readcounts.append(TRB_readcount)
        TRB_fls.append(TRB_fl)
    elif len(TRBs) != 10:
        TRB_Vs.append('NAN')
        TRB_Ds.append('NAN')
        TRB_Js.append('NAN')
        TRB_Cs.append('NAN')
        TRB_cdr3nts.append('NAN')
        TRB_cdr3aas.append('NAN')
        TRB_readcounts.append('NAN')
        TRB_fls.append('NAN')

res['TRB_V'] = TRB_Vs
res['TRB_D'] = TRB_Ds
res['TRB_J'] = TRB_Js
res['TRB_C'] = TRB_Cs
res['TRB_cdr3nt'] = TRB_cdr3nts
res['TRB_cdr3aa'] = TRB_cdr3aas
res['TRB_readcount'] = TRB_readcounts
res['TRB_fl'] = TRB_fls

The new result is like:

                  #barcode cell_type  \
0  TAGCTTGTTCGAAGTGTTCTGTGT       abT   
1  ACAGTGGTTCTGCTGTTTGGTATG       abT   
2  CGATGTTTTTGCGTACTATGTGGC       abT   
3  TGACCACTGGCTACAGTCAGGAGG       abT   
4  CGATGTTTTCCTCAATTGTGGTTG       abT   

                                              chain1  \
0  TRBV29*01,TRBD2*01,TRBJ2-1*01,TRBC1,TGTGCTAGCA...   
1  TRBV14*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCAGCA...   
2  TRBV17*01,TRBD1*01,TRBJ2-1*01,TRBC1,TGTGCTAGCA...   
3  TRBV14*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCAGCA...   
4  TRBV14*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCAGCA...   

                                              chain2  \
0  TRAV9N-3*01,*,TRAJ30*01,TRAC,TACTTCTGTGCTGTGAG...   
1  TRAV7-5*01,*,TRAJ48*01,TRAC,TACCTCTGTGCAGTGTTG...   
2  TRAV16N*01,*,TRAJ34*02,TRAC,TATTTCTGTGCTATGAGG...   
3  TRAV9D-4*04,*,TRAJ32*01,TRAC,TACTTCTGTGCTGTGAG...   
4  TRAV6-6*01,*,TRAJ37*01,TRAC,TACTACTGTGCTCTGGAT...   

                                    secondary_chain1  \
0  TRBV14*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCAGCA...   
1  TRBV31*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCTGGG...   
2  TRBV16*01,TRBD1*01,TRBJ2-5*01,TRBC1,TGTGCAAGCA...   
3  TRBV14*01,TRBD2*01,TRBJ2-2*01,TRBC1,TGTGCCAGCA...   
4                                                  *   

                                    secondary_chain2      TRB_V     TRB_D  \
0  TRAV3N-3*01,*,TRAJ18*01,TRAC,TACTTCTGCGCAGAAGG...  TRBV29*01  TRBD2*01   
1  TRAV6-6*01,*,TRAJ37*01,TRAC,TACTACTGTGCTCTGGAT...  TRBV14*01  TRBD2*01   
2  TRAV4N-4*01,*,TRAJ21*01,TRAC,TACTTCTGTGCTCCCAT...  TRBV17*01  TRBD1*01   
3  TRAV6D-6*02,*,TRAJ37*01,TRAC,TACTACTGTGCTCTGGA...  TRBV14*01  TRBD2*01   
4  TRAV6-6*01,*,TRAJ22*01,TRAC,TACTACTGTGCTCTGGGT...  TRBV14*01  TRBD2*01   

        TRB_J  TRB_C  ...   TRB_readcount TRB_fl        TRA_V TRA_D  \
0  TRBJ2-1*01  TRBC1  ...           14.18      1  TRAV9N-3*01     *   
1  TRBJ2-4*01  TRBC1  ...            7.10      1   TRAV7-5*01     *   
2  TRBJ2-1*01  TRBC1  ...           17.60      1   TRAV16N*01     *   
3  TRBJ2-4*01  TRBC1  ...           91.60      1  TRAV9D-4*04     *   
4  TRBJ2-4*01  TRBC1  ...            3.79      0   TRAV6-6*01     *   

       TRA_J TRA_C                                         TRA_cdr3nt  \
0  TRAJ30*01  TRAC  TACTTCTGTGCTGTGAGCATGAAGGGCACAAATGCTTACAAAGTCA...   
1  TRAJ48*01  TRAC      TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT   
2  TRAJ34*02  TRAC   TATTTCTGTGCTATGAGGGGCTCTTCCAATACCAACAAAGTCGTCTTT   
3  TRAJ32*01  TRAC  TACTTCTGTGCTGTGAGGACGAATTATGGGAGCAGTGGCAACAAGC...   
4  TRAJ37*01  TRAC   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT   

           TRA_cdr3aa TRA_readcount TRA_fl  
0   YFCAVSMKGTNAYKVIF         24.31      0  
1     YLCAVLNYGNEKITF         13.73      1  
2    YFCAMRGSSNTNKVVF          4.00      1  
3  YFCAVRTNYGSSGNKLIF         79.20      0  
4    YYCALDITGNTGKLIF         21.04      0  

[5 rows x 22 columns]

The code for sum:

import numpy as np

trbreads = [float(i) for i in res['TRA_readcount'].tolist()]

npl = np.array(trbreads, dtype=np.float)
npl = npl[~np.isnan(npl)]
npl.sum()

Is there any problem with my process?

mourisl commented 3 years ago

Your script looks fine for me. I assume this is a 5' 10x kit. Then most of the reads will be on the 5' end, which is in the V gene region, and TRUST4 will assemble those regions too. So it is normal that only a small fraction of assembled reads can fall in the CDR3 region.

I just noticed an issue in your command, for a typical 5' 10x kit, the barcode range should be "0 15 +", and the other 8 nucleotides are for UMI. If you use all of them, it will overcount the number of cells.

zhouxinseeu commented 3 years ago

Thank you for your quick reply.

So you mean read count in the 7th subcolumn in the comma-splitter entries of barcode report file aims to CDR3 region, not TRA/TRB full length including V(D),J region?

As for the barcode problem you mentioned, the kit I used is not 10X genomics, but Singleon GEXSCOPE Immune Repertoire Profiling kit, of which barcode is 24 bases. It captures mRNA from the 3'end. Does this also affect the results?

mourisl commented 3 years ago

Yes, the 7th column is only the CDR3 read count. I don't know how much reads from GEXSCOPE 3' kit are fall into the 3' end, like in constant gene. TRUST4 assembles part of the constant gene too, so those assembled reads could from J and constant genes. From my experience with 10x data, 3' kit has very few read covering the CDR3 region, so the sensitivity is much lower comparing with using 5' kit.

zhouxinseeu commented 3 years ago

Thank you for your suggestion. I will follow this problem.

mourisl commented 3 years ago

By rereading your file, the few lines in the barcode_report file have both chains, which are very rare in 10x 3' kit. So I'm wondering did you apply TRUST4 on the GEXSCOPE 3'kit data for the transcriptome data part or the immune profiling data part? If it is immune profiling, then it might be the case that many assembled reads are in the V or constant gene region. Thank you.

zhouxinseeu commented 3 years ago

Hi, mouris!

Sorry for the late reply. My data is the immune profiling data, not transcriptome data. Does trust4 need special process or cmd for immune profiling data? I saw that trust4 has --repseq argument. Is it for immune profiling data?

By the way, when I checked the results of trust4, I found that one cell (barcode) failed for both full length TRA and TRB chains in the barcode_report.tsv. However, I can find full length TRA and TRB chains of the same cell (barcode) in annot.fa. I compared the read count of cdr3 in these two results, and found that chain1 and chain2 in barcode_report.tsv have the highest number of reads. Do you keep TRA or TRB based on the number of cdr3 reads?

Thank you!

mourisl commented 3 years ago

Since TCR/BCR-seq data may have too many reads and TRUST4 can be very slow ,--repseq is just for speed up,

Could you please share the two chains' assemblies in the _annot.fa file with its header? One possible reason is that the CDR3 is not annotated correctly, and got filtered out when generating barcode_report.tsv file. Within the same barcode, chain1 and chain2 should have the highest number of supporting read, and other CDR3 with fewer read supporting would be in the last two columns.

zhouxinseeu commented 3 years ago

Chains head in annot.fa TRA chain: >GCCAATGTTATGTGGCGACCTTAG_9059 381 11.10 TRAV7-5*01(277):(0-176):(96-272):96.61,TRAV7D-5*01(277):(0-176):(96-272):96.61 * TRAJ48*01(61):(180-237):(3-60):100.00 TRAC(935):(238-380):(0-142):100.00 CDR1(0-0):0.00=null CDR2(48-65):100.00=TCCATCTTCTCTGATGGT CDR3(162-206):100.00=TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT TTCAGGTGGTACAGACAGCATTCTGCGAAAGGCCTTGAGGTGCTAGTGTCCATCTTCTCTGATGGTGAAAAGGAAGAAGGCAGATTTACAGCTCACCTCAATAGAGCCAACTTGCATGTTTCCCTACACATCAGAGAACCACAACCCAGTGACTCTGCTGTCTACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTTGGGGCTGGAACCAAACTCACCATTAAACCCAACATCCAGAACCCAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTCTCAGGACAGCACCCTCTGCCTGTTCACCGACTTTGACTCCCAAATCAATGTGCCGAAAACCATGGAATCTGGAACGTTCATCACTGACAAAACT

TRB chain: >GCCAATGTTATGTGGCGACCTTAG_20225 276 30.59 TRBV14*01(289):(0-104):(181-285):100.00 TRBD2*01(14):(106-110):(4-8):100.00 TRBJ2-4*01(49):(113-161):(0-48):100.00 TRBC1(737):(162-261):(0-99):100.00,TRBC2(692):(162-261):(0-99):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(92-133):100.00=TGTGCCAGCAGTTCCTGGGTCAGTCAAAACACCTTGTACTTT CCTCGGATCGATTTTCTGCTGTGAGGCCTAAAGGAACTAACTCCACTCTCAAGATCCAGTCTGCAAAGCAGGGCGACACAGCCACCTATCTCTGTGCCAGCAGTTCCTGGGTCAGTCAAAACACCTTGTACTTTGGTGCGGGCACCCGACTATCGGTGCTAGAGGATCTGAGAAATGTGACTCCACCCAAGGTCTCCTTGTTTGAGCCATCAAAAGCAGAGATTGCAAACAAACAAAAGGCTACCCTCGTGTGCTTGGCCAGATCGGAAGAGCACA

Full length chain but not in barcode_report.tsv: TRA: >GCCAATGTTATGTGGCGACCTTAG_2199 543 19.02 TRAV6D-6*02(287):(78-360):(0-282):100.00,TRAV6-6*03(282):(81-360):(0-279):97.14 * TRAJ37*01(60):(364-423):(0-59):100.00 TRAC(935):(424-542):(0-118):95.80 CDR1(153-176):100.00=TCAGCCACAAGCATAGCTTACCCT CDR2(228-248):100.00=AAAGTCATTACGGCTGGCCAG CDR3(345-392):100.00=TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT ACTTTCTAGATGACACTAAAGATGGACTCTTCTCCAGGCTTCGTGGCTGTGATACTTCTCATACTTGGAAGGACCCACGGAGATTCCGTGACTCAAACAGAAGGCCCAGTGACCGTCTCAGAAAGCGAGTCCCTGATAATAAATTGCACGTATTCAGCCACAAGCATAGCTTACCCTAATCTTTTCTGGTATGTTCGATATCCTGGAGAAGGTCTACAACTCCTCCTGAAAGTCATTACGGCTGGCCAGAAGGGAAGCAGCAGAGGGTTTGAAGCCACATACAATAAAGAAACCACCTCCTTCCACTTGCAGAAAGCCTCAGTGCAAGAGTCAGACTCGGCTGTGTACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTTGGACTGGGGACAACTTTACAAGTGCAACCAGACATCCAGAACCAAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTCTCAGGACAGCACCCTCTGCCTGTTCACAGACTTTGACTCCCAAATCAATGTGCAGAAACCCAGGGAATCT

TRB: >GCCAATGTTATGTGGCGACCTTAG_24585 580 18.56 TRBV19*01(286):(166-451):(0-285):100.00 * TRBJ2-4*01(49):(454-502):(0-48):100.00 TRBC1(737):(503-579):(0-76):100.00,TRBC2(692):(503-579):(0-76):100.00 CDR1(244-258):100.00=TTCAATCATGATACA CDR2(310-327):100.00=TCAATAACTGAAAACGAT CDR3(436-474):100.00=TGTGCCAGCAGTATAGTCAGTCAAAACACCTTGTACTTT ATTCTAGTCATCATAGTTAGCCAGGTGGAGCCCCTGCATAGCGCGTCTCCTTGTTTCTCTTTTAACTAATGCCCAGAGCCAAAGAAAGTCCCTCCAAACTATGAACAAGTGGGTTTTCTGCTGGGTAACCCTTTGTCTCCTTACTGTAGAGACCACACATGGTGATGGTGGCATCATTACTCAGACACCCAAATTCCTGATTGGTCAGGAAGGGCAAAAACTGACCTTGAAATGTCAACAGAATTTCAATCATGATACAATGTACTGGTACCGACAGGATTCAGGGAAAGGATTGAGACTGATCTACTATTCAATAACTGAAAACGATCTTCAAAAAGGCGATCTATCTGAAGGCTATGATGCGTCTCGAGAGAAGAAGTCATCTTTTTCTCTCACTGTGACATCTGCCCAGAAGAACGAGATGGCCGTTTTTCTCTGTGCCAGCAGTATAGTCAGTCAAAACACCTTGTACTTTGGTGCGGGCACCCGACTATCGGTGCTAGAGGATCTGAGAAATGTGACTCCACCCAAGGTCTCCTTGTTTGAGCCATCAAAAGCAGAGATTGCAAACAAACAAAAG

zhouxinseeu commented 3 years ago

As you noticed that chain 1 and chain 2 should have the most reads in the barcode_report.tsv, if I filter the results by average_coverage in annot.fa, I should get the same results as barcodr_report.tsv, right?

mourisl commented 3 years ago

The actual abundance is in the third last column of the _cdr3.out file, where TRUST4 will realign the assembled reads to quantify each CDR3. Are you using the newest version of TRUST4, there could be a wrong abundance estimation in some earlier versions.

zhouxinseeu commented 3 years ago

Well, my version is 1.0.2. I will update it to the newest version and check the results again.

zhouxinseeu commented 3 years ago

I have downloaded TRUST4 of v1.0.4, but the version in the software is still v1.0.2.

$ ls TRUST4-1.0.4

alignments.hpp    BuildDatabaseFa.pl       FastqExtractor.o            KmerCount.hpp  ReadFiles.hpp     trust-barcoderep-to-10X.pl
annotator         BuildImgtAnnot.pl        FilterAnnotatedAssembly.pl  KmerIndex.hpp  README.md         trust-cluster.py
Annotator.cpp     BuildImgtVquestAnnot.pl  GetFullLengthAssembly.pl    kseq.h         run-trust4        trust-simplerep.pl
Annotator.o       Cgene.list               hg19_bcrtcr.fa              LICENSE.txt    samtools-0.1.19   trust-smartseq.pl
bam-extractor     defs.h                   hg38_bcrtcr.fa              main.cpp       SeqSet.hpp        trust-stats.py
BamExtractor.cpp  example                  human_IMGT+C.fa             main.o         SimpleVector.hpp
BamExtractor.o    fastq-extractor          human_vdjc.list             Makefile       trust4

$ run-trust4

TRUST4 v1.0.2 usage: ./run-trust4 [OPTIONS]:
Required:
        -b STRING: path to bam file
        -1 STRING -2 STRING: path to paired-end read files
        -u STRING: path to single-end read file
        -f STRING: path to the fasta file coordinate and sequence of V/D/J/C genes
Optional:
        --ref STRING: path to detailed V/D/J/C gene reference file from IMGT database. (default: not used but recommended)
        -o STRING: prefix of output files. (default: inferred from file prefix)
        --od STRING: the directory for output files. (default: ./)
        -t INT: number of threads (default: 1)
        --barcode STRING: if -b, bam field for barcode; if -1 -2/-u, file containing barcodes (default: not used)
        --barcodeRange INT INT CHAR: start, end(-1 for length-1), strand in a barcode is the true barcode (default: 0 -1 +)
        --barcodeWhitelist STRING: path to the barcode whitelist (default: not used)
        --read1Range INT INT: start, end(-1 for length-1) in -1/-u files for genomic sequence (default: 0 -1)
        --read2Range INT INT: start, end(-1 for length-1) in -2 files for genomic sequence (default: 0 -1)
        --mateIdSuffixLen INT: the suffix length in read id for mate. (default: not used)
        --skipMateExtension: do not extend assemblies with mate information, useful for SMART-seq (default: not used)
        --abnormalUnmapFlag: the flag in BAM for the unmapped read-pair is nonconcordant (default: not set)
        --noExtraction: directly use the files from provided -1 -2/-u to assemble (default: extraction first)
        --repseq: the data is from TCR-seq or BCR-seq (default: not set)
        --stage INT: start TRUST4 on specified stage (default: 0):
                0: start from beginning (candidate read extraction)
                1: start from assembly
                2: start from annotation
                3: start from generating the report table
mourisl commented 3 years ago

My bad, I forgot to update the version number in the wrapper, and the code should be 1.0.4. I just fixed this version number issue. Thank you.

zhouxinseeu commented 3 years ago

I have checked the new results from TRUST4 v1.0.4.

I got full length assembly fasta file by GetFullLengthAssembly.pl. Then I filtered the full length cells (barcodes) by average_coverage in comment. I got 1577 cells finally.

Also, I filtered _cdr3.out by read_fragment_count and full_length_assembly. I kept the most read_fragment_count and full length assembly. I got 1578 cells.

I checked the different cell and found it lost in the full length assembly fasta file, which means this cell is marked as full length in _cdr3.out but no full length assembly.

I found this sequence in _annot.fa:

>TGACCACTTGACCACTTACCACCA_47779 434 19.54 TRAV16*01(290):(132-238):(0-106):100.00 * TRAJ4*02(63):(319-380):(1-62):100.00 TRAC(935):(381-433):(0-52):100.00 CDR1(207-227):100.00=GAAACCCAGGACAGTTCTTAC CDR2(0-0):0.00=null CDR3(299-349):50.00=TATTTCTGTGCTATGAGAGGGTTATCTGGTAGCTTCAATAAGTTGACCTTT
ACAAAGTATAGATTAAGATCTGGTTGAGAGAGAACATAGACAACTTCACTCAAGACCAGAGCTAACAGTATGCTGATTCTAAGCCTGTTGGGAGCTCCATTTTTTGGCTCCATTTGTTTTGCAACCAGCATGGCCCAGAAGGTAACACAGACTCAGACTTCAATTTCTGTGATGGAGAAGACAACGGTGACAATGGACTGTGTGTATGAAACCCAGGACAGTTCTTACTTCTTATTCTGAAGCCAAAAAGTTCCATCGGACTCATCATCACCGCCACACAGATTGAGGACTCGGCAGTATATTTCTGTGCTATGAGAGGGTTATCTGGTAGCTTCAATAAGTTGACCTTTGGAGCAGGGACCAGACTGGCTGTGTGCCCATACATCCAGAACCCAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTCTCAG

cdr3.out:

    consensus_id    index_within_consensus  V_gene  D_gene  J_gene  C_gene  CDR1    CDR2    CDR3    CDR3_score  read_fragment_count CDR3_germline_similarity    full_length_assembly    barcode
17610   TGACCACTTGACCACTTACCACCA_47779  0   TRAV16*01   *   TRAJ4*02    TRAC    GAAACCCAGGACAGTTCTTAC   *   TATTTCTGTGCTATGAGAGGGTTATCTGGTAGCTTCAATAAGTTGA...   0.5 1.51    0.0 1   TGACCACTTGACCACTTACCACCA

Its CDR2 region is null. Is it a mistake for this sequence to be marked as full length assembly?

mourisl commented 3 years ago

Right, this one should not be marked as full length. I just found and fix the bug in the annotator. You can run TRUST4 with the option --stage 2 so it can restart from the annotation step, which should give you the correct full-length labeling. Thank you!

zhouxinseeu commented 3 years ago

I am still confused about the barcode_report.tsv. I want to know if barcode_report.tsv is related to cdr3.out. According to your statement, select the chain with the most reads to keep, then the results of chain1 and chain2 in barcode_report.tsv should be consistent with the filtering results of cdr3.out, but now it is found that the results of the two are not consistent. I don't know if there are any problems with my thoughts, and I hope you can answer my doubts.

top10 clone types in barcode_report.tsv:

TRA_cdr3aa  TRB_cdr3aa    Frequent
YLCAVLNYGNEKITF CASSSWGSQNTLYF  244
YYCALDITGNTGKLIF    CASSFWGNQDTQYF  166
CASSSWGSQNTLYF  91
YYCALRANYGSSGNKLIF  CASSLEPNYAEQFF  30
YYCALDITGNTGKLIF    CASSRDRGRAEQFF  30
YFCAVRGSGGSNAKLTF   CASSRDRGRAEQFF  28
YYCAMSEGYWQWWKTHF   CASSSWGSQNTLYF  17
YLCALEPDSGSWQLIF    CASSQRADSAETLYF 17
YYCAMSEGYWQWWKTHF   CASSFWGNQDTQYF  15
YYCALDITGNTGKLIF    CASSLEPNYAEQFF  14

top10 clone types in cdr3.out:

CDR3_x             CDR3_y    Frequent
YLCAVLNYGNEKITF    CASSSWGSQNTLYF          204
YYCALDITGNTGKLIF   CASSFWGNQDTQYF          153
                   CASSSWGSQNTLYF           73
YHCILRVKEGLRSSS    CASSSWGSQNTLYF           45
YFCAVRGSGGSNAKLTF  CASSRDRGRAEQFF           26
YYCALDITGNTGKLIF   CASSRDRGRAEQFF           24
YLCALEPDSGSWQLIF   CASSQRADSAETLYF          16
YYCALRANYGSSGNKLIF CASSLEPNYAEQFF           16
                   CAWGDKKTPCT              16
YYCAMSEGYWQWWKTHF  CASSFWGNQDTQYF           15

Thank you!

zhouxinseeu commented 3 years ago

There are still CDR1 null value in full length fasta file generated by GetFullLengthAssembly.pl. Is this correct? By the way, what does average_coverage specifically mean?

mourisl commented 3 years ago

Could you please share the _cdr3.out file and your script for generating the two tables. I will look into this. Thank you.

zhouxinseeu commented 3 years ago

partial cdr3.out:

ATCACGTTTACTAGTCTGCGATCT_0      0       TRAV6-6*01,TRAV6-7/DV9*01,TRAV6D-5*01   *       TRAJ37*01       TRAC    *       *       TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    3.71    100.00  0
TAGCTTGTTCGAGCGTGCATGGCT_1      0       TRAV9D-4*04     *       TRAJ42*01       TRAC    TCCTACTCTGGGACGCCT      AAGTACTATTCCGGAGACCCAGTG
        TACTTCTGTGCTGTGAGGGGTTCTGGAGGAAGCAATGCAAAGCTAACCTTC     1.00    289.82  100.00  1
CGATGTTTTGTACCTTTGAAGCCA_2      0       TRAV6D-6*02     *       TRAJ57*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACCACTGTATCCTGAGAGTAAAGGAGGGTCTGCGAAGCTCATCTTT 1.00    338.23  96.97   1
GATCAGCGTGTATGCGGCTCCTTG_4      0       TRAV6D-6*02     *       TRAJ48*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCGGAAATGAGAAAATAACTTTT        1.00    13.85   100.00  1
GATCAGCGTGTACCTTTTGTTCCA_5      0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    134.88  100.00  1
CAGATCTGTGACCACTGGTCGTGT_9      0       TRAV6D-6*02,TRAV6-7/DV9*01,TRAV6-6*01   *       TRAJ37*01       TRAC    *       *       TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    2.69    100.00  0
TGACCACTGCAATCCGTAAGTTCG_10     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    158.44  100.00  1
CGATGTTTGACGGATTGATAGAGG_11     0       TRAV6-6*01,TRAV6-7/DV9*01,TRAV6D-5*01   *       TRAJ37*01       TRAC    *       *       TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    2.33    100.00  0
TGACCACTTCGTTAGCTACCGAGC_12     0       *       *       TRAJ37*01       TRAC    *       *       TGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT      0.00    1.12    0.00    0
GCCAATGTTAGACGGATAAGTTCG_13     0       TRAV7-5*01,TRAV7D-5*01  *       TRAJ48*01       TRAC    AGTGATGGTACTTCTAAC      TCCATCTTCTCTGATGGT      TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT   1.00    50.18   97.50   1
TTAGGCATGATAGAGGGAATCTGT_15     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    92.12   100.00  1
CAGATCTGTCTCGGTTTTGACTCT_16     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    90.69   100.00  1
TTAGGCATTGACCACTGATCAGCG_18     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    99.06   100.00  1
TGACCACTGCAATCCGTATGTGGC_19     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    12.73   100.00  1
CGATGTTTGGTCGTGTTGTTCTCC_20     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    7.71    100.00  1
GCCAATGTGTGTCCTTTGCATAGT_21     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    4.77    100.00  1
GCCAATGTTGTTCTCCTGTGGTTG_22     0       TRAV6D-6*02     *       TRAJ42*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TGTGCTGTGAGGGGTTCTGGAGGAAGCAATGCAAAGCTAACCTTC   0.50    100.56  0.00    0
TTAGGCATGACGGATTTCGAGCGT_23     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    69.96   100.00  1
CGATGTTTTACCGAGCTAACGCTG_24     0       TRAV7-5*01,TRAV7D-5*01  *       TRAJ48*01       TRAC    AGTGATGGTACTTCTAAC      TCCATCTTCTCTGATGGT      TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT   1.00    3.56    97.50   1
ACTTGATGTCCGTCTTGACGGATT_25     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    12.31   100.00  0
ACAGTGGTGCACTGTCTCATCCTA_27     0       TRAV6D-6*02     *       TRAJ26*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATGCCCAGGGATTAACCTTC        1.00    5.23    100.00  1
GCCAATGTTACAGGATGAGGTGCT_28     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    14.15   100.00  1
TGACCACTTCGAAGTGCAGATCTG_29     0       TRAV6D-6*02     *       TRAJ37*01       TRAC    TCAGCCACAAGCATAGCTTACCCT        AAAGTCATTACGGCTGGCCAG   TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT        1.00    136.98  100.00  1

code for cdr3.out:

cdr3_out = pd.read_csv('/SGRNJ03/randd/zhouxin/zx_test/trust_test/NJU_Pt_2_2_3Nlib_1.0.4//02.trust_assemble/TRUST4/NJU_Pt_2_2_3Nlib_1.0.4_cdr3.out', sep='\t', names=['consensus_id','index_within_consensus','V_gene','D_gene','J_gene','C_gene','CDR1','CDR2','CDR3','CDR3_score','read_fragment_count','CDR3_germline_similarity','full_length_assembly'])

cdr3_out['barcode'] = cdr3_out['consensus_id'].apply(lambda x: x.split('_')[0])
#cdr3_out = cdr3_out[cdr3_out['full_length_assembly']==1]

tra = cdr3_out[cdr3_out['V_gene'].str.contains('TRA', na=False)]
tra = tra.sort_values(by='read_fragment_count', ascending=False)
tra = tra.drop_duplicates('barcode', 'first')

trb = cdr3_out[cdr3_out['V_gene'].str.contains('TRB', na=False)]
trb = trb.sort_values(by='read_fragment_count', ascending=False)
trb = trb.drop_duplicates('barcode', 'first')

cdr3_res = pd.merge(tra, trb, on='barcode', how='outer').fillna('None')
cdr3_res['CDR3_x'] = cdr3_res['CDR3_x'].apply(lambda x: str(Seq(x).translate()) if (x != 'None') else 'None')
cdr3_res['CDR3_y'] = cdr3_res['CDR3_y'].apply(lambda x: str(Seq(x).translate()) if (x != 'None') else 'None')

res2 = cdr3_res.groupby(['CDR3_x', 'CDR3_y']).agg({'barcode': 'count'})
res2 = res2.sort_values(by='barcode', ascending=False)
res2

code for barcode_report.tsv:

def beauty_res(outdir, barcode_report):
    res = pd.read_csv(barcode_report, sep='\t')
    rows = res.shape[0]
    loci = ['A', 'B']
    chians = ['chain2', 'chain1']
    for l in range(len(loci)):
        chain = chians[l]
        locus = loci[l]

        Vgenes, Dgenes, Jgenes, Cgenes, cdr3nts, cdr3aas, readcounts, fuls = [], [], [], [], [], [], [], []

        for i in range(rows):
            attr = res.loc[i, chain]
            attrs = attr.split(',')
            if len(attrs) == 10:
                V, D, J, C, cdr3nt, cdr3aa, readcount, fl = attrs[0], attrs[1], attrs[2], attrs[3], attrs[4], attrs[5], attrs[6], attrs[-1]
                Vgenes.append(V)
                Dgenes.append(D)
                Jgenes.append(J)
                Cgenes.append(C)
                cdr3nts.append(cdr3nt)
                cdr3aas.append(cdr3aa)
                readcounts.append(readcount)
                fuls.append(fl)
            elif len(attrs) != 10:
                Vgenes.append('NAN')
                Dgenes.append('NAN')
                Jgenes.append('NAN')
                Cgenes.append('NAN')
                cdr3nts.append('NAN')
                cdr3aas.append('NAN')
                readcounts.append('NAN')
                fuls.append('NAN')

        res[f'TR{locus}_V'] = Vgenes
        res[f'TR{locus}_D'] = Dgenes
        res[f'TR{locus}_J'] = Jgenes
        res[f'TR{locus}_C'] = Cgenes
        res[f'TR{locus}_cdr3nt'] = cdr3nts
        res[f'TR{locus}_cdr3aa'] = cdr3aas
        res[f'TR{locus}_readcount'] = readcounts
        res[f'TR{locus}_fl'] = fuls

    res.to_csv(f'{outdir}/new_barcode_report.tsv', sep='\t')

    return res

barcode_report = pd.read_csv('new_barcode_report.tsv', index_col=0, sep='\t')
res1 = barcode_report.groupby(['TRA_cdr3aa', 'TRB_cdr3aa']).agg({'#barcode': 'count'})
res1 = res1.sort_values(by='#barcode', ascending=False)
res1

If you need the complete file, please let me know and I can send it to your mailbox.

mourisl commented 3 years ago

I think the difference comes from the imputed CDR3. If two cells have the same TRB CDR3, and one of the cell has complete TRA CDR3 and the other cell has partial TRA CDR3, if the partial CDR3 is a substring from the complete one and the identified gene cassette usage is the same, TRUST4 will impute the remaining sequences by borrowing the information from another cell. In this case, the consensus id from the barcode_report file will tell which sequence it is imputed from. I tried your script on my own data, and the numbers match by disregarding the imputed CDR3. You can run the "perl trust-barcoderep.pl " with the option "--noImputation" to disable this feature.

zhouxinseeu commented 3 years ago

I have tried your suggestion. The result is much better than before, but there are still some inconsistencies. Regarding the final result, which one do you think is more reliable? barcode_report or cdr3.out?

new results:

TRA_cdr3aa  TRB_cdr3aa    Frequent
YLCAVLNYGNEKITF    CASSSWGSQNTLYF        242
YYCALDITGNTGKLIF   CASSFWGNQDTQYF        163
                   CASSSWGSQNTLYF         87
YYCALRANYGSSGNKLIF CASSLEPNYAEQFF         30
YYCALDITGNTGKLIF   CASSRDRGRAEQFF         29
YFCAVRGSGGSNAKLTF  CASSRDRGRAEQFF         28
YLCALEPDSGSWQLIF   CASSQRADSAETLYF        17
YYCAMSEGYWQWWKTHF  CASSFWGNQDTQYF         15
                   CASSSWGSQNTLYF         15
YYCALDITGNTGKLIF   None                   14

previous results:

TRA_cdr3aa  TRB_cdr3aa    Frequent
YLCAVLNYGNEKITF CASSSWGSQNTLYF  244
YYCALDITGNTGKLIF    CASSFWGNQDTQYF  166
     CASSSWGSQNTLYF       91
YYCALRANYGSSGNKLIF  CASSLEPNYAEQFF  30
YYCALDITGNTGKLIF    CASSRDRGRAEQFF  30
YFCAVRGSGGSNAKLTF   CASSRDRGRAEQFF  28
YYCAMSEGYWQWWKTHF   CASSSWGSQNTLYF  17
YLCALEPDSGSWQLIF    CASSQRADSAETLYF 17
YYCAMSEGYWQWWKTHF   CASSFWGNQDTQYF  15
YYCALDITGNTGKLIF    CASSLEPNYAEQFF  14
zhouxinseeu commented 3 years ago

I have found the reason. I lost the option --reportPartial. Now I can output the clone types table by analyzing barcode_report.tsv. Another question is that there are still some CDR1 missing in the full length assembly mentioned above. Is this normal? Thank you!

mourisl commented 3 years ago

Can you share the missing CDR1 assembly from the _annot.fa file? One possibility is that the assembly had too many mismatches, deletions and it could identify CDR1 region based on IMGT guideline.

zhouxinseeu commented 3 years ago

Hi, here are some assemblies with missing CDR1:

>TAGCTTGTGCAACATTTACCACCA_9114 413 2.10 TRAV7-5*03(163):(8-166):(0-158):100.00,TRAV7D-5*01(277):(0-166):(106-272):96.41 * TRAJ48*01(61):(170-227):(3-60):100.00 TRAC(935):(228-408):(0-164):86.71 CDR1(0-0):0.00=null CDR2(38-55):100.00=TCCATCTTCTCTGATGGT CDR3(152-196):100.00=TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT
ACAGACAGCATTCTGCGAAAGGCCTTGAGGTGCTAGTGTCCATCTTCTCTGATGGTGAAAAGGAAGAAGGCAGATTTACAGCTCACCTCAATAGAGCCAACTTGCATGTTTCCCTACACATCAGAGAACCACAACCCAGTGACTCTGCTGTCTACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTTGGGGCTGGAACCAAACTCACCATTAAACCCAACATCCAGAACCCAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTCTCAGGACAGCACCCTCTGCCTGTTCACCGACTTTGAATCCCTAATCAATGTTCAGAATACCATTGTATATTGAACGTTCATATCTGTCATTACTGTGATGGACATGAAATCTTTTGATTCAATGAGCTATGGTGCC
>ATCACGTTGCTCCTTGCAGATCTG_9130 274 0.60 TRAV7-5*03(163):(5-163):(0-158):100.00,TRAV7D-5*01(277):(0-163):(109-272):96.34,TRAV7N-5*01(277):(0-163):(109-272):94.51 * TRAJ48*01(61):(167-224):(3-60):100.00 TRAC(935):(225-273):(0-48):97.96 CDR1(0-0):0.00=null CDR2(35-52):100.00=TCCATCTTCTCTGATGGT CDR3(149-193):100.00=TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT
GACAGCATTCTGCGAAAGGCCTTGAGGTGCTAGTGTCCATCTTCTCTGATGGTGAAAAGGAAGAAGGCAGATTTACAGCTCACCTCAATAGAGCCAACTTGCATGTTTCCCTACACATCAGAGAACCACAACCCAGTGACTCTGCTGTCTACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTTGGGGCTGGAACCAAACTCACCATTAAACCCAACATCCAAAACCCAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTC
>CAGATCTGGCACTGTCTAAGCGTT_9134 242 1.50 TRAV7-5*03(163):(3-161):(0-158):100.00,TRAV7D-5*01(277):(0-161):(111-272):96.30,TRAV7N-5*01(277):(0-161):(111-272):94.44 * TRAJ48*01(61):(165-222):(3-60):100.00 TRAC(935):(223-241):(0-18):100.00 CDR1(0-0):0.00=null CDR2(33-50):100.00=TCCATCTTCTCTGATGGT CDR3(147-191):100.00=TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT
CAGCATTCTGCGAAAGGCCTTGAGGTGCTAGTGTCCATCTTCTCTGATGGTGAAAAGGAAGAAGGCAGATTTACAGCTCACCTCAATAGAGCCAACTTGCATGTTTCCCTACACATCAGAGAACCACAACCCAGTGACTCTGCTGTCTACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTTGGGGCTGGAACCAAACTCACCATTAAACCCAACATCCAGAACCCAGAACC

I have analyzed these sequence by Igblast and found that they all lost FR1 and CDR1 region. It is worth noting that in all assemblies, the V gene and J gene predicted by Igblast are consistent with those predicted by Trust4. I think the productive results are reliable. But they should not be marked as full length. The following table is a summary of the results of Igblast.

ID Representative query name Count Frequency(%) CDR3 nucleotide seq CDR3 amino acid seq Productive Chain type V gene D gene J gene
1a CGATGTTTTCGTTAGCTTGGTATG_14739 317 1.78 TRAV7 1 6.67 GCAGTGTTGAACTATGGAAATGAGAAAATAACT AVLNYGNEKIT No VA TRAV7-5*03 N/A TRAJ48*01
1b TAGCTTGTGCAACATTTACCACCA_9114 413 2.10 TRAV7- 7 46.7 GCAGTGTTGAACTATGGAAATGAGAAAATAACT AVLNYGNEKIT Yes VA TRAV7-5*03 N/A TRAJ48*01
2 TGACCACTGTTGTCGGGGTGAGTT_28159 290 85.74 TRAV 2 13.3 GCACCCGCATCTTCTGGCAGCTGGCAACTCATC APASSGSWQLI Yes VA TRAV7-3*02 N/A TRAJ22*01
3a CAGATCTGTCGTTAGCTCTCGGTT_110723 283 0.60 TRAV 1 6.67 GCAGTGTTGAACTATGGAAATGAGAAAATAACT AVLNYGNEKIT No VA TRAV7-5*01 N/A TRAJ48*01
3b TAGCTTGTTCCGTCTTTTCTGTGT_11501 282 1.80 TRAV7 1 6.67 GCAGTGTTGAACTATGGAAATGAGAAAATAACT AVLNYGNEKIT Yes VA TRAV7-5*01 N/A TRAJ48*01
4 TGACCACTTATGCCAGGGCTACAG_17809 433 28.17 TRAV 2 13.3 GCTTTGGAACCGGATTCTGGCAGCTGGCAACTCATC ALEPDSGSWQLI Yes VA TRAV13-1*02 N/A TRAJ22*01
5 TAGCTTGTTAACGCTGTCCGTCTT_17879 427 50.97 TRAV 1 6.67 GCACCCGCATCTTCTGGCAGCTGGCAACTCATC APASSGSWQLI Yes VA TRAV7-3*04 N/A TRAJ22*01
mourisl commented 3 years ago

Yes, there are some mouse V genes lacking of CDR1 portion in IMGT, which actually caused the issue #24 . From the annotation perspective, the assembly still covers all the V gene sequence in the database, so I think it is fair to call it full-length assembly. These V gene alleles might be biological meaningful, such as deficit TCR receptor, so IMGT keeps them in the database. Depends on the purpose, maybe one can further filter the full-length assemblies based on the existence of CDR1,2,3 as you did, or the assembly length.

zhouxinseeu commented 3 years ago

Thank you for your help. I also want to ask you about --repseq. This option can be used to accelerate TCR/BCR seq, but the result is different from not using this option. Is this result normal?

no --repseq:

     CloneId          TRA_CDR3aa       TRB_CDR3aa  Frequent Proportion
 0          1     YLCAVLNYGNEKITF   CASSSWGSQNTLYF       242      14.4%
 1          2    YYCALDITGNTGKLIF   CASSFWGNQDTQYF       163       9.7%
 2          3    YYCALDITGNTGKLIF   CASSSWGSQNTLYF        87      5.18%
 3          4  YYCALRANYGSSGNKLIF   CASSLEPNYAEQFF        30      1.78%
 4          5    YYCALDITGNTGKLIF   CASSRDRGRAEQFF        29      1.73%
 5          6   YFCAVRGSGGSNAKLTF   CASSRDRGRAEQFF        28      1.67%
 6          7    YLCALEPDSGSWQLIF  CASSQRADSAETLYF        17      1.01%
 7          8   YYCAMSEGYWQWWKTHF   CASSSWGSQNTLYF        15      0.89%
 8          9   YYCAMSEGYWQWWKTHF   CASSFWGNQDTQYF        15      0.89%
 9         10    YYCALDITGNTGKLIF   CASSLEPNYAEQFF        14      0.83%
 10        11    YYCALDITGNTGKLIF             None        14      0.83%

--repseq:

     CloneId          TRA_CDR3aa         TRB_CDR3aa  Frequent Proportion
 0          1     YLCAVLNYGNEKITF     CASSSWGSQNTLYF       245     14.57%
 1          2    YYCALDITGNTGKLIF     CASSFWGNQDTQYF       163       9.7%
 2          3    YYCALDITGNTGKLIF     CASSSWGSQNTLYF       100      5.95%
 3          4    YYCALDITGNTGKLIF     CASSRDRGRAEQFF        36      2.14%
 4          5  YYCALRANYGSSGNKLIF     CASSLEPNYAEQFF        30      1.78%
 5          6   YFCAVRGSGGSNAKLTF     CASSRDRGRAEQFF        28      1.67%
 6          7    YLCALEPDSGSWQLIF    CASSQRADSAETLYF        17      1.01%
 7          8    YYCALDITGNTGKLIF     CASSLEPNYAEQFF        17      1.01%
 8          9   YYCAMSEGYWQWWKTHF     CASSFWGNQDTQYF        16      0.95%
 9         10   YYCAMSEGYWQWWKTHF     CASSSWGSQNTLYF        15      0.89%
 10        11    YYCALDITGNTGKLIF               None        14      0.83%
mourisl commented 3 years ago

Thanks for the detailed tests and comparisons. Yes, when using --repseq, TRUST4 will drastically trim the portion of reads that are not in the annotation, so the assembly results will be a bit different. In my test, --repseq's results are slightly worse than the regular setting on bulk RNA-seq data. Maybe for the amplified data, --repseq works better by implicitly removing primers or other adapter sequences in the data, so the sensitivity in your data seems a bit higher for --repseq option. In sum, the results are expected to be a bit different.

zhouxinseeu commented 3 years ago

Okay, thank you for answering my confusion.