Closed zhouxinseeu closed 3 years ago
Could you please show me how you sum the read count? Did you sum up the 7th subcolumn in the comma-splitter entries in the barcode_report file? Thank you.
I extract read count from the 7th subcolumn in the chain1
and chain2
.
The code for extraction:
import pandas as pd
res = pd.read_csv('/SGRNJ03/randd/zhouxin/zx_test/TRUST4/nju_test/nju_test_barcode_report.tsv', sep='\t')
rows = res.shape[0]
TRB_Vs, TRB_Ds, TRB_Js, TRB_Cs, TRB_cdr3nts, TRB_cdr3aas, TRB_readcounts, TRB_fls = [], [], [], [], [], [], [], []
for i in range(rows):
TRBs = res.loc[i, 'chain1']
TRBs = TRBs.split(',')
if len(TRBs) == 10:
TRB_V, TRB_D, TRB_J, TRB_C, TRB_cdr3nt, TRB_cdr3aa, TRB_readcount, TRB_fl = TRBs[0], TRBs[1], TRBs[2], TRBs[3], TRBs[4], TRBs[5], TRBs[6], TRBs[-1]
TRB_Vs.append(TRB_V)
TRB_Ds.append(TRB_D)
TRB_Js.append(TRB_J)
TRB_Cs.append(TRB_C)
TRB_cdr3nts.append(TRB_cdr3nt)
TRB_cdr3aas.append(TRB_cdr3aa)
TRB_readcounts.append(TRB_readcount)
TRB_fls.append(TRB_fl)
elif len(TRBs) != 10:
TRB_Vs.append('NAN')
TRB_Ds.append('NAN')
TRB_Js.append('NAN')
TRB_Cs.append('NAN')
TRB_cdr3nts.append('NAN')
TRB_cdr3aas.append('NAN')
TRB_readcounts.append('NAN')
TRB_fls.append('NAN')
res['TRB_V'] = TRB_Vs
res['TRB_D'] = TRB_Ds
res['TRB_J'] = TRB_Js
res['TRB_C'] = TRB_Cs
res['TRB_cdr3nt'] = TRB_cdr3nts
res['TRB_cdr3aa'] = TRB_cdr3aas
res['TRB_readcount'] = TRB_readcounts
res['TRB_fl'] = TRB_fls
The new result is like:
#barcode cell_type \
0 TAGCTTGTTCGAAGTGTTCTGTGT abT
1 ACAGTGGTTCTGCTGTTTGGTATG abT
2 CGATGTTTTTGCGTACTATGTGGC abT
3 TGACCACTGGCTACAGTCAGGAGG abT
4 CGATGTTTTCCTCAATTGTGGTTG abT
chain1 \
0 TRBV29*01,TRBD2*01,TRBJ2-1*01,TRBC1,TGTGCTAGCA...
1 TRBV14*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCAGCA...
2 TRBV17*01,TRBD1*01,TRBJ2-1*01,TRBC1,TGTGCTAGCA...
3 TRBV14*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCAGCA...
4 TRBV14*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCAGCA...
chain2 \
0 TRAV9N-3*01,*,TRAJ30*01,TRAC,TACTTCTGTGCTGTGAG...
1 TRAV7-5*01,*,TRAJ48*01,TRAC,TACCTCTGTGCAGTGTTG...
2 TRAV16N*01,*,TRAJ34*02,TRAC,TATTTCTGTGCTATGAGG...
3 TRAV9D-4*04,*,TRAJ32*01,TRAC,TACTTCTGTGCTGTGAG...
4 TRAV6-6*01,*,TRAJ37*01,TRAC,TACTACTGTGCTCTGGAT...
secondary_chain1 \
0 TRBV14*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCAGCA...
1 TRBV31*01,TRBD2*01,TRBJ2-4*01,TRBC1,TGTGCCTGGG...
2 TRBV16*01,TRBD1*01,TRBJ2-5*01,TRBC1,TGTGCAAGCA...
3 TRBV14*01,TRBD2*01,TRBJ2-2*01,TRBC1,TGTGCCAGCA...
4 *
secondary_chain2 TRB_V TRB_D \
0 TRAV3N-3*01,*,TRAJ18*01,TRAC,TACTTCTGCGCAGAAGG... TRBV29*01 TRBD2*01
1 TRAV6-6*01,*,TRAJ37*01,TRAC,TACTACTGTGCTCTGGAT... TRBV14*01 TRBD2*01
2 TRAV4N-4*01,*,TRAJ21*01,TRAC,TACTTCTGTGCTCCCAT... TRBV17*01 TRBD1*01
3 TRAV6D-6*02,*,TRAJ37*01,TRAC,TACTACTGTGCTCTGGA... TRBV14*01 TRBD2*01
4 TRAV6-6*01,*,TRAJ22*01,TRAC,TACTACTGTGCTCTGGGT... TRBV14*01 TRBD2*01
TRB_J TRB_C ... TRB_readcount TRB_fl TRA_V TRA_D \
0 TRBJ2-1*01 TRBC1 ... 14.18 1 TRAV9N-3*01 *
1 TRBJ2-4*01 TRBC1 ... 7.10 1 TRAV7-5*01 *
2 TRBJ2-1*01 TRBC1 ... 17.60 1 TRAV16N*01 *
3 TRBJ2-4*01 TRBC1 ... 91.60 1 TRAV9D-4*04 *
4 TRBJ2-4*01 TRBC1 ... 3.79 0 TRAV6-6*01 *
TRA_J TRA_C TRA_cdr3nt \
0 TRAJ30*01 TRAC TACTTCTGTGCTGTGAGCATGAAGGGCACAAATGCTTACAAAGTCA...
1 TRAJ48*01 TRAC TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT
2 TRAJ34*02 TRAC TATTTCTGTGCTATGAGGGGCTCTTCCAATACCAACAAAGTCGTCTTT
3 TRAJ32*01 TRAC TACTTCTGTGCTGTGAGGACGAATTATGGGAGCAGTGGCAACAAGC...
4 TRAJ37*01 TRAC TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT
TRA_cdr3aa TRA_readcount TRA_fl
0 YFCAVSMKGTNAYKVIF 24.31 0
1 YLCAVLNYGNEKITF 13.73 1
2 YFCAMRGSSNTNKVVF 4.00 1
3 YFCAVRTNYGSSGNKLIF 79.20 0
4 YYCALDITGNTGKLIF 21.04 0
[5 rows x 22 columns]
The code for sum:
import numpy as np
trbreads = [float(i) for i in res['TRA_readcount'].tolist()]
npl = np.array(trbreads, dtype=np.float)
npl = npl[~np.isnan(npl)]
npl.sum()
Is there any problem with my process?
Your script looks fine for me. I assume this is a 5' 10x kit. Then most of the reads will be on the 5' end, which is in the V gene region, and TRUST4 will assemble those regions too. So it is normal that only a small fraction of assembled reads can fall in the CDR3 region.
I just noticed an issue in your command, for a typical 5' 10x kit, the barcode range should be "0 15 +", and the other 8 nucleotides are for UMI. If you use all of them, it will overcount the number of cells.
Thank you for your quick reply.
So you mean read count in the 7th subcolumn in the comma-splitter entries of barcode report file aims to CDR3 region, not TRA/TRB full length including V(D),J region?
As for the barcode problem you mentioned, the kit I used is not 10X genomics, but Singleon GEXSCOPE Immune Repertoire Profiling kit, of which barcode is 24 bases. It captures mRNA from the 3'end. Does this also affect the results?
Yes, the 7th column is only the CDR3 read count. I don't know how much reads from GEXSCOPE 3' kit are fall into the 3' end, like in constant gene. TRUST4 assembles part of the constant gene too, so those assembled reads could from J and constant genes. From my experience with 10x data, 3' kit has very few read covering the CDR3 region, so the sensitivity is much lower comparing with using 5' kit.
Thank you for your suggestion. I will follow this problem.
By rereading your file, the few lines in the barcode_report file have both chains, which are very rare in 10x 3' kit. So I'm wondering did you apply TRUST4 on the GEXSCOPE 3'kit data for the transcriptome data part or the immune profiling data part? If it is immune profiling, then it might be the case that many assembled reads are in the V or constant gene region. Thank you.
Hi, mouris!
Sorry for the late reply. My data is the immune profiling data, not transcriptome data. Does trust4 need special process or cmd for immune profiling data? I saw that trust4 has --repseq
argument. Is it for immune profiling data?
By the way, when I checked the results of trust4, I found that one cell (barcode) failed for both full length TRA and TRB chains in the barcode_report.tsv
. However, I can find full length TRA and TRB chains of the same cell (barcode) in annot.fa
. I compared the read count of cdr3 in these two results, and found that chain1 and chain2 in barcode_report.tsv
have the highest number of reads. Do you keep TRA or TRB based on the number of cdr3 reads?
Thank you!
Since TCR/BCR-seq data may have too many reads and TRUST4 can be very slow ,--repseq is just for speed up,
Could you please share the two chains' assemblies in the _annot.fa file with its header? One possible reason is that the CDR3 is not annotated correctly, and got filtered out when generating barcode_report.tsv file. Within the same barcode, chain1 and chain2 should have the highest number of supporting read, and other CDR3 with fewer read supporting would be in the last two columns.
Chains head in annot.fa
TRA chain:
>GCCAATGTTATGTGGCGACCTTAG_9059 381 11.10 TRAV7-5*01(277):(0-176):(96-272):96.61,TRAV7D-5*01(277):(0-176):(96-272):96.61 * TRAJ48*01(61):(180-237):(3-60):100.00 TRAC(935):(238-380):(0-142):100.00 CDR1(0-0):0.00=null CDR2(48-65):100.00=TCCATCTTCTCTGATGGT CDR3(162-206):100.00=TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT TTCAGGTGGTACAGACAGCATTCTGCGAAAGGCCTTGAGGTGCTAGTGTCCATCTTCTCTGATGGTGAAAAGGAAGAAGGCAGATTTACAGCTCACCTCAATAGAGCCAACTTGCATGTTTCCCTACACATCAGAGAACCACAACCCAGTGACTCTGCTGTCTACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTTGGGGCTGGAACCAAACTCACCATTAAACCCAACATCCAGAACCCAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTCTCAGGACAGCACCCTCTGCCTGTTCACCGACTTTGACTCCCAAATCAATGTGCCGAAAACCATGGAATCTGGAACGTTCATCACTGACAAAACT
TRB chain:
>GCCAATGTTATGTGGCGACCTTAG_20225 276 30.59 TRBV14*01(289):(0-104):(181-285):100.00 TRBD2*01(14):(106-110):(4-8):100.00 TRBJ2-4*01(49):(113-161):(0-48):100.00 TRBC1(737):(162-261):(0-99):100.00,TRBC2(692):(162-261):(0-99):100.00 CDR1(0-0):0.00=null CDR2(0-0):0.00=null CDR3(92-133):100.00=TGTGCCAGCAGTTCCTGGGTCAGTCAAAACACCTTGTACTTT CCTCGGATCGATTTTCTGCTGTGAGGCCTAAAGGAACTAACTCCACTCTCAAGATCCAGTCTGCAAAGCAGGGCGACACAGCCACCTATCTCTGTGCCAGCAGTTCCTGGGTCAGTCAAAACACCTTGTACTTTGGTGCGGGCACCCGACTATCGGTGCTAGAGGATCTGAGAAATGTGACTCCACCCAAGGTCTCCTTGTTTGAGCCATCAAAAGCAGAGATTGCAAACAAACAAAAGGCTACCCTCGTGTGCTTGGCCAGATCGGAAGAGCACA
Full length chain but not in barcode_report.tsv:
TRA:
>GCCAATGTTATGTGGCGACCTTAG_2199 543 19.02 TRAV6D-6*02(287):(78-360):(0-282):100.00,TRAV6-6*03(282):(81-360):(0-279):97.14 * TRAJ37*01(60):(364-423):(0-59):100.00 TRAC(935):(424-542):(0-118):95.80 CDR1(153-176):100.00=TCAGCCACAAGCATAGCTTACCCT CDR2(228-248):100.00=AAAGTCATTACGGCTGGCCAG CDR3(345-392):100.00=TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT ACTTTCTAGATGACACTAAAGATGGACTCTTCTCCAGGCTTCGTGGCTGTGATACTTCTCATACTTGGAAGGACCCACGGAGATTCCGTGACTCAAACAGAAGGCCCAGTGACCGTCTCAGAAAGCGAGTCCCTGATAATAAATTGCACGTATTCAGCCACAAGCATAGCTTACCCTAATCTTTTCTGGTATGTTCGATATCCTGGAGAAGGTCTACAACTCCTCCTGAAAGTCATTACGGCTGGCCAGAAGGGAAGCAGCAGAGGGTTTGAAGCCACATACAATAAAGAAACCACCTCCTTCCACTTGCAGAAAGCCTCAGTGCAAGAGTCAGACTCGGCTGTGTACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTTGGACTGGGGACAACTTTACAAGTGCAACCAGACATCCAGAACCAAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTCTCAGGACAGCACCCTCTGCCTGTTCACAGACTTTGACTCCCAAATCAATGTGCAGAAACCCAGGGAATCT
TRB:
>GCCAATGTTATGTGGCGACCTTAG_24585 580 18.56 TRBV19*01(286):(166-451):(0-285):100.00 * TRBJ2-4*01(49):(454-502):(0-48):100.00 TRBC1(737):(503-579):(0-76):100.00,TRBC2(692):(503-579):(0-76):100.00 CDR1(244-258):100.00=TTCAATCATGATACA CDR2(310-327):100.00=TCAATAACTGAAAACGAT CDR3(436-474):100.00=TGTGCCAGCAGTATAGTCAGTCAAAACACCTTGTACTTT ATTCTAGTCATCATAGTTAGCCAGGTGGAGCCCCTGCATAGCGCGTCTCCTTGTTTCTCTTTTAACTAATGCCCAGAGCCAAAGAAAGTCCCTCCAAACTATGAACAAGTGGGTTTTCTGCTGGGTAACCCTTTGTCTCCTTACTGTAGAGACCACACATGGTGATGGTGGCATCATTACTCAGACACCCAAATTCCTGATTGGTCAGGAAGGGCAAAAACTGACCTTGAAATGTCAACAGAATTTCAATCATGATACAATGTACTGGTACCGACAGGATTCAGGGAAAGGATTGAGACTGATCTACTATTCAATAACTGAAAACGATCTTCAAAAAGGCGATCTATCTGAAGGCTATGATGCGTCTCGAGAGAAGAAGTCATCTTTTTCTCTCACTGTGACATCTGCCCAGAAGAACGAGATGGCCGTTTTTCTCTGTGCCAGCAGTATAGTCAGTCAAAACACCTTGTACTTTGGTGCGGGCACCCGACTATCGGTGCTAGAGGATCTGAGAAATGTGACTCCACCCAAGGTCTCCTTGTTTGAGCCATCAAAAGCAGAGATTGCAAACAAACAAAAG
As you noticed that chain 1 and chain 2 should have the most reads in the barcode_report.tsv, if I filter the results by average_coverage in annot.fa, I should get the same results as barcodr_report.tsv, right?
The actual abundance is in the third last column of the _cdr3.out file, where TRUST4 will realign the assembled reads to quantify each CDR3. Are you using the newest version of TRUST4, there could be a wrong abundance estimation in some earlier versions.
Well, my version is 1.0.2. I will update it to the newest version and check the results again.
I have downloaded TRUST4 of v1.0.4, but the version in the software is still v1.0.2.
$ ls TRUST4-1.0.4
alignments.hpp BuildDatabaseFa.pl FastqExtractor.o KmerCount.hpp ReadFiles.hpp trust-barcoderep-to-10X.pl
annotator BuildImgtAnnot.pl FilterAnnotatedAssembly.pl KmerIndex.hpp README.md trust-cluster.py
Annotator.cpp BuildImgtVquestAnnot.pl GetFullLengthAssembly.pl kseq.h run-trust4 trust-simplerep.pl
Annotator.o Cgene.list hg19_bcrtcr.fa LICENSE.txt samtools-0.1.19 trust-smartseq.pl
bam-extractor defs.h hg38_bcrtcr.fa main.cpp SeqSet.hpp trust-stats.py
BamExtractor.cpp example human_IMGT+C.fa main.o SimpleVector.hpp
BamExtractor.o fastq-extractor human_vdjc.list Makefile trust4
$ run-trust4
TRUST4 v1.0.2 usage: ./run-trust4 [OPTIONS]:
Required:
-b STRING: path to bam file
-1 STRING -2 STRING: path to paired-end read files
-u STRING: path to single-end read file
-f STRING: path to the fasta file coordinate and sequence of V/D/J/C genes
Optional:
--ref STRING: path to detailed V/D/J/C gene reference file from IMGT database. (default: not used but recommended)
-o STRING: prefix of output files. (default: inferred from file prefix)
--od STRING: the directory for output files. (default: ./)
-t INT: number of threads (default: 1)
--barcode STRING: if -b, bam field for barcode; if -1 -2/-u, file containing barcodes (default: not used)
--barcodeRange INT INT CHAR: start, end(-1 for length-1), strand in a barcode is the true barcode (default: 0 -1 +)
--barcodeWhitelist STRING: path to the barcode whitelist (default: not used)
--read1Range INT INT: start, end(-1 for length-1) in -1/-u files for genomic sequence (default: 0 -1)
--read2Range INT INT: start, end(-1 for length-1) in -2 files for genomic sequence (default: 0 -1)
--mateIdSuffixLen INT: the suffix length in read id for mate. (default: not used)
--skipMateExtension: do not extend assemblies with mate information, useful for SMART-seq (default: not used)
--abnormalUnmapFlag: the flag in BAM for the unmapped read-pair is nonconcordant (default: not set)
--noExtraction: directly use the files from provided -1 -2/-u to assemble (default: extraction first)
--repseq: the data is from TCR-seq or BCR-seq (default: not set)
--stage INT: start TRUST4 on specified stage (default: 0):
0: start from beginning (candidate read extraction)
1: start from assembly
2: start from annotation
3: start from generating the report table
My bad, I forgot to update the version number in the wrapper, and the code should be 1.0.4. I just fixed this version number issue. Thank you.
I have checked the new results from TRUST4 v1.0.4.
I got full length assembly fasta file by GetFullLengthAssembly.pl. Then I filtered the full length cells (barcodes) by average_coverage in comment. I got 1577 cells finally.
Also, I filtered _cdr3.out by read_fragment_count and full_length_assembly. I kept the most read_fragment_count and full length assembly. I got 1578 cells.
I checked the different cell and found it lost in the full length assembly fasta file, which means this cell is marked as full length in _cdr3.out but no full length assembly.
I found this sequence in _annot.fa:
>TGACCACTTGACCACTTACCACCA_47779 434 19.54 TRAV16*01(290):(132-238):(0-106):100.00 * TRAJ4*02(63):(319-380):(1-62):100.00 TRAC(935):(381-433):(0-52):100.00 CDR1(207-227):100.00=GAAACCCAGGACAGTTCTTAC CDR2(0-0):0.00=null CDR3(299-349):50.00=TATTTCTGTGCTATGAGAGGGTTATCTGGTAGCTTCAATAAGTTGACCTTT
ACAAAGTATAGATTAAGATCTGGTTGAGAGAGAACATAGACAACTTCACTCAAGACCAGAGCTAACAGTATGCTGATTCTAAGCCTGTTGGGAGCTCCATTTTTTGGCTCCATTTGTTTTGCAACCAGCATGGCCCAGAAGGTAACACAGACTCAGACTTCAATTTCTGTGATGGAGAAGACAACGGTGACAATGGACTGTGTGTATGAAACCCAGGACAGTTCTTACTTCTTATTCTGAAGCCAAAAAGTTCCATCGGACTCATCATCACCGCCACACAGATTGAGGACTCGGCAGTATATTTCTGTGCTATGAGAGGGTTATCTGGTAGCTTCAATAAGTTGACCTTTGGAGCAGGGACCAGACTGGCTGTGTGCCCATACATCCAGAACCCAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTCTCAG
cdr3.out:
consensus_id index_within_consensus V_gene D_gene J_gene C_gene CDR1 CDR2 CDR3 CDR3_score read_fragment_count CDR3_germline_similarity full_length_assembly barcode
17610 TGACCACTTGACCACTTACCACCA_47779 0 TRAV16*01 * TRAJ4*02 TRAC GAAACCCAGGACAGTTCTTAC * TATTTCTGTGCTATGAGAGGGTTATCTGGTAGCTTCAATAAGTTGA... 0.5 1.51 0.0 1 TGACCACTTGACCACTTACCACCA
Its CDR2 region is null. Is it a mistake for this sequence to be marked as full length assembly?
Right, this one should not be marked as full length. I just found and fix the bug in the annotator. You can run TRUST4 with the option --stage 2 so it can restart from the annotation step, which should give you the correct full-length labeling. Thank you!
I am still confused about the barcode_report.tsv. I want to know if barcode_report.tsv is related to cdr3.out. According to your statement, select the chain with the most reads to keep, then the results of chain1 and chain2 in barcode_report.tsv should be consistent with the filtering results of cdr3.out, but now it is found that the results of the two are not consistent. I don't know if there are any problems with my thoughts, and I hope you can answer my doubts.
top10 clone types in barcode_report.tsv:
TRA_cdr3aa TRB_cdr3aa Frequent
YLCAVLNYGNEKITF CASSSWGSQNTLYF 244
YYCALDITGNTGKLIF CASSFWGNQDTQYF 166
CASSSWGSQNTLYF 91
YYCALRANYGSSGNKLIF CASSLEPNYAEQFF 30
YYCALDITGNTGKLIF CASSRDRGRAEQFF 30
YFCAVRGSGGSNAKLTF CASSRDRGRAEQFF 28
YYCAMSEGYWQWWKTHF CASSSWGSQNTLYF 17
YLCALEPDSGSWQLIF CASSQRADSAETLYF 17
YYCAMSEGYWQWWKTHF CASSFWGNQDTQYF 15
YYCALDITGNTGKLIF CASSLEPNYAEQFF 14
top10 clone types in cdr3.out:
CDR3_x CDR3_y Frequent
YLCAVLNYGNEKITF CASSSWGSQNTLYF 204
YYCALDITGNTGKLIF CASSFWGNQDTQYF 153
CASSSWGSQNTLYF 73
YHCILRVKEGLRSSS CASSSWGSQNTLYF 45
YFCAVRGSGGSNAKLTF CASSRDRGRAEQFF 26
YYCALDITGNTGKLIF CASSRDRGRAEQFF 24
YLCALEPDSGSWQLIF CASSQRADSAETLYF 16
YYCALRANYGSSGNKLIF CASSLEPNYAEQFF 16
CAWGDKKTPCT 16
YYCAMSEGYWQWWKTHF CASSFWGNQDTQYF 15
Thank you!
There are still CDR1 null value in full length fasta file generated by GetFullLengthAssembly.pl. Is this correct? By the way, what does average_coverage
specifically mean?
Could you please share the _cdr3.out file and your script for generating the two tables. I will look into this. Thank you.
partial cdr3.out:
ATCACGTTTACTAGTCTGCGATCT_0 0 TRAV6-6*01,TRAV6-7/DV9*01,TRAV6D-5*01 * TRAJ37*01 TRAC * * TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 3.71 100.00 0
TAGCTTGTTCGAGCGTGCATGGCT_1 0 TRAV9D-4*04 * TRAJ42*01 TRAC TCCTACTCTGGGACGCCT AAGTACTATTCCGGAGACCCAGTG
TACTTCTGTGCTGTGAGGGGTTCTGGAGGAAGCAATGCAAAGCTAACCTTC 1.00 289.82 100.00 1
CGATGTTTTGTACCTTTGAAGCCA_2 0 TRAV6D-6*02 * TRAJ57*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACCACTGTATCCTGAGAGTAAAGGAGGGTCTGCGAAGCTCATCTTT 1.00 338.23 96.97 1
GATCAGCGTGTATGCGGCTCCTTG_4 0 TRAV6D-6*02 * TRAJ48*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCGGAAATGAGAAAATAACTTTT 1.00 13.85 100.00 1
GATCAGCGTGTACCTTTTGTTCCA_5 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 134.88 100.00 1
CAGATCTGTGACCACTGGTCGTGT_9 0 TRAV6D-6*02,TRAV6-7/DV9*01,TRAV6-6*01 * TRAJ37*01 TRAC * * TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 2.69 100.00 0
TGACCACTGCAATCCGTAAGTTCG_10 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 158.44 100.00 1
CGATGTTTGACGGATTGATAGAGG_11 0 TRAV6-6*01,TRAV6-7/DV9*01,TRAV6D-5*01 * TRAJ37*01 TRAC * * TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 2.33 100.00 0
TGACCACTTCGTTAGCTACCGAGC_12 0 * * TRAJ37*01 TRAC * * TGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 0.00 1.12 0.00 0
GCCAATGTTAGACGGATAAGTTCG_13 0 TRAV7-5*01,TRAV7D-5*01 * TRAJ48*01 TRAC AGTGATGGTACTTCTAAC TCCATCTTCTCTGATGGT TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT 1.00 50.18 97.50 1
TTAGGCATGATAGAGGGAATCTGT_15 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 92.12 100.00 1
CAGATCTGTCTCGGTTTTGACTCT_16 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 90.69 100.00 1
TTAGGCATTGACCACTGATCAGCG_18 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 99.06 100.00 1
TGACCACTGCAATCCGTATGTGGC_19 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 12.73 100.00 1
CGATGTTTGGTCGTGTTGTTCTCC_20 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 7.71 100.00 1
GCCAATGTGTGTCCTTTGCATAGT_21 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 4.77 100.00 1
GCCAATGTTGTTCTCCTGTGGTTG_22 0 TRAV6D-6*02 * TRAJ42*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TGTGCTGTGAGGGGTTCTGGAGGAAGCAATGCAAAGCTAACCTTC 0.50 100.56 0.00 0
TTAGGCATGACGGATTTCGAGCGT_23 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 69.96 100.00 1
CGATGTTTTACCGAGCTAACGCTG_24 0 TRAV7-5*01,TRAV7D-5*01 * TRAJ48*01 TRAC AGTGATGGTACTTCTAAC TCCATCTTCTCTGATGGT TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT 1.00 3.56 97.50 1
ACTTGATGTCCGTCTTGACGGATT_25 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 12.31 100.00 0
ACAGTGGTGCACTGTCTCATCCTA_27 0 TRAV6D-6*02 * TRAJ26*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATGCCCAGGGATTAACCTTC 1.00 5.23 100.00 1
GCCAATGTTACAGGATGAGGTGCT_28 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 14.15 100.00 1
TGACCACTTCGAAGTGCAGATCTG_29 0 TRAV6D-6*02 * TRAJ37*01 TRAC TCAGCCACAAGCATAGCTTACCCT AAAGTCATTACGGCTGGCCAG TACTACTGTGCTCTGGATATAACAGGCAATACCGGAAAACTCATCTTT 1.00 136.98 100.00 1
code for cdr3.out:
cdr3_out = pd.read_csv('/SGRNJ03/randd/zhouxin/zx_test/trust_test/NJU_Pt_2_2_3Nlib_1.0.4//02.trust_assemble/TRUST4/NJU_Pt_2_2_3Nlib_1.0.4_cdr3.out', sep='\t', names=['consensus_id','index_within_consensus','V_gene','D_gene','J_gene','C_gene','CDR1','CDR2','CDR3','CDR3_score','read_fragment_count','CDR3_germline_similarity','full_length_assembly'])
cdr3_out['barcode'] = cdr3_out['consensus_id'].apply(lambda x: x.split('_')[0])
#cdr3_out = cdr3_out[cdr3_out['full_length_assembly']==1]
tra = cdr3_out[cdr3_out['V_gene'].str.contains('TRA', na=False)]
tra = tra.sort_values(by='read_fragment_count', ascending=False)
tra = tra.drop_duplicates('barcode', 'first')
trb = cdr3_out[cdr3_out['V_gene'].str.contains('TRB', na=False)]
trb = trb.sort_values(by='read_fragment_count', ascending=False)
trb = trb.drop_duplicates('barcode', 'first')
cdr3_res = pd.merge(tra, trb, on='barcode', how='outer').fillna('None')
cdr3_res['CDR3_x'] = cdr3_res['CDR3_x'].apply(lambda x: str(Seq(x).translate()) if (x != 'None') else 'None')
cdr3_res['CDR3_y'] = cdr3_res['CDR3_y'].apply(lambda x: str(Seq(x).translate()) if (x != 'None') else 'None')
res2 = cdr3_res.groupby(['CDR3_x', 'CDR3_y']).agg({'barcode': 'count'})
res2 = res2.sort_values(by='barcode', ascending=False)
res2
code for barcode_report.tsv:
def beauty_res(outdir, barcode_report):
res = pd.read_csv(barcode_report, sep='\t')
rows = res.shape[0]
loci = ['A', 'B']
chians = ['chain2', 'chain1']
for l in range(len(loci)):
chain = chians[l]
locus = loci[l]
Vgenes, Dgenes, Jgenes, Cgenes, cdr3nts, cdr3aas, readcounts, fuls = [], [], [], [], [], [], [], []
for i in range(rows):
attr = res.loc[i, chain]
attrs = attr.split(',')
if len(attrs) == 10:
V, D, J, C, cdr3nt, cdr3aa, readcount, fl = attrs[0], attrs[1], attrs[2], attrs[3], attrs[4], attrs[5], attrs[6], attrs[-1]
Vgenes.append(V)
Dgenes.append(D)
Jgenes.append(J)
Cgenes.append(C)
cdr3nts.append(cdr3nt)
cdr3aas.append(cdr3aa)
readcounts.append(readcount)
fuls.append(fl)
elif len(attrs) != 10:
Vgenes.append('NAN')
Dgenes.append('NAN')
Jgenes.append('NAN')
Cgenes.append('NAN')
cdr3nts.append('NAN')
cdr3aas.append('NAN')
readcounts.append('NAN')
fuls.append('NAN')
res[f'TR{locus}_V'] = Vgenes
res[f'TR{locus}_D'] = Dgenes
res[f'TR{locus}_J'] = Jgenes
res[f'TR{locus}_C'] = Cgenes
res[f'TR{locus}_cdr3nt'] = cdr3nts
res[f'TR{locus}_cdr3aa'] = cdr3aas
res[f'TR{locus}_readcount'] = readcounts
res[f'TR{locus}_fl'] = fuls
res.to_csv(f'{outdir}/new_barcode_report.tsv', sep='\t')
return res
barcode_report = pd.read_csv('new_barcode_report.tsv', index_col=0, sep='\t')
res1 = barcode_report.groupby(['TRA_cdr3aa', 'TRB_cdr3aa']).agg({'#barcode': 'count'})
res1 = res1.sort_values(by='#barcode', ascending=False)
res1
If you need the complete file, please let me know and I can send it to your mailbox.
I think the difference comes from the imputed CDR3. If two cells have the same TRB CDR3, and one of the cell has complete TRA CDR3 and the other cell has partial TRA CDR3, if the partial CDR3 is a substring from the complete one and the identified gene cassette usage is the same, TRUST4 will impute the remaining sequences by borrowing the information from another cell. In this case, the consensus id from the barcode_report file will tell which sequence it is imputed from. I tried your script on my own data, and the numbers match by disregarding the imputed CDR3. You can run the "perl trust-barcoderep.pl " with the option "--noImputation" to disable this feature.
I have tried your suggestion. The result is much better than before, but there are still some inconsistencies. Regarding the final result, which one do you think is more reliable? barcode_report or cdr3.out?
new results:
TRA_cdr3aa TRB_cdr3aa Frequent
YLCAVLNYGNEKITF CASSSWGSQNTLYF 242
YYCALDITGNTGKLIF CASSFWGNQDTQYF 163
CASSSWGSQNTLYF 87
YYCALRANYGSSGNKLIF CASSLEPNYAEQFF 30
YYCALDITGNTGKLIF CASSRDRGRAEQFF 29
YFCAVRGSGGSNAKLTF CASSRDRGRAEQFF 28
YLCALEPDSGSWQLIF CASSQRADSAETLYF 17
YYCAMSEGYWQWWKTHF CASSFWGNQDTQYF 15
CASSSWGSQNTLYF 15
YYCALDITGNTGKLIF None 14
previous results:
TRA_cdr3aa TRB_cdr3aa Frequent
YLCAVLNYGNEKITF CASSSWGSQNTLYF 244
YYCALDITGNTGKLIF CASSFWGNQDTQYF 166
CASSSWGSQNTLYF 91
YYCALRANYGSSGNKLIF CASSLEPNYAEQFF 30
YYCALDITGNTGKLIF CASSRDRGRAEQFF 30
YFCAVRGSGGSNAKLTF CASSRDRGRAEQFF 28
YYCAMSEGYWQWWKTHF CASSSWGSQNTLYF 17
YLCALEPDSGSWQLIF CASSQRADSAETLYF 17
YYCAMSEGYWQWWKTHF CASSFWGNQDTQYF 15
YYCALDITGNTGKLIF CASSLEPNYAEQFF 14
I have found the reason. I lost the option --reportPartial
. Now I can output the clone types table by analyzing barcode_report.tsv. Another question is that there are still some CDR1 missing in the full length assembly mentioned above. Is this normal?
Thank you!
Can you share the missing CDR1 assembly from the _annot.fa file? One possibility is that the assembly had too many mismatches, deletions and it could identify CDR1 region based on IMGT guideline.
Hi, here are some assemblies with missing CDR1:
>TAGCTTGTGCAACATTTACCACCA_9114 413 2.10 TRAV7-5*03(163):(8-166):(0-158):100.00,TRAV7D-5*01(277):(0-166):(106-272):96.41 * TRAJ48*01(61):(170-227):(3-60):100.00 TRAC(935):(228-408):(0-164):86.71 CDR1(0-0):0.00=null CDR2(38-55):100.00=TCCATCTTCTCTGATGGT CDR3(152-196):100.00=TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT
ACAGACAGCATTCTGCGAAAGGCCTTGAGGTGCTAGTGTCCATCTTCTCTGATGGTGAAAAGGAAGAAGGCAGATTTACAGCTCACCTCAATAGAGCCAACTTGCATGTTTCCCTACACATCAGAGAACCACAACCCAGTGACTCTGCTGTCTACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTTGGGGCTGGAACCAAACTCACCATTAAACCCAACATCCAGAACCCAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTCTCAGGACAGCACCCTCTGCCTGTTCACCGACTTTGAATCCCTAATCAATGTTCAGAATACCATTGTATATTGAACGTTCATATCTGTCATTACTGTGATGGACATGAAATCTTTTGATTCAATGAGCTATGGTGCC
>ATCACGTTGCTCCTTGCAGATCTG_9130 274 0.60 TRAV7-5*03(163):(5-163):(0-158):100.00,TRAV7D-5*01(277):(0-163):(109-272):96.34,TRAV7N-5*01(277):(0-163):(109-272):94.51 * TRAJ48*01(61):(167-224):(3-60):100.00 TRAC(935):(225-273):(0-48):97.96 CDR1(0-0):0.00=null CDR2(35-52):100.00=TCCATCTTCTCTGATGGT CDR3(149-193):100.00=TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT
GACAGCATTCTGCGAAAGGCCTTGAGGTGCTAGTGTCCATCTTCTCTGATGGTGAAAAGGAAGAAGGCAGATTTACAGCTCACCTCAATAGAGCCAACTTGCATGTTTCCCTACACATCAGAGAACCACAACCCAGTGACTCTGCTGTCTACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTTGGGGCTGGAACCAAACTCACCATTAAACCCAACATCCAAAACCCAGAACCTGCTGTGTACCAGTTAAAAGATCCTCGGTC
>CAGATCTGGCACTGTCTAAGCGTT_9134 242 1.50 TRAV7-5*03(163):(3-161):(0-158):100.00,TRAV7D-5*01(277):(0-161):(111-272):96.30,TRAV7N-5*01(277):(0-161):(111-272):94.44 * TRAJ48*01(61):(165-222):(3-60):100.00 TRAC(935):(223-241):(0-18):100.00 CDR1(0-0):0.00=null CDR2(33-50):100.00=TCCATCTTCTCTGATGGT CDR3(147-191):100.00=TACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTT
CAGCATTCTGCGAAAGGCCTTGAGGTGCTAGTGTCCATCTTCTCTGATGGTGAAAAGGAAGAAGGCAGATTTACAGCTCACCTCAATAGAGCCAACTTGCATGTTTCCCTACACATCAGAGAACCACAACCCAGTGACTCTGCTGTCTACCTCTGTGCAGTGTTGAACTATGGAAATGAGAAAATAACTTTTGGGGCTGGAACCAAACTCACCATTAAACCCAACATCCAGAACCCAGAACC
I have analyzed these sequence by Igblast and found that they all lost FR1 and CDR1 region. It is worth noting that in all assemblies, the V gene and J gene predicted by Igblast are consistent with those predicted by Trust4. I think the productive results are reliable. But they should not be marked as full length. The following table is a summary of the results of Igblast.
ID | Representative query name | Count | Frequency(%) | CDR3 nucleotide seq | CDR3 amino acid seq | Productive | Chain type | V gene | D gene | J gene |
---|---|---|---|---|---|---|---|---|---|---|
1a | CGATGTTTTCGTTAGCTTGGTATG_14739 317 1.78 TRAV7 | 1 | 6.67 | GCAGTGTTGAACTATGGAAATGAGAAAATAACT | AVLNYGNEKIT | No | VA | TRAV7-5*03 | N/A | TRAJ48*01 |
1b | TAGCTTGTGCAACATTTACCACCA_9114 413 2.10 TRAV7- | 7 | 46.7 | GCAGTGTTGAACTATGGAAATGAGAAAATAACT | AVLNYGNEKIT | Yes | VA | TRAV7-5*03 | N/A | TRAJ48*01 |
2 | TGACCACTGTTGTCGGGGTGAGTT_28159 290 85.74 TRAV | 2 | 13.3 | GCACCCGCATCTTCTGGCAGCTGGCAACTCATC | APASSGSWQLI | Yes | VA | TRAV7-3*02 | N/A | TRAJ22*01 |
3a | CAGATCTGTCGTTAGCTCTCGGTT_110723 283 0.60 TRAV | 1 | 6.67 | GCAGTGTTGAACTATGGAAATGAGAAAATAACT | AVLNYGNEKIT | No | VA | TRAV7-5*01 | N/A | TRAJ48*01 |
3b | TAGCTTGTTCCGTCTTTTCTGTGT_11501 282 1.80 TRAV7 | 1 | 6.67 | GCAGTGTTGAACTATGGAAATGAGAAAATAACT | AVLNYGNEKIT | Yes | VA | TRAV7-5*01 | N/A | TRAJ48*01 |
4 | TGACCACTTATGCCAGGGCTACAG_17809 433 28.17 TRAV | 2 | 13.3 | GCTTTGGAACCGGATTCTGGCAGCTGGCAACTCATC | ALEPDSGSWQLI | Yes | VA | TRAV13-1*02 | N/A | TRAJ22*01 |
5 | TAGCTTGTTAACGCTGTCCGTCTT_17879 427 50.97 TRAV | 1 | 6.67 | GCACCCGCATCTTCTGGCAGCTGGCAACTCATC | APASSGSWQLI | Yes | VA | TRAV7-3*04 | N/A | TRAJ22*01 |
Yes, there are some mouse V genes lacking of CDR1 portion in IMGT, which actually caused the issue #24 . From the annotation perspective, the assembly still covers all the V gene sequence in the database, so I think it is fair to call it full-length assembly. These V gene alleles might be biological meaningful, such as deficit TCR receptor, so IMGT keeps them in the database. Depends on the purpose, maybe one can further filter the full-length assemblies based on the existence of CDR1,2,3 as you did, or the assembly length.
Thank you for your help. I also want to ask you about --repseq
. This option can be used to accelerate TCR/BCR seq, but the result is different from not using this option. Is this result normal?
no --repseq:
CloneId TRA_CDR3aa TRB_CDR3aa Frequent Proportion
0 1 YLCAVLNYGNEKITF CASSSWGSQNTLYF 242 14.4%
1 2 YYCALDITGNTGKLIF CASSFWGNQDTQYF 163 9.7%
2 3 YYCALDITGNTGKLIF CASSSWGSQNTLYF 87 5.18%
3 4 YYCALRANYGSSGNKLIF CASSLEPNYAEQFF 30 1.78%
4 5 YYCALDITGNTGKLIF CASSRDRGRAEQFF 29 1.73%
5 6 YFCAVRGSGGSNAKLTF CASSRDRGRAEQFF 28 1.67%
6 7 YLCALEPDSGSWQLIF CASSQRADSAETLYF 17 1.01%
7 8 YYCAMSEGYWQWWKTHF CASSSWGSQNTLYF 15 0.89%
8 9 YYCAMSEGYWQWWKTHF CASSFWGNQDTQYF 15 0.89%
9 10 YYCALDITGNTGKLIF CASSLEPNYAEQFF 14 0.83%
10 11 YYCALDITGNTGKLIF None 14 0.83%
--repseq:
CloneId TRA_CDR3aa TRB_CDR3aa Frequent Proportion
0 1 YLCAVLNYGNEKITF CASSSWGSQNTLYF 245 14.57%
1 2 YYCALDITGNTGKLIF CASSFWGNQDTQYF 163 9.7%
2 3 YYCALDITGNTGKLIF CASSSWGSQNTLYF 100 5.95%
3 4 YYCALDITGNTGKLIF CASSRDRGRAEQFF 36 2.14%
4 5 YYCALRANYGSSGNKLIF CASSLEPNYAEQFF 30 1.78%
5 6 YFCAVRGSGGSNAKLTF CASSRDRGRAEQFF 28 1.67%
6 7 YLCALEPDSGSWQLIF CASSQRADSAETLYF 17 1.01%
7 8 YYCALDITGNTGKLIF CASSLEPNYAEQFF 17 1.01%
8 9 YYCAMSEGYWQWWKTHF CASSFWGNQDTQYF 16 0.95%
9 10 YYCAMSEGYWQWWKTHF CASSSWGSQNTLYF 15 0.89%
10 11 YYCALDITGNTGKLIF None 14 0.83%
Thanks for the detailed tests and comparisons. Yes, when using --repseq, TRUST4 will drastically trim the portion of reads that are not in the annotation, so the assembly results will be a bit different. In my test, --repseq's results are slightly worse than the regular setting on bulk RNA-seq data. Maybe for the amplified data, --repseq works better by implicitly removing primers or other adapter sequences in the data, so the sensitivity in your data seems a bit higher for --repseq option. In sum, the results are expected to be a bit different.
Okay, thank you for answering my confusion.
Hi,
I have analyzed a simple data by command:
/SGRNJ03/randd/zhouxin/software/TRUST4/run-trust4 -t 10 -u ./NJU_Pt_2_2_3Nlib/r2.fq --barcode ./NJU_Pt_2_2_3Nlib/r1.fq --barcodeRange 0 23 + -f /SGRNJ03/randd/zhouxin/software/TRUST4/mouse/GRCm38_bcrtcr.fa --ref /SGRNJ03/randd/zhouxin/software/TRUST4/mouse/mouse_IMGT+C.fa -o nju_test --od ./nju_test
I get results like:Now, I want to calculate reads count of TRA and TRB. I have tried to sum reads count in nju_test_barcode_report.tsv and found it only accounts for 3% of assembled reads. From the results, the read count in nju_test_barcode_report.tsv is normalized, right? If I want to get the original read count of TRA and TRB, is there any good way?
Thank you for your time!