liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
272 stars 47 forks source link

Uploading to IMGT HighV-Quest #97

Open loverend opened 2 years ago

loverend commented 2 years ago

Hello!

Just wondered - of the fa output files generated from TRUST4 which would be the one to upload to IMGT high V-quest? We use this in an in-house pipeline and like the extra information this can provide above V/D/J assignment. Would it be _assembled_reads.fa or _annot.fa? Also by this point have they been condensed with a count of the number of times the sequence occurs or is that done after?

Thanks,

mourisl commented 2 years ago

You can use the _annot.fa file. For the abundance/count information, you can obtain that from the _cdr3.out file using the sequence id as the index.

loverend commented 2 years ago

Thank you - assuming the count would be the number of duplications of the assembly id eg assemble4134 appearing in that CDR3 file? Would it make sense to upload these files to IMGT? I am interested for example in SHM and the information it can provide on nb of mutations across different regions of the V gene which I don't think(?) Trust4 provides as a standard output? We have matching multiplex PCR TCR/BCR RNAseq which we ran through IMGT so I am hopeful I can try and match/compare these.

mourisl commented 2 years ago

The third last column in the _cdr3 file is the abundance. You need to sum up the values for a particular sequence id.

To obtain base-level resolution alignment in TRUST4, you can extract the full-length assemblies from the annot file based on the last column of the _cdr3 file, and then rerun the annotator program with option "--geneAlignment", it will output the base level alignment information for V, D, J, C genes separated by space: "=" is for match, "ACGT" is mismatch, "acgt" is insertion, and "\" is deletion.

loverend commented 2 years ago

Hi - I have come back to do some more on this and wondered if you could explain please how the different count columns are generated. For example for report.tsv assemble 1190 the first count column is 1: 1 5.589684e-04 TGTGTGAGGGAAGACGAGGCCTACAATTGG CVREDEAYNW IGHV4-38-201 IGHD1-101 IGHJ4*02 IGHA1 assemble1190 0

for _cdr3: assemble1190 0 IGHV4-38-201,IGHV4-30-204 IGHD1-101 IGHJ402,IGHJ502 IGHA101 TGTGTGAGGGAAGAAGAGGCCTACAATTGG 1.00 4.00 88.89 0 assemble1190 1 IGHV4-38-201,IGHV4-30-204 IGHD1-101 IGHJ402,IGHJ502 IGHA101 TGTGTGAGGGAAGACGAGGCCTACAATTGG 1.00 1.00 88.89 0

I am a bit confused as the report file would suggest the count is 1 but using the cdr3 file abundance the sum of the 3 last column would suggest the count is 5?

Edit - Ah I see the assemble order changes between the files so they are no longer adjacent. Just to confirm 5 would be the correct read cout to provide with the assemble1190 fastq sequence from annot for IMGT? Thanks

mourisl commented 2 years ago

The reason for why their abundance is not summed is because the CDR3 nucleotide sequences are different. The CDR3 with abundance 1 is the minor CDR3 encoded in the consensus of assembly1190. If you want to use the assembly sequence containing V,J sequences for "TGTGTGAGGGAAGACGAGGCCTACAATTGG", you can use the sequence from the new AIRR output from TRUST4, which automatically replaces the assembly1190 CDR3 region with the minor CDR3.

loverend commented 2 years ago

I am yet to update TRUST4 but I can do that.

Essentially I want to run the constructed contigs from BULK RNAseq through IMGT after I have done some trimming (V/J trim of the contigs) to make them match an in house-multiplex PCR repertoire sequencing experiment that we have done which has essentially fixed lengths for sequences which are then aligned to IMGT. Currently because I have all different lengths of V gene etc from TRUST4 its not directly comparable as V gene assignment can change depending on the number of bp included etc and measures we take such as number of mutations from IMGT reference rely on them all being a similar length. I have written custom code to trim down the output of TRUST4 reads but I need to preserve the count information to make it comparable to our own counts.

Essentially I want to encode the count information from TRUST4 in the sequence header of the TRUST4 (custom trimmed) sequences which I provide to IMGT. Would the summed counts from the contigs that are pooled into the consensus (so including minor CDR3) from the CDR3 file for all assembly's of the same name be the appropriate one to take?

mourisl commented 2 years ago

The count in the cdr3 and report is with respect to distinct CDR3s. If you want to the read count for each contig, including the minor CDR3s, you shall sum up based on the ids in the trust_cdr3.out file. In the example you showed before, you are right that the count for that contig is 5.

loverend commented 2 years ago

Thank you for your help on this! Can I check to understand what is the cause of fractional counts for CDR3s? e.g. assemble0 0 IGKV1-1302,IGKV1D-1302 IGKJ201 IGKC*01 CAGGGCATTAGGAGTGCT GAAGCCTCC TGTCAACAATTTAATAGTTACCCGTACACTTTT 1 348.62 96.77 1