Best way to run TRUST4 on SMARTer data

dcarbajo commented 5 months ago

Hello, I am interested in starting to use TRUST4 and was wondering what the best approach is for running it on SMARTer data, in terms of parameters, pre-processing, etc.

I ran a subset of 100K paired-end reads from one of my samples, using the fastq files sub1.fastq and sub2.fastq as inputs.

I noticed you have a wrapper for SMART-Seq data, and wondered whether this would be suitable for my case.

To try things out, I ran TRUST4 in 2 ways:

"Default method": run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 sub1.fastq -2 sub2.fastq -o test1

"SMART-Seq wrapper": perl trust-smartseq.pl -1 sub1_list.txt -2 sub2_list.txt -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -o test2 (where the txt files just list the location of sub1.fastq and sub2.fastq)

While the "default method" produces several outputs (in fasta, tsv, and .out format), the SMART-Seq wrapper only produces the report.tsv, the airr.tsv, and the annot.fa.

What concerns me more is that the "default" report retrieves several TCRs, whereas the SMART-Seq wrapper only shows the top 2, and the count numbers differ. See below:

"Default method": Screenshot 2024-02-01 at 11 51 10

"SMART-Seq wrapper": Screenshot 2024-02-01 at 11 52 23

I would appreciate it if you could help me understand what the wrapper does under the hood, why the results are so different, and mainly what the best way to use TRUST4 on SMARTer data would be.

On a side note, could you explain what the consensus id full length means and why it is almost always 0 and sometimes 1?

Additionally, I am running this on a subset (100K reads) of one sample, but the sample itself is ~38M reads. I am also running the full sample with the "default method", but it has been running for 2 days and is still going (and I have 58 samples). What would be the best way to speed things up, if possible?

Many thanks!

mourisl commented 5 months ago

For SMART-seq-like data, we expect only one pair of chains per cell, so the wrapper selects the pair with the highest abundance as the representative (this can be changed through the --representative option) after running regular TRUST4 internally. The extra TCRs are likely to be the other, non-functional chain, sequencing artifacts, or assembly artifacts.
The other files, like _final.out, are intermediate files, so I did not keep them in the smartseq wrapper.
Since the number of BCR/TCR reads is likely to be high in SMART-seq data and there is no class switch recombination within a cell, there is no need to extend the contigs with mate-pair information. You can see the "--skipMateExtension" option in the smartseq wrapper. This produces different assembly results from default TRUST4 and may affect the abundance estimation.
cid_full_length indicates whether the corresponding contig (cid) is full length or not. Full length means the contig spans from the 5' end of the V gene to the 3' end of the J gene.
Does SMARTer-seq put all the cells' data into one fastq file, or does each cell have its own fastq file? Do you mean you have 38M reads for one cell?
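
As an illustration of the selection and the cid_full_length flag described above, here is a minimal Python sketch. It is not the wrapper's actual code. It assumes the report has the tab-separated columns listed in COLUMNS (check this against your own report.tsv), and the helper names read_report, chain_of, pick_representatives and is_full_length are hypothetical.

```python
import csv

# Assumed column layout of a TRUST4 report.tsv; verify against your own file.
COLUMNS = ["count", "frequency", "CDR3nt", "CDR3aa", "V", "D", "J", "C",
           "cid", "cid_full_length"]


def read_report(handle):
    """Parse report rows into dicts, skipping the '#'-prefixed header line."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        row = dict(zip(COLUMNS, fields))
        row["count"] = float(row["count"])
        rows.append(row)
    return rows


def chain_of(row):
    """Chain type from the gene name prefix, e.g. 'TRA' or 'TRB'."""
    gene = row["V"] if row["V"] not in ("*", ".") else row["J"]
    return gene[:3]


def pick_representatives(rows):
    """Keep only the highest-count row for each chain type."""
    best = {}
    for row in rows:
        chain = chain_of(row)
        if chain not in best or row["count"] > best[chain]["count"]:
            best[chain] = row
    return best


def is_full_length(row):
    """True when the contig is flagged as covering V 5' end to J 3' end."""
    return row["cid_full_length"] == "1"
```

Applied to the "default method" report, pick_representatives would reduce the table to one row per chain type, which is roughly why the wrapper output shows only the top 2 entries.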

dcarbajo commented 5 months ago

Thanks for the super prompt response! Yes, so my fastq files are per donor sample, with all the cells from that sample in one file; the idea is to reconstruct the TCR repertoire in that sample (in this case 38M reads from all the cells in that sample).

Is it possible as well to do the assembly by CDR3?

mourisl commented 5 months ago

Does your read file carry any information about the cell of origin? Or is it essentially bulk RNA-seq for each donor sample?

dcarbajo commented 5 months ago

Yes, each fastq file is bulk for each donor sample

mourisl commented 5 months ago

Is each sample bulk RNA-seq or targeted TCR-seq? If it is bulk RNA-seq, it shouldn't be this slow. It's possible, though, that your data is T-cell sorted, so there are many TCR reads. In that case, you can add the option "--repseq" to speed things up.

If your data is TCR-seq, is there any UMI sequence in your data?

dcarbajo commented 5 months ago

Hi! Thanks again for the help, sorry I overlooked the "--repseq" option, I am trying it asap. To confirm, my data comes from the SMARTer Human TCR a/b Profiling Kit v2 so it is bulk TCR-seq with UMIs. How shall I deal with the UMIs in this case? Cause I would still need a correct frequency estimation, if possible. Many thanks again!

mourisl commented 5 months ago

If you know the range of the UMI, you can treat it as a "barcode" and use "--barcodeLevel molecule" to run TRUST4 in the TCR-seq UMI mode. More details are in the https://github.com/liulab-dfci/TRUST4?tab=readme-ov-file#umi section. Essentially, you pass the read file containing the UMI to the --barcode option and use the --readFormat option to specify the range on the read that corresponds to the UMI. TRUST4 will then handle sequencing error correction and select the best assembly for each UMI. With these options, TRUST4 should be fast, and you don't need the "--repseq" option for acceleration unless it is still too slow.
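
To show why molecule-level counting matters for frequency estimation, here is a toy Python sketch that contrasts read counts with distinct-UMI counts per CDR3. It is only an illustration of the idea, not TRUST4's algorithm, and it skips the UMI error correction that TRUST4 performs. The function names and the (umi, cdr3) input layout are assumptions made for this example.

```python
from collections import defaultdict


def read_counts(read_assignments):
    """Count reads per CDR3 from (umi, cdr3) pairs, one pair per read."""
    counts = defaultdict(int)
    for _umi, cdr3 in read_assignments:
        counts[cdr3] += 1
    return dict(counts)


def molecule_counts(read_assignments):
    """Count distinct UMIs per CDR3, so PCR duplicates collapse to one."""
    umis = defaultdict(set)
    for umi, cdr3 in read_assignments:
        umis[cdr3].add(umi)
    return {cdr3: len(seen) for cdr3, seen in umis.items()}
```

A clonotype amplified heavily from a single molecule looks abundant by read count but contributes only one UMI, so the two counts can rank clonotypes differently.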

dcarbajo commented 5 months ago

Great! Thanks for the info. So the diagram for this SMARTer sequencing looks like this:

[image: SMARTer-Human-TCRv2-dark]

So the UMI is in the first 12 bp of each read in the read_2.fastq file.

Based on that, I guess that my final TRUST4 call should look like the following (correct me if I am wrong):

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat um:0:12 \
           -o TRUST4

Many thanks again!

mourisl commented 5 months ago

Almost, it would be:

run-trust4 --barcodeLevel molecule \
           -f hg38_bcrtcr.fa \
           --ref human_IMGT+C.fa \
           -1 read_1.fastq \
           -2 read_2.fastq \
           --barcode read_2.fastq \
           --readFormat bc:0:11,r2:12:-1 \
           -o TRUST4

Depending on your kit, r2 may be r2:20:-1 if we don't include the 8bp GTAC and extra 4bp. I think you may also want to remove the first 28 bp from r1, as they may be primers. Therefore, a conservative readFormat option could be "--readFormat bc:0:11,r2:20:-1,r1:28:-1".
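
The ranges in --readFormat can be checked by hand with a small Python sketch. It assumes, based on the correction from 0:12 to 0:11 above, that ranges are 0-based with an inclusive end and that -1 means "to the end of the read". The helpers parse_read_format and apply_range are hypothetical and only mimic that reading of the option; they are not TRUST4 code.

```python
def parse_read_format(spec):
    """Turn 'bc:0:11,r2:20:-1' into {'bc': (0, 11), 'r2': (20, -1)}."""
    ranges = {}
    for field in spec.split(","):
        name, start, end = field.split(":")
        ranges[name] = (int(start), int(end))
    return ranges


def apply_range(seq, start, end):
    """Slice with a 0-based inclusive end; -1 means 'to the end of the read'."""
    return seq[start:] if end == -1 else seq[start:end + 1]


# Toy read 2: 12 bp UMI, 8 placeholder bases to skip, then the insert.
umi, skipped, insert = "ACGTACGTACGT", "NNNNNNNN", "CAGCAGCAG"
read2 = umi + skipped + insert

fmt = parse_read_format("bc:0:11,r2:20:-1,r1:28:-1")
extracted_umi = apply_range(read2, *fmt["bc"])     # the 12 bp UMI
extracted_insert = apply_range(read2, *fmt["r2"])  # bases from position 20 on
```

Under the inclusive-end assumption, a range of 0:12 would take 13 bases rather than 12, which is why the UMI range is written as 0:11.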

dcarbajo commented 5 months ago

Great! I am going to try that out!

I do a little bit of pre-processing with Skewer, so I will check first how the exact numbers should go, but looks good!

Actually, once I set up the TRUST4 pipeline, I probably wouldn't even need to run Skewer first, right?

Can I just send to TRUST4 the raw .fq.gz files and specify the "--readFormat" option accordingly without any prior adaptor trimming then?

mourisl commented 5 months ago

Right, I don't think you need to run Skewer. TRUST4 internally will trim the adapters by detecting read-through events.
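
For readers unfamiliar with read-through detection, the general idea can be sketched in Python as follows. This is a generic illustration with exact matching only, not TRUST4's implementation (real trimmers tolerate mismatches). The function names and the min_overlap default are assumptions made for this example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_read_through(r1, r2, min_overlap=10):
    """Trim both mates to the fragment length if they read through it.

    When the fragment is shorter than the read, the first n bases of r1
    equal the reverse complement of the first n bases of r2, and anything
    after position n is adapter sequence.
    """
    longest = min(len(r1), len(r2))
    for n in range(longest, min_overlap - 1, -1):
        if r1[:n] == revcomp(r2[:n]):
            return r1[:n], r2[:n]
    return r1, r2  # no read-through detected, leave the pair unchanged
```

Because the adapter is located from the overlap of the two mates, this kind of trimming does not need to know the adapter sequence in advance.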

dcarbajo commented 5 months ago

Thanks for all the help!

dcarbajo commented 5 months ago

Quick question: is there a parameter with the run-trust4 call above that allows me to only produce the main outputs and not all the intermediate files (like what the smartseq wrapper does)? Otherwise I have to remove them on the fly, cause I run out of storage space very fast. Thanks!

mourisl commented 5 months ago

It's not supported yet. So you may need to write your own script to remove the intermediate files. I will implement this feature in the next release.
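
A clean-up script along these lines could look like the Python sketch below. It assumes the outputs are named <prefix>_<something> inside the output directory, and it keeps only the three files mentioned earlier in this thread as retained by the smartseq wrapper (report.tsv, airr.tsv, annot.fa). The function name and the dry_run switch are hypothetical; run it in dry-run mode first and check the list before deleting anything.

```python
from pathlib import Path

# Suffixes named in this thread as the outputs the smartseq wrapper keeps.
KEEP_SUFFIXES = ("_report.tsv", "_airr.tsv", "_annot.fa")


def remove_intermediates(out_dir, prefix, keep=KEEP_SUFFIXES, dry_run=True):
    """Delete files named '<prefix>_*' in out_dir unless they end in a kept suffix.

    Returns the names of the files that were (or, in dry-run mode, would be)
    removed. Files that do not start with '<prefix>_' are never touched.
    """
    removed = []
    for path in sorted(Path(out_dir).glob(prefix + "_*")):
        if path.is_file() and not path.name.endswith(tuple(keep)):
            removed.append(path.name)
            if not dry_run:
                path.unlink()
    return removed
```

For the run above with -o TRUST4, a call such as remove_intermediates(".", "TRUST4") would list the candidates, and adding dry_run=False would actually delete them.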

dcarbajo commented 5 months ago

great to know! thanks

mourisl commented 4 months ago

The feature for removing intermediate files has been added and is mentioned in thread #248. So I'll close this issue for now.
