Closed dcarbajo closed 4 months ago
Thanks for the super prompt response! Yes, so my fastq files are per donor sample, with all the cells from that sample into one file; the idea is to reconstruct the TCR repertoire in that sample (in this case 38M reads from all the cells in that sample).
Is it possible as well to do the assembly by CDR3?
Does your read file contain any per-cell information? Or is it essentially bulk RNA-seq for each donor sample?
Yes, each fastq file is bulk for each donor sample
Is each cell a bulk RNA-seq or targeted TCR-seq? If it is bulk RNA-seq, it shouldn't be this slow. Though it's possible your data is T cell sorted, so there are many TCR reads? In this case, you can add the option "--repseq" to accelerate the procedure.
If your data is TCR-seq, is there any UMI sequence in your data?
Hi! Thanks again for the help, sorry I overlooked the "--repseq" option, I am trying it asap.
To confirm, my data comes from the SMARTer Human TCR a/b Profiling Kit v2
so it is bulk TCR-seq with UMIs. How shall I deal with the UMIs in this case? Cause I would still need a correct frequency estimation, if possible.
Many thanks again!
If you know the range of the UMI, you can treat it as a "barcode" and use "--barcodeLevel molecule" to run TRUST4 in the TCR-seq UMI mode. More details are in the https://github.com/liulab-dfci/TRUST4?tab=readme-ov-file#umi section. Essentially, you pass the read file containing the UMI to the --barcode option and use the --readFormat option to specify the range on the read that corresponds to the UMI. TRUST4 will then handle sequencing error correction and select the best assembly for each UMI. With these options, TRUST4 should be fast, and you don't need the "--repseq" option for acceleration unless it is still too slow.
Great! Thanks for the info. So the diagram for this SMARTer sequencing looks like this:
so we have the UMI in the first 12bp of the reads_2.fastq file.
Based on that, I guess that my final TRUST4 call should look like the following (correct me if I am wrong):
run-trust4 --barcodeLevel molecule \
-f hg38_bcrtcr.fa \
--ref human_IMGT+C.fa \
-1 read_1.fastq \
-2 read_2.fastq \
--barcode read_2.fastq \
--readFormat um:0:12 \
-o TRUST4
Many thanks again!
Almost, it would be:
run-trust4 --barcodeLevel molecule \
-f hg38_bcrtcr.fa \
--ref human_IMGT+C.fa \
-1 read_1.fastq \
-2 read_2.fastq \
--barcode read_2.fastq \
--readFormat bc:0:11,r2:12:-1 \
-o TRUST4
Depending on your kit, r2 may be r2:20:-1 if we don't include the 8bp GTAC and extra 4bp. I think you may also want to remove the first 28 bp from r1, as they may be primers. Therefore, a conservative readFormat option could be "--readFormat bc:0:11,r2:20:-1,r1:28:-1".
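To make the range arithmetic concrete, here is a minimal sketch of how a "start:end" range selects bases, assuming (as the correction from 0:12 to 0:11 for a 12 bp UMI suggests) that coordinates are 0-based and inclusive and that -1 means "to the end of the read". `slice_read` is a hypothetical helper written for illustration, not part of TRUST4, and the read sequence is made up.

```shell
#!/usr/bin/env bash
# Illustrative only: mimics how a "start:end" range picks bases from a read,
# assuming 0-based inclusive coordinates and -1 meaning "to the end".
# slice_read is a hypothetical helper, not a TRUST4 command.
slice_read() {
  local seq=$1 start=$2 end=$3
  if [ "$end" -eq -1 ]; then
    printf '%s\n' "${seq:start}"                # from start to end of read
  else
    printf '%s\n' "${seq:start:end-start+1}"    # inclusive range
  fi
}

# Made-up 30 bp read 2: first 12 bp stand in for the UMI.
read2="AAAACCCCGGGGTTTTTTTTACGTACGTAC"
slice_read "$read2" 0 11     # bc:0:11  -> the 12 bp UMI
slice_read "$read2" 12 -1    # r2:12:-1 -> everything after the UMI
slice_read "$read2" 20 -1    # r2:20:-1 -> also skips the next 8 bp
```

A range of 0:12 would select 13 bases under this convention, which is why the UMI range is written as 0:11.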
Great! I am going to try that out!
I do a little bit of pre-processing with Skewer, so I will check first how the exact numbers should go, but looks good!
Actually, when I set up the TRUST4 pipeline, I probably wouldn't even need to run Skewer first, right?
Can I just send the raw .fq.gz files to TRUST4 and specify the "--readFormat" option accordingly, without any prior adapter trimming?
Right, I don't think you need to run Skewer. TRUST4 internally will trim the adapters by detecting read-through events.
Thanks for all the help!
Quick question: is there a parameter for the run-trust4 call above that allows me to produce only the main outputs and not all the intermediate files (like what the smartseq wrapper does)?
Otherwise I have to remove them on the fly, because I run out of storage space very fast. Thanks!
It's not supported yet. So you may need to write your own script to remove the intermediate files. I will implement this feature in the next release.
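Until such an option is available, a cleanup step along these lines could be run after each sample. This is only a sketch: it keeps the three files the smartseq wrapper is described as producing in this thread (report.tsv, airr.tsv, annot.fa) and deletes every other file that starts with the output prefix. The `<prefix>_<name>` naming pattern and the helper name `clean_trust4_outputs` are assumptions, so check them against your own output directory before deleting anything.

```shell
#!/usr/bin/env bash
# Sketch: remove all files for one output prefix except the three main
# outputs. The "<prefix>_<name>" naming is an assumption; verify it first.
clean_trust4_outputs() {
  local dir=$1 prefix=$2 f
  for f in "$dir/${prefix}"_*; do
    [ -f "$f" ] || continue            # skip directories and unmatched globs
    case "${f##*/}" in
      "${prefix}_report.tsv" | "${prefix}_airr.tsv" | "${prefix}_annot.fa")
        ;;                             # main outputs: keep
      *)
        rm -- "$f"                     # anything else with this prefix: delete
        ;;
    esac
  done
}

# Example (hypothetical path): clean_trust4_outputs /path/to/results TRUST4
```

Files that do not start with the given prefix are left untouched, so the function can be pointed at a shared results directory.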
great to know! thanks
The feature for removing intermediate files has been added and is mentioned in thread #248, so I'll close this issue for now.
Hello, I am interested in starting to use TRUST4, and was wondering what the best approach is to run it on SMARTer data, in terms of parameters, pre-processing, etc.
I ran a subset of 100K paired-end reads from one of my samples, using fastq files named sub1.fastq and sub2.fastq as inputs.
I noticed you have a wrapper for SMART-Seq data, and wondered whether this would be suitable for my case.
To try things out, I ran TRUST4 in 2 ways:
"Default method":
run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 sub1.fastq -2 sub2.fastq -o test1
"SMART-Seq wrapper":
perl trust-smartseq.pl -1 sub1_list.txt -2 sub2_list.txt -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -o test2
(where the txt files just list the locations of sub1.fastq and sub2.fastq)
While the "default method" produces several outputs (in fasta, tsv, and .out format), the SMART-Seq wrapper only produces the report.tsv, the airr.tsv, and the annot.fa.
What concerns me more is that the "default" report retrieves several TCRs, while the SMART-Seq wrapper only shows the top 2, and the count numbers differ. See below:
"Default method":![Screenshot 2024-02-01 at 11 51 10](https://github.com/liulab-dfci/TRUST4/assets/54791796/cad9e0e4-b723-4883-9bd4-a6e2cc02a9df)
"SMART-Seq wrapper":![Screenshot 2024-02-01 at 11 52 23](https://github.com/liulab-dfci/TRUST4/assets/54791796/0078b0cf-1f12-4273-8046-11bf9caba2c4)
I would appreciate it if you could help me understand what the wrapper does under the hood, why the results are so different, and mainly what the best way to use TRUST4 on SMARTer data would be.
On a side note, could you explain what the consensus id full length means and why it is almost always 0 and sometimes 1?
Additionally, the tests above used a subset (100K reads) of one sample, but the sample itself is ~38M reads. I am running the full sample with the "default method", but it has been running for 2 days and is still going (and I have 58 samples). What would be the best way to speed things up, if possible?
Many thanks!
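For the 58-sample case, a loop like the following could generate one command per sample using the UMI options suggested in the replies. It is a dry-run sketch that only prints the commands: the `<sample>_1.fq.gz` / `<sample>_2.fq.gz` file naming, the sample names, and the helper name `trust4_cmd` are hypothetical, and the --readFormat value is the conservative one proposed in this thread, which should be checked against the kit's read layout.

```shell
#!/usr/bin/env bash
# Sketch (dry run): print one run-trust4 command per sample.
# File naming and sample names are hypothetical; the --readFormat value is
# the conservative suggestion from this thread and may need adjusting.
trust4_cmd() {
  local sample=$1
  printf '%s ' run-trust4 --barcodeLevel molecule \
    -f hg38_bcrtcr.fa --ref human_IMGT+C.fa \
    -1 "${sample}_1.fq.gz" -2 "${sample}_2.fq.gz" \
    --barcode "${sample}_2.fq.gz" \
    --readFormat bc:0:11,r2:20:-1,r1:28:-1 \
    -o "$sample"
  printf '\n'
}

for s in donorA donorB; do
  trust4_cmd "$s"
done
```

Printing the commands first makes it easy to inspect them, or to hand them to a job scheduler, before committing compute time to all samples.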