liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

increasing time for each 100k batch of reads #161

Open mehdiborji opened 1 year ago

mehdiborji commented 1 year ago

I am using TRUST4 on the FASTQ files of a non-enriched 3' single-cell library. After the first step (read extraction), this is how the assembly process unfolds. The time between each 100k batch of candidate reads seems to keep increasing. At this rate it will probably take days to complete. I wonder what is going wrong here.

[Sat Oct 22 18:43:47 2022] Processed 100000 reads (0 are used for assembly).
[Sat Oct 22 18:43:47 2022] Processed 200000 reads (0 are used for assembly).
[Sat Oct 22 18:43:47 2022] Processed 300000 reads (0 are used for assembly).
[Sat Oct 22 18:43:47 2022] Processed 400000 reads (0 are used for assembly).
[Sat Oct 22 18:43:48 2022] Processed 500000 reads (198 are used for assembly).
[Sat Oct 22 18:43:48 2022] Processed 600000 reads (592 are used for assembly).
[Sat Oct 22 18:43:48 2022] Processed 700000 reads (1041 are used for assembly).
[Sat Oct 22 18:43:48 2022] Processed 800000 reads (1691 are used for assembly).
[Sat Oct 22 18:43:48 2022] Processed 900000 reads (1973 are used for assembly).
[Sat Oct 22 18:43:48 2022] Processed 1000000 reads (5846 are used for assembly).
[Sat Oct 22 18:43:49 2022] Processed 1100000 reads (14708 are used for assembly).
[Sat Oct 22 18:43:50 2022] Processed 1200000 reads (20512 are used for assembly).
[Sat Oct 22 18:43:54 2022] Processed 1300000 reads (39780 are used for assembly).
[Sat Oct 22 18:44:02 2022] Processed 1400000 reads (63267 are used for assembly).
[Sat Oct 22 18:44:21 2022] Processed 1500000 reads (96348 are used for assembly).
[Sat Oct 22 18:44:35 2022] Processed 1600000 reads (118067 are used for assembly).
[Sat Oct 22 18:44:59 2022] Processed 1700000 reads (145635 are used for assembly).
[Sat Oct 22 18:45:41 2022] Processed 1800000 reads (186384 are used for assembly).
[Sat Oct 22 18:46:36 2022] Processed 1900000 reads (226705 are used for assembly).
[Sat Oct 22 18:47:53 2022] Processed 2000000 reads (270498 are used for assembly).
[Sat Oct 22 18:49:32 2022] Processed 2100000 reads (320416 are used for assembly).
[Sat Oct 22 18:51:33 2022] Processed 2200000 reads (374709 are used for assembly).
[Sat Oct 22 18:54:20 2022] Processed 2300000 reads (433091 are used for assembly).
[Sat Oct 22 18:57:36 2022] Processed 2400000 reads (493558 are used for assembly).
[Sat Oct 22 19:01:37 2022] Processed 2500000 reads (559999 are used for assembly).
[Sat Oct 22 19:06:30 2022] Processed 2600000 reads (628998 are used for assembly).
[Sat Oct 22 19:12:06 2022] Processed 2700000 reads (697691 are used for assembly).
[Sat Oct 22 19:18:54 2022] Processed 2800000 reads (766119 are used for assembly).
[Sat Oct 22 19:28:02 2022] Processed 2900000 reads (839536 are used for assembly).
[Sat Oct 22 19:36:46 2022] Processed 3000000 reads (913556 are used for assembly).
[Sat Oct 22 19:43:55 2022] Processed 3100000 reads (987998 are used for assembly).
[Sat Oct 22 19:51:52 2022] Processed 3200000 reads (1062932 are used for assembly)
[Sat Oct 22 20:00:55 2022] Processed 3300000 reads (1139765 are used for assembly).
[Sat Oct 22 20:11:00 2022] Processed 3400000 reads (1217334 are used for assembly).
[Sat Oct 22 20:22:00 2022] Processed 3500000 reads (1292302 are used for assembly).
[Sat Oct 22 20:33:44 2022] Processed 3600000 reads (1366472 are used for assembly).
[Sat Oct 22 20:47:09 2022] Processed 3700000 reads (1444088 are used for assembly).
[Sat Oct 22 21:01:13 2022] Processed 3800000 reads (1521414 are used for assembly).
[Sat Oct 22 21:16:32 2022] Processed 3900000 reads (1599686 are used for assembly).
[Sat Oct 22 21:37:01 2022] Processed 4000000 reads (1677805 are used for assembly).
[Sat Oct 22 21:59:43 2022] Processed 4100000 reads (1756142 are used for assembly).
[Sat Oct 22 22:24:08 2022] Processed 4200000 reads (1834353 are used for assembly).
[Sat Oct 22 22:56:21 2022] Processed 4300000 reads (1912468 are used for assembly).
[Sat Oct 22 23:32:02 2022] Processed 4400000 reads (1990422 are used for assembly).
[Sun Oct 23 00:06:29 2022] Processed 4500000 reads (2069086 are used for assembly).
[Sun Oct 23 00:48:37 2022] Processed 4600000 reads (2148191 are used for assembly).
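
To put numbers on the slowdown, the progress log can be parsed and the elapsed time between consecutive entries computed. The following is a minimal sketch, not part of TRUST4: the helper names `parse_log` and `batch_durations` are made up for illustration, and it assumes the timestamp format shown in the log above and an English/C locale for the day and month names. The sample text is the last three entries of the log.

```python
import re
from datetime import datetime

# Matches entries such as:
# [Sat Oct 22 18:43:47 2022] Processed 100000 reads (0 are used for assembly).
LOG_ENTRY = re.compile(
    r"\[(?P<stamp>[^\]]+)\] Processed (?P<reads>\d+) reads "
    r"\((?P<used>\d+) are used for assembly\)"
)


def parse_log(text):
    """Return (timestamp, reads_processed, reads_used) for each progress entry."""
    entries = []
    for match in LOG_ENTRY.finditer(text):
        stamp = datetime.strptime(match.group("stamp"), "%a %b %d %H:%M:%S %Y")
        entries.append((stamp, int(match.group("reads")), int(match.group("used"))))
    return entries


def batch_durations(entries):
    """Return (reads_processed, reads_used, seconds_since_previous_entry)."""
    return [
        (later[1], later[2], (later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(entries, entries[1:])
    ]


# Last three entries of the log, copied from the report.
sample = (
    "[Sat Oct 22 23:32:02 2022] Processed 4400000 reads (1990422 are used for assembly). "
    "[Sun Oct 23 00:06:29 2022] Processed 4500000 reads (2069086 are used for assembly). "
    "[Sun Oct 23 00:48:37 2022] Processed 4600000 reads (2148191 are used for assembly)."
)

for reads, used, seconds in batch_durations(parse_log(sample)):
    print(f"{reads:>8} reads, {used:>8} used, {seconds:>6.0f} s since previous entry")
```

Because the pattern is applied with `finditer`, it works whether the log has one entry per line or several entries run together on one line.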

mourisl commented 1 year ago

In the assembly process, TRUST4 needs to align each candidate read to more and more assembled contigs, so later batches take longer. The running time should be quite fast for single-cell data, though. Did you run TRUST4 with the "--barcode" option?
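
The scaling described here can be illustrated with a toy cost model. This is only a sketch and not TRUST4's actual algorithm: it assumes every newly used read is compared against every contig that already exists, and that the number of contigs grows in proportion to the reads used so far. The function name and the `reads_per_contig` ratio are hypothetical.

```python
def comparisons_per_batch(cumulative_used, reads_per_contig=10):
    """Toy model: read-vs-contig comparisons needed for each batch.

    cumulative_used  -- running total of reads used for assembly after each batch
    reads_per_contig -- hypothetical ratio of used reads to assembled contigs
    """
    costs = []
    previous_used = 0
    for used in cumulative_used:
        new_reads = used - previous_used
        # Contigs that exist before this batch starts (assumed proportional
        # to the number of reads already used).
        existing_contigs = previous_used // reads_per_contig
        costs.append(new_reads * existing_contigs)
        previous_used = used
    return costs


# Four batches that each contribute the same number of used reads.
costs = comparisons_per_batch([100, 200, 300, 400])
print(costs)
```

Under these assumptions the cost of a batch grows linearly with the number of reads already used, even when each batch adds the same number of reads, so the total running time grows roughly quadratically.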

Note that the sensitivity of assembling CDR3 is quite low for 3' data because the VDJ region is at the 5' end of the mRNA.

mehdiborji commented 1 year ago

I did run it with --barcode option. It took forever so I stopped it.

~/TRUST4/run-trust4 -f ~/TRUST4/hg38_bcrtcr.fa --ref ~/TRUST4/human_IMGT+C.fa -u *_R2_*.fastq.gz --barcode *_R1_*.fastq.gz --barcodeRange 0 15 +

What option should I use here?

My data is very deeply sequenced, and there is also internal priming into the C gene, so quite a few reads hit the CDR3 even though it's 3' capture. MiXCR is able to extract 500 clones from the same data in about 1 hour!

mourisl commented 1 year ago

Another possibility is that R2 contains non-genomic sequences, which would make the underlying assembly very messy. Could you please also add the option "--repseq"? TRUST4 will then try to clean the sequences internally. You can add the option "--stage 1" to reuse the reads extracted in the previous step.

You can also add the option "-t 8" for parallelization. It will help other steps in TRUST4.
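
Put together, the suggested rerun might look like the sketch below, which only assembles and prints the command line without running it. Every flag comes from this thread; the helper name `build_rerun_command` and the FASTQ file names are hypothetical placeholders. It also assumes the output files of the earlier run are still in the working directory, so that "--stage 1" has extracted reads to reuse.

```python
import os
import shlex


def build_rerun_command(r2_fastq, r1_fastq, trust4_dir="~/TRUST4", threads=8):
    """Assemble the run-trust4 argument list with the options suggested above."""
    trust4_dir = os.path.expanduser(trust4_dir)
    return [
        os.path.join(trust4_dir, "run-trust4"),
        "-f", os.path.join(trust4_dir, "hg38_bcrtcr.fa"),
        "--ref", os.path.join(trust4_dir, "human_IMGT+C.fa"),
        "-u", r2_fastq,                       # cDNA reads
        "--barcode", r1_fastq,                # reads carrying the cell barcode
        "--barcodeRange", "0", "15", "+",
        "--repseq",                           # ask TRUST4 to clean the sequences
        "--stage", "1",                       # reuse the previously extracted reads
        "-t", str(threads),                   # parallelization
    ]


# Hypothetical file names standing in for the *_R2_* and *_R1_* globs.
cmd = build_rerun_command("sample_R2_001.fastq.gz", "sample_R1_001.fastq.gz")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once the paths have been checked.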

mourisl commented 1 year ago

Hi @mehdiborji, I just want to check whether the --repseq option helped with the speed issue.