liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
280 stars 49 forks source link

Trust4 report less clonotype while sequence depth increase #84

Open wxiang-us opened 3 years ago

wxiang-us commented 3 years ago

Hello,

I was working on a single cell RNA-seq library which contains whole genome reads, not target-enrichment reads for immune profiling. I downsampled the library to 100k, 500k and 2million reads per cell. And found that for the same cell with more reads, the chance that a clonotype was found is actually decreased. For example, a cell could had clonotype identified when it only have 100k reads, but no clonotype identified if analyze the same cell with 500k and 2M reads. I couldn't understand how could this happen. Is this related to trust4's algorithm for consensus generation?

Thanks!

mourisl commented 3 years ago

Is your data smart-seq scRNA-seq data? Could you please show me the commands you used? Thanks.

wxiang-us commented 3 years ago

@mourisl yes, it is smart-seq scRNA-seq data. I was using the default commands ./run_trust4 -1 read1.fastq -2 read2.fastq -f hg38_bcrtcr.fa --ref human_IMGT+C.fa Is there specific parameters best suitable for smart-seq scRNA-seq data? Why the default one give less clonotype with increasing sequence depth? Thanks for your prompt reply!

mourisl commented 3 years ago

Which version of TRUST4 did you use? In the older versions, there was bug with abundance estimation after scaffolding with mate-pair information, which could affect SMART-seq. We provide a script "trust-smartseq.pl" in the repo to process SMART-seq data(https://github.com/liulab-dfci/TRUST4#smart-seq-data), which may save you some time.

One strange thing is that even there is a bug, it should still identify CDR3 sequences. Can you show me a few lines in the _annot.fa file and _cdr3.out file? Thanks.

wxiang-us commented 3 years ago

Trust4 v1.0.4. The test samples are PBMCs, but based on gene expression, we've identified a group of T/B cells. I will pick up a few that should be T/B cells, but don't have CDR3/clonotype identified. Let me get back to you in ~1hr with the _annot.fa file and _cdr3.out file

wxiang-us commented 3 years ago

@mourisl I've added the _anno.fa and _cdr3.out for the same cell but 100k and 500k depth to my data folder please check. Thanks in advance for your help!

mourisl commented 3 years ago

Thanks for providing the files. From those files, from both depths, I can see the clonotype "TGTGCGAGAGATGTGGGGACCGGCCTGACTACACTTTACTACTTTGACTACTGG" for IGH and the corresponding assembly is also full-length. This clonotype should show up in the final report file as well. Could you please check whether the cdr3.out file matched the report.tsv file? You can upload the report.tsv file so I can check it too.

wxiang-us commented 3 years ago

@mourisl Thank you! The tsv files are added.

mourisl commented 3 years ago

I can see the clonotype "TGTGCGAGAGATGTGGGGACCGGCCTGACTACACTTTACTACTTTGACTACTGG" in both report files. Have you post-processed the report files? Or it is another cell have the issue? Thank you.

wxiang-us commented 3 years ago

I've dropped records who are out-of-frame and/or partial, and dropped records whose count equal to 1 (use 2 as cutoff)

mourisl commented 3 years ago

I see. Just want to confirm are the lost clonotypes the low abundant ones, which are not likely to be the clonotype for the cell? Those low abundant clonotypes are usually from sequencing errors, which could be fixed by contig scaffolding or other steps. From what I observe, the real clonotype for the cell is always assembled. In SMART-seq, the filter should pick the most abundant clonotype for each chain instead of removing singleton CDR3s. The trust-smartseq.pl script will pick the representative clonotypes for you.

wxiang-us commented 3 years ago

Thanks! I will use the trust-smartseq.pl script and follow up with you soon. Apperciate your help!

wxiang-us commented 3 years ago

@mourisl Hi, I'm checking details of the tsv files, a question about the freq calculation. I notices the frequency sum up of IG or TR could both be more than 1. For example, in this file 500k tsv The IG freq add up to 2, whilst the TR has a single entry with freq of 1. I thought the IG sum up should be 1, and TR should be 1, and together shouldn't be more than 2. Why in this case total frequency is 3?

wxiang-us commented 3 years ago

@mourisl for usage of script "trust-smartseq.pl", the input files are two read lists: read1_list.txt & read2_list.txt Do you have examples of these files? I guess they list out fastq pairs?

mourisl commented 3 years ago

We computed the frequency for chains sperately. So for IG, there the IGH sums up to 1, the IGK/IGL(light chain) sums to 1. The TR entry probably from noise, and TRB and TRA should sum to 1 respectively.

For the trust-smartseq.pl: yes, each row is for the cell's fastq file.

wxiang-us commented 3 years ago

@mourisl I used the --skipMateExtension, and analyzed with the fastq file I provided. The read count decreased. I don't understand why. I thought skipMateExtension would no longer require pair-read overlap, thus have bigger chance to have more read identify clonotypes. Could you please take a look? Here are the report file cdr3.out annot.fa To clarify, the command was ./run_trust4 -1 read1.fastq -2 read2.fastq -f hg38_bcrtcr.fa --ref human_IMGT+C.fa --skipMateExtension

mourisl commented 3 years ago

The mate extension means scaffolding the contigs with mate-pair information. This will change the results in "trust_final.out", which could lead to different alignment results in the following steps including abundance estimation. The change is very small, so it is fine. We can skip the scaffolding in SMART-seq is because the dominant clonotype is the only real one for a cell and it is usually already nicely assembled, there is no need to extend/fix those low abundant assemblies with mate-pair information to save running time.

wxiang-us commented 3 years ago

@mourisl Thank you! If I understand correct, --skipMateExtension would save processing time, but won't contribute to detection accuracy? If we use bulk samples, not single cells, what parameter setting you would suggest to use?

mourisl commented 3 years ago

Right, for the single-cell data, you can skip the scaffolding step.

The default parameter setting is for bulk samples, so it has already been optimized and no need to change.

wxiang-us commented 3 years ago

@mourisl Thanks very much for your help so far. I tested Trust4 on a bulk library with 3.5million reads, the analysis seems to take a long time, I have to kill it. Do you know what's the possible reason for such long processing time? log info below

process Sample1-1_S2_L001 [Fri Sep 17 11:35:40 2021] TRUST4 begins. [Fri Sep 17 11:35:40 2021] SYSTEM CALL: /tools/TRUST4/fastq-extractor -1 Sample1-1_S2_L001_R1_001.fastq.gz -2 Sample1-1_S2_L001_R2_001.fastq.gz -t 1 -f /tools/TRUST4/hg38_bcrtcr.fa -o Sample1-1_S2_L001_toassemble [Fri Sep 17 11:35:40 2021] Start to extract candidate reads from read files. [Fri Sep 17 11:42:48 2021] Finish extracting reads. [Fri Sep 17 11:42:48 2021] SYSTEM CALL: /tools/TRUST4/trust4 -o Sample1-1_S2_L001 -f /tools/TRUST4/hg38_bcrtcr.fa -1 Sample1-1_S2_L001_toassemble_1.fq -2 Sample1-1_S2_L001_toassemble_2.fq [Fri Sep 17 11:42:51 2021] Read in and count kmers for 100000 reads. [Fri Sep 17 11:42:55 2021] Read in and count kmers for 200000 reads. [Fri Sep 17 11:42:58 2021] Read in and count kmers for 300000 reads. [Fri Sep 17 11:43:02 2021] Read in and count kmers for 400000 reads. [Fri Sep 17 11:43:06 2021] Read in and count kmers for 500000 reads. [Fri Sep 17 11:43:10 2021] Read in and count kmers for 600000 reads. [Fri Sep 17 11:43:14 2021] Read in and count kmers for 700000 reads. [Fri Sep 17 11:43:18 2021] Read in and count kmers for 800000 reads. [Fri Sep 17 11:43:22 2021] Read in and count kmers for 900000 reads. [Fri Sep 17 11:43:26 2021] Read in and count kmers for 1000000 reads. [Fri Sep 17 11:43:30 2021] Read in and count kmers for 1100000 reads. [Fri Sep 17 11:43:35 2021] Read in and count kmers for 1200000 reads. [Fri Sep 17 11:43:39 2021] Read in and count kmers for 1300000 reads. [Fri Sep 17 11:43:44 2021] Read in and count kmers for 1400000 reads. [Fri Sep 17 11:43:48 2021] Read in and count kmers for 1500000 reads. [Fri Sep 17 11:43:53 2021] Read in and count kmers for 1600000 reads. [Fri Sep 17 11:43:57 2021] Read in and count kmers for 1700000 reads. [Fri Sep 17 11:44:02 2021] Read in and count kmers for 1800000 reads. [Fri Sep 17 11:44:07 2021] Read in and count kmers for 1900000 reads. [Fri Sep 17 11:44:12 2021] Read in and count kmers for 2000000 reads. [Fri Sep 17 11:44:16 2021] Read in and count kmers for 2100000 reads. [Fri Sep 17 11:44:22 2021] Read in and count kmers for 2200000 reads. [Fri Sep 17 11:44:27 2021] Read in and count kmers for 2300000 reads. [Fri Sep 17 11:44:31 2021] Read in and count kmers for 2400000 reads. [Fri Sep 17 11:44:36 2021] Read in and count kmers for 2500000 reads. [Fri Sep 17 11:44:41 2021] Read in and count kmers for 2600000 reads. [Fri Sep 17 11:44:46 2021] Read in and count kmers for 2700000 reads. [Fri Sep 17 11:44:51 2021] Read in and count kmers for 2800000 reads. [Fri Sep 17 11:44:56 2021] Read in and count kmers for 2900000 reads. [Fri Sep 17 11:45:01 2021] Read in and count kmers for 3000000 reads. [Fri Sep 17 11:45:07 2021] Read in and count kmers for 3100000 reads. [Fri Sep 17 11:45:13 2021] Read in and count kmers for 3200000 reads. [Fri Sep 17 11:45:19 2021] Read in and count kmers for 3300000 reads. [Fri Sep 17 11:45:25 2021] Read in and count kmers for 3400000 reads. [Fri Sep 17 11:45:30 2021] Read in and count kmers for 3500000 reads. [Fri Sep 17 11:45:36 2021] Read in and count kmers for 3600000 reads. [Fri Sep 17 11:45:42 2021] Read in and count kmers for 3700000 reads. [Fri Sep 17 11:45:48 2021] Read in and count kmers for 3800000 reads. [Fri Sep 17 11:45:53 2021] Read in and count kmers for 3900000 reads. [Fri Sep 17 11:45:59 2021] Read in and count kmers for 4000000 reads. [Fri Sep 17 11:46:05 2021] Read in and count kmers for 4100000 reads. [Fri Sep 17 11:46:11 2021] Read in and count kmers for 4200000 reads. [Fri Sep 17 11:46:18 2021] Read in and count kmers for 4300000 reads. [Fri Sep 17 11:46:23 2021] Read in and count kmers for 4400000 reads. [Fri Sep 17 11:46:29 2021] Read in and count kmers for 4500000 reads. [Fri Sep 17 11:46:35 2021] Read in and count kmers for 4600000 reads. [Fri Sep 17 11:46:41 2021] Read in and count kmers for 4700000 reads. [Fri Sep 17 11:46:47 2021] Read in and count kmers for 4800000 reads. [Fri Sep 17 11:46:53 2021] Read in and count kmers for 4900000 reads. [Fri Sep 17 11:46:59 2021] Read in and count kmers for 5000000 reads. [Fri Sep 17 11:47:05 2021] Read in and count kmers for 5100000 reads. [Fri Sep 17 11:47:11 2021] Read in and count kmers for 5200000 reads. [Fri Sep 17 11:47:17 2021] Read in and count kmers for 5300000 reads. [Fri Sep 17 11:47:23 2021] Read in and count kmers for 5400000 reads. [Fri Sep 17 11:47:30 2021] Read in and count kmers for 5500000 reads. [Fri Sep 17 11:47:36 2021] Read in and count kmers for 5600000 reads. [Fri Sep 17 11:47:42 2021] Read in and count kmers for 5700000 reads. [Fri Sep 17 11:47:48 2021] Read in and count kmers for 5800000 reads. [Fri Sep 17 11:47:54 2021] Read in and count kmers for 5900000 reads. [Fri Sep 17 11:48:01 2021] Read in and count kmers for 6000000 reads. [Fri Sep 17 11:48:07 2021] Read in and count kmers for 6100000 reads. [Fri Sep 17 11:48:13 2021] Read in and count kmers for 6200000 reads. [Fri Sep 17 11:48:19 2021] Read in and count kmers for 6300000 reads. [Fri Sep 17 11:48:26 2021] Read in and count kmers for 6400000 reads. [Fri Sep 17 11:48:32 2021] Read in and count kmers for 6500000 reads. [Fri Sep 17 11:48:38 2021] Read in and count kmers for 6600000 reads. [Fri Sep 17 11:48:44 2021] Read in and count kmers for 6700000 reads. [Fri Sep 17 11:48:51 2021] Read in and count kmers for 6800000 reads. [Fri Sep 17 11:48:57 2021] Read in and count kmers for 6900000 reads. [Fri Sep 17 11:49:03 2021] Read in and count kmers for 7000000 reads. [Fri Sep 17 11:49:09 2021] Read in and count kmers for 7100000 reads. [Fri Sep 17 11:55:12 2021] Found 7058535 reads. [Fri Sep 17 11:55:30 2021] Finish sorting the reads. [Fri Sep 17 12:37:03 2021] Finish rough annotations. [Fri Sep 17 12:37:11 2021] Processed 100000 reads (99978 are used for assembly). [Fri Sep 17 12:37:15 2021] Processed 200000 reads (199950 are used for assembly). [Fri Sep 17 12:37:21 2021] Processed 300000 reads (299837 are used for assembly). [Fri Sep 17 12:37:30 2021] Processed 400000 reads (399469 are used for assembly). [Fri Sep 17 12:37:38 2021] Processed 500000 reads (499413 are used for assembly). [Fri Sep 17 12:37:46 2021] Processed 600000 reads (599381 are used for assembly). [Fri Sep 17 12:37:54 2021] Processed 700000 reads (699356 are used for assembly). [Fri Sep 17 12:38:04 2021] Processed 800000 reads (799338 are used for assembly). [Fri Sep 17 12:38:15 2021] Processed 900000 reads (899293 are used for assembly). [Fri Sep 17 12:38:27 2021] Processed 1000000 reads (999270 are used for assembly). [Fri Sep 17 12:38:40 2021] Processed 1100000 reads (1099247 are used for assembly). [Fri Sep 17 12:38:54 2021] Processed 1200000 reads (1199032 are used for assembly). [Fri Sep 17 12:39:10 2021] Processed 1300000 reads (1298983 are used for assembly). [Fri Sep 17 12:39:27 2021] Processed 1400000 reads (1398901 are used for assembly). [Fri Sep 17 12:39:45 2021] Processed 1500000 reads (1498828 are used for assembly). [Fri Sep 17 12:40:08 2021] Processed 1600000 reads (1598777 are used for assembly). [Fri Sep 17 12:40:28 2021] Processed 1700000 reads (1698693 are used for assembly). [Fri Sep 17 12:40:50 2021] Processed 1800000 reads (1798597 are used for assembly). [Fri Sep 17 12:41:12 2021] Processed 1900000 reads (1898490 are used for assembly). [Fri Sep 17 12:41:35 2021] Processed 2000000 reads (1998075 are used for assembly). [Fri Sep 17 12:42:01 2021] Processed 2100000 reads (2097801 are used for assembly). [Fri Sep 17 12:42:18 2021] Processed 2200000 reads (2197482 are used for assembly). [Fri Sep 17 12:42:35 2021] Processed 2300000 reads (2297076 are used for assembly). [Fri Sep 17 12:42:54 2021] Processed 2400000 reads (2396635 are used for assembly). [Fri Sep 17 12:43:17 2021] Processed 2500000 reads (2496112 are used for assembly). [Fri Sep 17 12:43:42 2021] Processed 2600000 reads (2595318 are used for assembly). [Fri Sep 17 12:44:08 2021] Processed 2700000 reads (2694554 are used for assembly). [Fri Sep 17 12:44:35 2021] Processed 2800000 reads (2793944 are used for assembly). [Fri Sep 17 12:45:09 2021] Processed 2900000 reads (2893255 are used for assembly). [Fri Sep 17 12:45:42 2021] Processed 3000000 reads (2992666 are used for assembly). [Fri Sep 17 12:46:18 2021] Processed 3100000 reads (3091965 are used for assembly). [Fri Sep 17 12:46:57 2021] Processed 3200000 reads (3191364 are used for assembly). [Fri Sep 17 12:47:42 2021] Processed 3300000 reads (3290672 are used for assembly). [Fri Sep 17 12:48:25 2021] Processed 3400000 reads (3390001 are used for assembly). [Fri Sep 17 12:49:19 2021] Processed 3500000 reads (3489281 are used for assembly). [Fri Sep 17 12:50:22 2021] Processed 3600000 reads (3588479 are used for assembly). [Fri Sep 17 12:51:26 2021] Processed 3700000 reads (3687697 are used for assembly). [Fri Sep 17 12:52:29 2021] Processed 3800000 reads (3786843 are used for assembly). [Fri Sep 17 12:53:42 2021] Processed 3900000 reads (3885978 are used for assembly). [Fri Sep 17 12:54:58 2021] Processed 4000000 reads (3985108 are used for assembly). [Fri Sep 17 12:56:20 2021] Processed 4100000 reads (4084226 are used for assembly). [Fri Sep 17 12:57:47 2021] Processed 4200000 reads (4183187 are used for assembly). [Fri Sep 17 12:59:16 2021] Processed 4300000 reads (4282254 are used for assembly). [Fri Sep 17 13:00:57 2021] Processed 4400000 reads (4381491 are used for assembly). [Fri Sep 17 13:02:49 2021] Processed 4500000 reads (4480652 are used for assembly). [Fri Sep 17 13:04:37 2021] Processed 4600000 reads (4579897 are used for assembly). [Fri Sep 17 13:06:50 2021] Processed 4700000 reads (4679044 are used for assembly). [Fri Sep 17 13:08:56 2021] Processed 4800000 reads (4778068 are used for assembly). [Fri Sep 17 13:10:52 2021] Processed 4900000 reads (4877253 are used for assembly). [Fri Sep 17 13:12:40 2021] Processed 5000000 reads (4976344 are used for assembly). [Fri Sep 17 13:14:24 2021] Processed 5100000 reads (5075380 are used for assembly). [Fri Sep 17 13:16:01 2021] Processed 5200000 reads (5174691 are used for assembly). [Fri Sep 17 13:17:41 2021] Processed 5300000 reads (5273659 are used for assembly). [Fri Sep 17 13:19:35 2021] Processed 5400000 reads (5372736 are used for assembly). [Fri Sep 17 13:21:56 2021] Processed 5500000 reads (5471632 are used for assembly). [Fri Sep 17 13:24:37 2021] Processed 5600000 reads (5570450 are used for assembly). [Fri Sep 17 13:28:17 2021] Processed 5700000 reads (5669612 are used for assembly). [Fri Sep 17 13:34:16 2021] Processed 5800000 reads (5768472 are used for assembly). [Fri Sep 17 13:43:48 2021] Processed 5900000 reads (5867499 are used for assembly). [Fri Sep 17 13:57:07 2021] Processed 6000000 reads (5966298 are used for assembly). [Fri Sep 17 14:15:02 2021] Processed 6100000 reads (6065356 are used for assembly). [Fri Sep 17 14:36:52 2021] Processed 6200000 reads (6164045 are used for assembly). [Fri Sep 17 15:07:54 2021] Processed 6300000 reads (6262443 are used for assembly). [Fri Sep 17 15:46:38 2021] Processed 6400000 reads (6362442 are used for assembly). [Fri Sep 17 17:26:50 2021] Processed 6500000 reads (6459956 are used for assembly). [Fri Sep 17 19:44:11 2021] Processed 6600000 reads (6559954 are used for assembly). [Fri Sep 17 23:01:21 2021] Processed 6700000 reads (6659949 are used for assembly). [Sat Sep 18 04:03:11 2021] Processed 6800000 reads (6759446 are used for assembly). [Sat Sep 18 08:56:36 2021] Processed 6900000 reads (6857285 are used for assembly). [Sat Sep 18 11:09:36 2021] Processed 7000000 reads (6955708 are used for assembly). [Sat Sep 18 11:24:07 2021] Assembled 7001153 reads. [Sat Sep 18 11:24:07 2021] Try to rescue 0 reads for assembly. [Sat Sep 18 11:24:08 2021] Rescued 0 reads. [Sat Sep 18 12:38:45 2021] Processed 100000 reads for extension. [Sat Sep 18 14:01:46 2021] Processed 200000 reads for extension. [Sat Sep 18 15:29:48 2021] Processed 300000 reads for extension. [Sat Sep 18 16:56:27 2021] Processed 400000 reads for extension. [Sat Sep 18 18:20:46 2021] Processed 500000 reads for extension. [Sat Sep 18 19:51:23 2021] Processed 600000 reads for extension. [Sat Sep 18 21:16:02 2021] Processed 700000 reads for extension. [Sat Sep 18 22:41:34 2021] Processed 800000 reads for extension. [Sun Sep 19 00:09:26 2021] Processed 900000 reads for extension. [Sun Sep 19 01:36:09 2021] Processed 1000000 reads for extension. [Sun Sep 19 03:03:17 2021] Processed 1100000 reads for extension. [Sun Sep 19 04:27:38 2021] Processed 1200000 reads for extension. [Sun Sep 19 06:02:43 2021] Processed 1300000 reads for extension. [Sun Sep 19 07:31:22 2021] Processed 1400000 reads for extension. [Sun Sep 19 08:58:56 2021] Processed 1500000 reads for extension. [Sun Sep 19 10:35:15 2021] Processed 1600000 reads for extension. [Sun Sep 19 12:05:39 2021] Processed 1700000 reads for extension. [Sun Sep 19 13:37:01 2021] Processed 1800000 reads for extension. [Sun Sep 19 15:19:22 2021] Processed 1900000 reads for extension. [Sun Sep 19 16:52:24 2021] Processed 2000000 reads for extension. [Sun Sep 19 18:28:40 2021] Processed 2100000 reads for extension. [Sun Sep 19 19:59:32 2021] Processed 2200000 reads for extension. [Sun Sep 19 21:28:09 2021] Processed 2300000 reads for extension. [Sun Sep 19 22:59:41 2021] Processed 2400000 reads for extension. [Mon Sep 20 00:23:33 2021] Processed 2500000 reads for extension. [Mon Sep 20 01:42:56 2021] Processed 2600000 reads for extension. [Mon Sep 20 03:02:33 2021] Processed 2700000 reads for extension. [Mon Sep 20 04:20:41 2021] Processed 2800000 reads for extension. [Mon Sep 20 05:46:09 2021] Processed 2900000 reads for extension. [Mon Sep 20 07:06:43 2021] Processed 3000000 reads for extension. [Mon Sep 20 08:25:10 2021] Processed 3100000 reads for extension. [Mon Sep 20 09:46:26 2021] Processed 3200000 reads for extension. [Mon Sep 20 11:10:19 2021] Processed 3300000 reads for extension. [Mon Sep 20 12:32:42 2021] Processed 3400000 reads for extension. [Mon Sep 20 13:50:44 2021] Processed 3500000 reads for extension. [Mon Sep 20 15:12:39 2021] Processed 3600000 reads for extension. [Mon Sep 20 16:30:15 2021] Processed 3700000 reads for extension. [Mon Sep 20 17:43:52 2021] Processed 3800000 reads for extension. [Mon Sep 20 19:03:47 2021] Processed 3900000 reads for extension. [Mon Sep 20 20:19:06 2021] Processed 4000000 reads for extension. [Mon Sep 20 21:35:07 2021] Processed 4100000 reads for extension. [Mon Sep 20 22:54:59 2021] Processed 4200000 reads for extension. [Tue Sep 21 00:07:14 2021] Processed 4300000 reads for extension. [Tue Sep 21 01:27:10 2021] Processed 4400000 reads for extension. [Tue Sep 21 02:45:08 2021] Processed 4500000 reads for extension. [Tue Sep 21 03:59:03 2021] Processed 4600000 reads for extension. [Tue Sep 21 05:19:32 2021] Processed 4700000 reads for extension. [Tue Sep 21 06:42:05 2021] Processed 4800000 reads for extension. [Tue Sep 21 08:06:19 2021] Processed 4900000 reads for extension. [Tue Sep 21 09:31:01 2021] Processed 5000000 reads for extension. [Tue Sep 21 11:02:20 2021] Processed 5100000 reads for extension. [Tue Sep 21 12:28:48 2021] Processed 5200000 reads for extension. [Tue Sep 21 13:56:51 2021] Processed 5300000 reads for extension. [Tue Sep 21 15:38:38 2021] Processed 5400000 reads for extension. [Tue Sep 21 17:30:27 2021] Processed 5500000 reads for extension. [Tue Sep 21 19:41:21 2021] Processed 5600000 reads for extension. [Tue Sep 21 22:36:37 2021] Processed 5700000 reads for extension. [Wed Sep 22 03:01:35 2021] Processed 5800000 reads for extension. [Wed Sep 22 07:29:30 2021] Processed 5900000 reads for extension.

mourisl commented 3 years ago

Is this TCR-seq or BCR-seq data? For these TCR/BCR amplified bulk data set, you may want to try the option --repseq for faster processing. It looks like you are running TRUST4 with a single thread, you can try multi-threading (-t N) to further increase the speed.

wxiang-us commented 3 years ago

@mourisl It's BCR-seq. I will try both --repseq and -t. Thanks!

wxiang-us commented 3 years ago

@mourisl should the command be ./run_trust4 -1 read1.fastq -2 read2.fastq -f hg38_bcrtcr.fa --ref human_IMGT+C.fa --repseq BCR-seq -t 30 It returned error saying "Unknown parameter BCR-seq". I'm using the latest verstion TRUST4-1.0.5.1

mourisl commented 3 years ago

You can run it with ./run_trust4 -1 read1.fastq -2 read2.fastq -f hg38_bcrtcr.fa --ref human_IMGT+C.fa --repseq -t 30 . (--repseq is a flag, it does not need parameter).

wxiang-us commented 3 years ago

@mourisl Thank you! Why --repseq makes analysis faster? What algorithm was changed/incorporated? In addition, I found that the latest version "Fix a serious bug in the run-trust4 wrapper in the original v1.0.5 release for single-cell data." Is this bug in v1.0.4? I had completed quite some analysis with v1.0.4, just wondering if there is need to rerun. Thanks!

mourisl commented 3 years ago
  1. --repseq will trim the read ends when TRUST4 found sequences might be outside V, J area. Furthermore, internally, it will only consider overlaps with compatible V genes and J genes. Therefore, with "repseq" option, the read-contig overlaps are much less, with a tradeoff of slightly less sensitivity.
  2. The bug is not in v1.0.4. The bug was that TRUST4 did not generate the cdr3 file and report files given the --barcode option.
wxiang-us commented 3 years ago

@mourisl Thanks again for the clarification!

wxiang-us commented 3 years ago

@mourisl I'm trying to visualize the Trust4 alignment and assembling results by matching with the germline sequences. But couldn't figure out a way to use sequences & scores in trust_cdr3.out & trust_annot.fa for visualization purpose. Do you have suggestion for tools I could use? Thanks again!

mourisl commented 3 years ago

You can use IMGT website or IgBLAST to align the sequences from trust_annot.fa to the germline database.

The "./annotator" program in TRUST4 also have the option "--geneAlignment". When using this option, in the sequence header of the trust_annot.fa it will also output 3 fields for the alignment on V, J, C regions. For each part, "=" represent match, capital nucleotide indicate the nucleotide in the assembly is different than the germline. "\" indicate deletion on the assembly and lower-case indicates the inserted nucleotides in the assembly. The alignment are between the ranges shown in the headers of trust_annot.fa.