liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Occasional segmentation fault during annotation step #22

Closed: jvivian-atreca closed this issue 4 years ago

jvivian-atreca commented 4 years ago

Hi, thank you for the work on this really interesting tool!

I have a set of RNA-seq samples that I'm iterating over to call TRUST4 on. Oddly, while most samples run end-to-end without any issue, several fail during the annotation step with a segmentation fault. That is a rather opaque failure mode, so I don't have much insight into the cause. Here is the log output for two samples: one that succeeds, followed by one that fails during the annotation step:

Succeeds

[Mon Jul 13 22:37:26 2020] TRUST4 begins.
[Mon Jul 13 22:37:26 2020] SYSTEM CALL: /home/ubuntu/TRUST4/fastq-extractor -1 samples/PC-1-1_1.fq -2 samples/PC-1-1_2.fq -t 16 -f TRUST4/mouse/mouse_IMGT+C.fa -o output/PC-1-1/PC-1-1_toassemble
[Mon Jul 13 22:37:26 2020] Start to extract candidate reads from read files.
[Mon Jul 13 22:43:50 2020] Finish extracting reads.
[Mon Jul 13 22:43:50 2020] SYSTEM CALL: /home/ubuntu/TRUST4/trust4  -f TRUST4/mouse/mouse_IMGT+C.fa -o output/PC-1-1/PC-1-1 -t 16 -1 output/PC-1-1/PC-1-1_toassemble_1.fq -2 output/PC-1-1/PC-1-1_toassemble_2.fq
[Mon Jul 13 22:43:53 2020] Read in and count kmers for 100000 reads.
[Mon Jul 13 22:43:55 2020] Read in and count kmers for 200000 reads.
[Mon Jul 13 22:43:58 2020] Read in and count kmers for 300000 reads.
[Mon Jul 13 22:44:01 2020] Read in and count kmers for 400000 reads.
[Mon Jul 13 22:44:11 2020] Found 406533 reads.
[Mon Jul 13 22:44:14 2020] Finish sorting the reads.
[Mon Jul 13 22:44:19 2020] Finish rough annotations.
[Mon Jul 13 22:44:19 2020] Processed 100000 reads (19559 are used for assembly).
[Mon Jul 13 22:44:23 2020] Processed 200000 reads (30992 are used for assembly).
[Mon Jul 13 22:44:34 2020] Processed 300000 reads (39967 are used for assembly).
[Mon Jul 13 22:44:45 2020] Processed 400000 reads (44026 are used for assembly).
[Mon Jul 13 22:44:45 2020] Assembled 44044 reads.
[Mon Jul 13 22:44:45 2020] Try to rescue 12906 reads for assembly.
[Mon Jul 13 22:44:49 2020] Rescued 226 reads.
[Mon Jul 13 22:44:50 2020] Extend assemblies by mate pair information.
[Mon Jul 13 22:44:51 2020] Remove redundant assemblies.
[Mon Jul 13 22:44:52 2020] Finish assembly.
[Mon Jul 13 22:44:52 2020] SYSTEM CALL: /home/ubuntu/TRUST4/annotator -f TRUST4/mouse/mouse_IMGT+C.fa -a output/PC-1-1/PC-1-1_final.out -t 16 -o output/PC-1-1/PC-1-1  -r output/PC-1-1/PC-1-1_assembled_reads.fa > output/PC-1-1/PC-1-1_annot.fa
[Mon Jul 13 22:44:52 2020] Start to annotate assemblies.
[Mon Jul 13 22:44:53 2020] Start to realign reads for CDR3 analysis.
[Mon Jul 13 22:44:54 2020] Compute CDR3 abundance.
[Mon Jul 13 22:44:54 2020] Finish annotation.
[Mon Jul 13 22:44:54 2020] SYSTEM CALL: perl /home/ubuntu/TRUST4/trust-simplerep.pl output/PC-1-1/PC-1-1_cdr3.out  > output/PC-1-1/PC-1-1_report.tsv
[Mon Jul 13 22:44:54 2020] TRUST4 finishes.

Fails

[Mon Jul 13 22:44:54 2020] TRUST4 begins.
[Mon Jul 13 22:44:54 2020] SYSTEM CALL: /home/ubuntu/TRUST4/fastq-extractor -1 samples/PC-1-6_1.fq -2 samples/PC-1-6_2.fq -t 16 -f TRUST4/mouse/mouse_IMGT+C.fa -o output/PC-1-6/PC-1-6_toassemble
[Mon Jul 13 22:44:54 2020] Start to extract candidate reads from read files.
[Mon Jul 13 22:51:25 2020] Finish extracting reads.
[Mon Jul 13 22:51:25 2020] SYSTEM CALL: /home/ubuntu/TRUST4/trust4  -f TRUST4/mouse/mouse_IMGT+C.fa -o output/PC-1-6/PC-1-6 -t 16 -1 output/PC-1-6/PC-1-6_toassemble_1.fq -2 output/PC-1-6/PC-1-6_toassemble_2.fq
[Mon Jul 13 22:51:27 2020] Read in and count kmers for 100000 reads.
[Mon Jul 13 22:51:30 2020] Read in and count kmers for 200000 reads.
[Mon Jul 13 22:51:33 2020] Read in and count kmers for 300000 reads.
[Mon Jul 13 22:51:36 2020] Read in and count kmers for 400000 reads.
[Mon Jul 13 22:51:39 2020] Read in and count kmers for 500000 reads.
[Mon Jul 13 22:51:58 2020] Found 568504 reads.
[Mon Jul 13 22:52:01 2020] Finish sorting the reads.
[Mon Jul 13 22:52:08 2020] Finish rough annotations.
[Mon Jul 13 22:52:08 2020] Processed 100000 reads (25516 are used for assembly).
[Mon Jul 13 22:52:09 2020] Processed 200000 reads (61138 are used for assembly).
[Mon Jul 13 22:52:16 2020] Processed 300000 reads (82026 are used for assembly).
[Mon Jul 13 22:52:25 2020] Processed 400000 reads (94354 are used for assembly).
[Mon Jul 13 22:52:49 2020] Processed 500000 reads (107588 are used for assembly).
[Mon Jul 13 22:53:01 2020] Assembled 110498 reads.
[Mon Jul 13 22:53:01 2020] Try to rescue 11402 reads for assembly.
[Mon Jul 13 22:53:06 2020] Rescued 584 reads.
[Mon Jul 13 22:53:08 2020] Extend assemblies by mate pair information.
[Mon Jul 13 22:53:10 2020] Remove redundant assemblies.
[Mon Jul 13 22:53:11 2020] Finish assembly.
[Mon Jul 13 22:53:11 2020] SYSTEM CALL: /home/ubuntu/TRUST4/annotator -f TRUST4/mouse/mouse_IMGT+C.fa -a output/PC-1-6/PC-1-6_final.out -t 16 -o output/PC-1-6/PC-1-6  -r output/PC-1-6/PC-1-6_assembled_reads.fa > output/PC-1-6/PC-1-6_annot.fa
[Mon Jul 13 22:53:11 2020] Start to annotate assemblies.
Segmentation fault (core dumped)
system /home/ubuntu/TRUST4/annotator -f TRUST4/mouse/mouse_IMGT+C.fa -a output/PC-1-6/PC-1-6_final.out -t 16 -o output/PC-1-6/PC-1-6  -r output/PC-1-6/PC-1-6_assembled_reads.fa > output/PC-1-6/PC-1-6_annot.fa failed: 35584 at ./TRUST4/run-trust4 line 37.
samples/PC-2-3
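The `failed: 35584` in the last log line looks like a raw wait status rather than an exit code. Here is a minimal sketch of decoding it, assuming it is the value of Perl's `$?` (exit code in the high byte) and that the shell reports a signal-killed command as 128 plus the signal number. `decode_status` is a hypothetical helper name, not part of TRUST4.

```bash
#!/usr/bin/env bash
# Decode a raw wait status such as the 35584 printed by run-trust4.
# Assumption: the number is Perl's $? (exit code stored in the high byte).
decode_status() {
    status=$1
    exit_code=$(( status >> 8 ))
    if [ "$exit_code" -gt 128 ]; then
        # Shell convention: 128 + N means "killed by signal N".
        sig=$(( exit_code - 128 ))
        echo "exit=${exit_code} signal=${sig} ($(kill -l "$sig"))"
    else
        echo "exit=${exit_code}"
    fi
}

decode_status 35584
```

Under those assumptions, 35584 >> 8 = 139 = 128 + 11, and signal 11 is SIGSEGV, which matches the "Segmentation fault (core dumped)" line just above it.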

TRUST4 Script

#!/usr/bin/env bash
proc=`nproc`
git clone https://github.com/liulab-dfci/TRUST4.git
mkdir -p output
for i in $(ls samples/*.fq | rev | cut -c 6- | rev | uniq)
    do
        echo ${i}
        base=`basename ${i}`
        mkdir -p output/${base}
        ./TRUST4/run-trust4 \
            -1 ${i}_1.fq \
            -2 ${i}_2.fq \
            -f TRUST4/mouse/mouse_IMGT+C.fa \
            --ref TRUST4/mouse/mouse_IMGT+C.fa \
            -o output/${base}/${base} \
            -t ${proc}
    done

These samples are all processed the exact same way, so the intermittent failure is somewhat puzzling to me. I will post an update if I figure out how to get the annotation step to run successfully.
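For batches where only some samples fail, it may help to log each failure as it happens. The sketch below uses the same `run-trust4` flags as the script above. It assumes reads are named `<sample>_1.fq` / `<sample>_2.fq` under `samples/`, and that `run-trust4` exits non-zero when a step fails (its error message suggests it dies). `sample_prefix`, `run_all`, and `failed_samples.txt` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch only: loop over samples and log the ones where run-trust4 fails.

# Strip the "_1.fq" suffix to get the prefix shared by both mates.
sample_prefix() {
    printf '%s\n' "${1%_1.fq}"
}

run_all() {
    threads=$(nproc)
    ref=TRUST4/mouse/mouse_IMGT+C.fa
    for r1 in samples/*_1.fq; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        prefix=$(sample_prefix "$r1")
        base=$(basename "$prefix")
        mkdir -p "output/${base}"
        if ! ./TRUST4/run-trust4 \
                -1 "${prefix}_1.fq" -2 "${prefix}_2.fq" \
                -f "$ref" --ref "$ref" \
                -o "output/${base}/${base}" -t "$threads"; then
            # Hypothetical log file listing the samples to re-run.
            echo "$base" >> failed_samples.txt
        fi
    done
}

# Only run when the input directory is present.
if [ -d samples ]; then
    run_all
fi
```

Globbing on `*_1.fq` avoids parsing `ls` output, and quoting the variables keeps the loop safe for paths containing spaces.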

Thank you for your time, John

mourisl commented 4 years ago

Thanks for sharing the results.

Does the file output/PC-1-6/PC-1-6_annot.fa contain any assembly annotations?

jvivian-atreca commented 4 years ago

Hi @mourisl, thank you for your reply. No, for every sample that fails at the annotation step, the _annot.fa file is empty:

-rw-rw-r--  1 ubuntu ubuntu   647831 Jul 13 21:37 DOR-0-3_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   447623 Jul 13 21:45 DOR-0-4_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   369792 Jul 13 21:52 DOR-0-7_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 13 21:59 NC-1-2_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   319042 Jul 13 22:05 NC-1-4_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   295966 Jul 13 22:13 NC-1-5_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   573115 Jul 13 22:20 NC-1-6_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 13 22:29 NC-2-1_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   537733 Jul 13 22:37 NC-2-7_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   819282 Jul 13 22:44 PC-1-1_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 13 22:53 PC-1-6_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 13 23:01 PC-2-3_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   887110 Jul 13 23:09 PC-2-5_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 13 23:16 XP-1-4_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   762502 Jul 13 23:23 XP-1-5_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 13 23:31 XP-1-7_annot.fa
-rw-rw-r--  1 ubuntu ubuntu   842778 Jul 13 23:39 XP-1-8_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 13 23:48 XP-2-1_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 13 23:58 XP-2-3_annot.fa
-rw-rw-r--  1 ubuntu ubuntu        0 Jul 14 01:43 XP-2-5_annot.fa
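The failed samples in a listing like the one above can also be found without scanning it by eye. This is a small sketch that assumes the `output/<sample>/<sample>_annot.fa` layout from the original script. `list_failed` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# List every zero-byte *_annot.fa under a directory (default: output).
# An empty annotation file is used here as a proxy for "annotator crashed".
list_failed() {
    find "${1:-output}" -name '*_annot.fa' -size 0 | sort
}

# Only run when the output directory is present.
if [ -d output ]; then
    list_failed output
fi
```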
mourisl commented 4 years ago

Can you share one of the _final.out files that failed to generate the _annot.fa file? Thank you.

jvivian-atreca commented 4 years ago

Hi @mourisl,

Thank you. I'm checking with my supervisor about sharing the sequences, so I will let you know shortly.

While looking at the files, I noticed that the _final.out files of all the failed samples were >= 4.4 MB:

-rw-rw-r--  1 ubuntu ubuntu 2.9M Jul 13 21:37 DOR-0-3_final.out
-rw-rw-r--  1 ubuntu ubuntu 2.0M Jul 13 21:45 DOR-0-4_final.out
-rw-rw-r--  1 ubuntu ubuntu 1.7M Jul 13 21:52 DOR-0-7_final.out
-rw-rw-r--  1 ubuntu ubuntu 4.8M Jul 13 21:59 NC-1-2_final.out   <=======
-rw-rw-r--  1 ubuntu ubuntu 1.4M Jul 13 22:05 NC-1-4_final.out
-rw-rw-r--  1 ubuntu ubuntu 1.4M Jul 13 22:13 NC-1-5_final.out
-rw-rw-r--  1 ubuntu ubuntu 2.6M Jul 13 22:20 NC-1-6_final.out
-rw-rw-r--  1 ubuntu ubuntu 6.2M Jul 13 22:29 NC-2-1_final.out   <=======
-rw-rw-r--  1 ubuntu ubuntu 2.5M Jul 13 22:37 NC-2-7_final.out
-rw-rw-r--  1 ubuntu ubuntu 3.7M Jul 13 22:44 PC-1-1_final.out
-rw-rw-r--  1 ubuntu ubuntu 7.2M Jul 13 22:53 PC-1-6_final.out   <=======
-rw-rw-r--  1 ubuntu ubuntu 8.4M Jul 13 23:01 PC-2-3_final.out   <=======
-rw-rw-r--  1 ubuntu ubuntu 4.0M Jul 13 23:09 PC-2-5_final.out
-rw-rw-r--  1 ubuntu ubuntu 4.4M Jul 13 23:16 XP-1-4_final.out   <=======
-rw-rw-r--  1 ubuntu ubuntu 3.4M Jul 13 23:23 XP-1-5_final.out
-rw-rw-r--  1 ubuntu ubuntu 5.1M Jul 13 23:31 XP-1-7_final.out   <=======
-rw-rw-r--  1 ubuntu ubuntu 3.8M Jul 13 23:39 XP-1-8_final.out
-rw-rw-r--  1 ubuntu ubuntu 7.4M Jul 13 23:48 XP-2-1_final.out   <=======
-rw-rw-r--  1 ubuntu ubuntu 8.5M Jul 13 23:58 XP-2-3_final.out   <=======
-rw-rw-r--  1 ubuntu ubuntu 6.6M Jul 14 01:43 XP-2-5_final.out   <=======
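The size pattern marked with arrows above can be tabulated directly. The sketch below prints each `_final.out` size in bytes next to whether the matching `_annot.fa` is non-empty. It assumes both files share the same prefix in the same directory. `size_vs_status` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# For every *_final.out, print: size in bytes, ok/FAILED, file name.
# "FAILED" means the sibling *_annot.fa is missing or empty.
size_vs_status() {
    dir=${1:-output}
    find "$dir" -name '*_final.out' | sort | while IFS= read -r final; do
        annot="${final%_final.out}_annot.fa"
        bytes=$(wc -c < "$final" | tr -d ' ')
        if [ -s "$annot" ]; then status=ok; else status=FAILED; fi
        printf '%s\t%s\t%s\n' "$bytes" "$status" "$(basename "$final")"
    done
}

# Only run when the output directory is present.
if [ -d output ]; then
    size_vs_status output | sort -n
fi
```

Sorting numerically on the first column makes any size threshold between passing and failing samples easy to spot.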

I then took one of the failed samples and cut it in half:

(base) ubuntu@ip-10-200-0-18:~$ wc -l output/NC-1-2/NC-1-2_final.out
12900 output/NC-1-2/NC-1-2_final.out

head -n 6450 output/NC-1-2/NC-1-2_final.out >output/NC-1-2/foo_final.out

(base) ubuntu@ip-10-200-0-18:~$ ./TRUST4/annotator -f TRUST4/mouse/mouse_IMGT+C.fa -a output/NC-1-2/foo_final.out -t 16 -o output/NC-1-2/foo -r output/NC-1-2/NC-1-2_assembled_reads.fa > output/NC-1-2/foo_annot.fa
[Tue Jul 14 20:33:18 2020] Start to annotate assemblies.
[Tue Jul 14 20:33:18 2020] Start to realign reads for CDR3 analysis.
[Tue Jul 14 20:33:19 2020] Compute CDR3 abundance.
[Tue Jul 14 20:33:19 2020] Finish annotation.

This also worked if I tailed the second half of the file. This machine has 32 GB of RAM, so it's not an issue with memory... Any thoughts?

mourisl commented 4 years ago

Interesting. The annotation shouldn't take much memory, so memory should not be the issue. Can you try running it with a single thread (-t 1)?

jvivian-atreca commented 4 years ago

Hi @mourisl — I should have included that in the previous comment, but I tried that next and it still failed. I don't know Perl or I would try to take a look, but is it possible there's a fixed-size data structure that is being overflowed?

(base) ubuntu@ip-10-200-0-18:~$ ./TRUST4/annotator -f TRUST4/mouse/mouse_IMGT+C.fa -a output/NC-1-2/NC-1-2_final.out -t 1 -o output/NC-1-2/foo -r output/NC-1-2/NC-1-2_assembled_reads.fa > output/NC-1-2/foo_annot.fa
[Tue Jul 14 20:40:33 2020] Start to annotate assemblies.
Segmentation fault (core dumped)
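A bare "Segmentation fault" says little about where the crash happens, and a backtrace can narrow it down. This is a minimal sketch that assumes `gdb` is installed. Without debug symbols in the binary the trace may show little more than addresses. `annotator_backtrace` is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
# Run a command under gdb in batch mode and print a backtrace if it crashes.
# The program's stdout is discarded so that only gdb's report is shown.
annotator_backtrace() {
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found" >&2
        return 127
    fi
    gdb -batch -ex 'run > /dev/null' -ex bt --args "$@"
}

# Example, using the paths from the comment above:
# annotator_backtrace ./TRUST4/annotator -f TRUST4/mouse/mouse_IMGT+C.fa \
#     -a output/NC-1-2/NC-1-2_final.out -t 1 -o output/NC-1-2/foo
```

Attaching the resulting stack trace to the issue would give the maintainer a starting point even before the input file can be shared.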
jvivian-atreca commented 4 years ago

Hi @mourisl — I got permission to share one of the files in case it is helpful for debugging: NC-1-2_final.out

mourisl commented 4 years ago

Thank you! I'll check this right away!

mourisl commented 4 years ago

It finished successfully on my computer. Can you try running it without the "-r" option? If that also fails, my guess is that the executable I uploaded is not fully compatible with your system. In that case, you can try the Singularity image in the release (similar to Docker, but it does not require root permission). There is a brief introduction to Singularity in the README. I created the image on Ubuntu, and Singularity is fairly straightforward to use.

jvivian-atreca commented 4 years ago

Hi @mourisl — Thank you for taking the time to run that file. I'm on an Ubuntu EC2 machine, but will try both removing the -r option as well as the singularity container and will report back.

jvivian-atreca commented 4 years ago

Hi @mourisl,

I'm running into the same issue using the singularity container:

(base) ubuntu@ip-10-200-0-18:~$ singularity exec trust4-singularity.sif /TRUST4/run-trust4 -1 samples/NC-1-2_1.fq -2 samples/NC-1-2_2.fq -f TRUST4/mouse/mouse_IMGT+C.fa --ref TRUST4/mouse/mouse_IMGT+C.fa -o output/NC-1-2/NC-1-2 -t 16
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
[Tue Jul 14 21:41:14 2020] TRUST4 begins.
[Tue Jul 14 21:41:14 2020] SYSTEM CALL: /TRUST4/fastq-extractor -1 samples/NC-1-2_1.fq -2 samples/NC-1-2_2.fq -t 16 -f TRUST4/mouse/mouse_IMGT+C.fa -o output/NC-1-2/NC-1-2_toassemble
[Tue Jul 14 21:41:14 2020] Start to extract candidate reads from read files.
[Tue Jul 14 21:47:01 2020] Finish extracting reads.
[Tue Jul 14 21:47:01 2020] SYSTEM CALL: /TRUST4/trust4  -f TRUST4/mouse/mouse_IMGT+C.fa -o output/NC-1-2/NC-1-2 -t 16 -1 output/NC-1-2/NC-1-2_toassemble_1.fq -2 output/NC-1-2/NC-1-2_toassemble_2.fq
[Tue Jul 14 21:47:02 2020] Read in and count kmers for 100000 reads.
[Tue Jul 14 21:47:05 2020] Read in and count kmers for 200000 reads.
[Tue Jul 14 21:47:07 2020] Read in and count kmers for 300000 reads.
[Tue Jul 14 21:47:10 2020] Read in and count kmers for 400000 reads.
[Tue Jul 14 21:47:22 2020] Found 453867 reads.
[Tue Jul 14 21:47:25 2020] Finish sorting the reads.
[Tue Jul 14 21:47:30 2020] Finish rough annotations.
[Tue Jul 14 21:47:30 2020] Processed 100000 reads (25984 are used for assembly).
[Tue Jul 14 21:47:33 2020] Processed 200000 reads (47874 are used for assembly).
[Tue Jul 14 21:47:41 2020] Processed 300000 reads (59554 are used for assembly).
[Tue Jul 14 21:47:56 2020] Processed 400000 reads (69360 are used for assembly).
[Tue Jul 14 21:48:03 2020] Assembled 71191 reads.
[Tue Jul 14 21:48:03 2020] Try to rescue 9180 reads for assembly.
[Tue Jul 14 21:48:08 2020] Rescued 177 reads.
[Tue Jul 14 21:48:09 2020] Extend assemblies by mate pair information.
[Tue Jul 14 21:48:10 2020] Remove redundant assemblies.
[Tue Jul 14 21:48:11 2020] Finish assembly.
[Tue Jul 14 21:48:11 2020] SYSTEM CALL: /TRUST4/annotator -f TRUST4/mouse/mouse_IMGT+C.fa -a output/NC-1-2/NC-1-2_final.out -t 16 -o output/NC-1-2/NC-1-2  -r output/NC-1-2/NC-1-2_assembled_reads.fa > output/NC-1-2/NC-1-2_annot.fa
[Tue Jul 14 21:48:11 2020] Start to annotate assemblies.
Segmentation fault (core dumped)
system /TRUST4/annotator -f TRUST4/mouse/mouse_IMGT+C.fa -a output/NC-1-2/NC-1-2_final.out -t 16 -o output/NC-1-2/NC-1-2  -r output/NC-1-2/NC-1-2_assembled_reads.fa > output/NC-1-2/NC-1-2_annot.fa failed: 35584 at /TRUST4/run-trust4 line 37.

Trying the Singularity container with a minimal command:

(base) ubuntu@ip-10-200-0-18:~$ singularity exec trust4-singularity.sif /TRUST4/annotator -f /TRUST4/mouse/mouse_IMGT+C.fa -a output/NC-1-2/NC-1-2_final.out
[Tue Jul 14 22:27:22 2020] Start to annotate assemblies.
Segmentation fault (core dumped)

Does the container version work for you?

mourisl commented 4 years ago

I think I've figured out the bug. Can you pull the GitHub repo again and give it a try? Thanks.

jvivian-atreca commented 4 years ago

Everything ran without issue — thank you for such a quick patch!

mourisl commented 4 years ago

Thanks for sharing the file! It helped A LOT with the debugging.

ywigelman commented 4 years ago

First, thank you so much for TRUST4. It's a great tool.

I've experienced a similar issue. I'm trying to run TRUST4 on 124 RepSeq RNA samples, starting from paired-end FASTQ files. Most samples ran successfully, but others fail during the annotation step:

(python3.6) yoav@zelda:~$ /home/zel/yoav/TRUST4/annotator -f /home/zel/yoav/TRUST4/mouse/mouse_IMGT+C.fa -a /home/zel/yoav/GBM_data/TCR_seq/TRUST4_all/7_final.out -t 1 -o /home/zel/yoav/GBM_data/TCR_seq/TRUST4_all/7 -r /home/zel/yoav/GBM_data/TCR_seq/TRUST4_all/7_assembled_reads.fa > /home/zel/yoav/GBM_data/TCR_seq/TRUST4_all/7_annot.fa
[Sun Aug 23 12:59:25 2020] Start to annotate assemblies.
[Sun Aug 23 13:12:14 2020] Start to realign reads for CDR3 analysis.
[Sun Aug 23 13:17:00 2020] Realigned 100000 reads.
[Sun Aug 23 13:20:44 2020] Realigned 200000 reads.
[Sun Aug 23 13:24:16 2020] Realigned 300000 reads.
[Sun Aug 23 13:27:29 2020] Realigned 400000 reads.
[Sun Aug 23 13:34:11 2020] Realigned 500000 reads.
[Sun Aug 23 13:35:15 2020] Compute CDR3 abundance.
double free or corruption (!prev)
Aborted (core dumped)

Here's a link to the _final.out file in case it helps: 7_final.out

Thanks in advance for your assistance,

Yoav