lemene / PECAT

PECAT, a phased error correct and assembly tool
BSD 2-Clause "Simplified" License

Seg fault in correcting reads #20

Open drk0311 opened 8 months ago

drk0311 commented 8 months ago

I tried it with a 700 Mbp genome and got fantastic results! Thank you!!

Now I'm working on a plant genome of ~4 Gbp and got a segmentation fault. If it is using too much memory, is there a way to reduce usage?

The error message is below:

/data/grc1/KanAoi/C1/PECAT/scripts/crr0_correct_1.sh: line 16: 1914431 Segmentation fault (core dumped) /home/naitok785/miniconda3/envs/pecat/share/pecat-0.0.3-0/bin/fsa_rd_correct /data/grc1/KanAoi/C1/PECAT/1-correct/0/rd2rd.txt.sub.1.paf /data/grc1/KanAoi/C1/PECAT/0-prepare/prepared_reads.fasta /data/grc1/KanAoi/C1/PECAT/1-correct/0/corrected_reads_0.fasta.1 --output_directory=/data/grc1/KanAoi/C1/PECAT/1-correct/0 --thread_size=36 --read_name_fname=/data/grc1/KanAoi/C1/PECAT/1-correct/0/readname.core.1 --infos_fname /data/grc1/KanAoi/C1/PECAT/1-correct/0/corrected_reads_0.fasta.1.infos --score=weight:lc=15 --aligner edlib:bs=1000:mc=6 --min_coverage 4 --filter1 oh=1000:ohr=0.01 --candidate n=600:f=30 --min_identity 90 --min_local_identity 80 Plgd script end: 2023-12-13 11:03:58

The cfgfile is below:

project=PECAT
reads=./KanAoi_C1.fastq.gz
genome_size=4500000000
threads=36
cleanup=1
grid=local
prep_min_length=3000
prep_output_coverage=40
corr_iterate_number=1
corr_block_size=4000000000
corr_correct_options=--score=weight:lc=15 --aligner edlib:bs=1000:mc=6  --min_coverage 4 --filter1 oh=1000:ohr=0.01 --candidate n=600:f=30 --min_identity 90 --min_local_identity 80
corr_filter_options=--filter0=l=5000:al=2500:alr=0.5:aal=8000:oh=3000:ohr=0.3
corr_rd2rd_options=-x ava-ont -f 0.005 -I 10G
corr_output_coverage=40
align_block_size=12000000000
align_rd2rd_options=-X -g3000 -w30 -k19 -m100 -r500 -I 10G -f 0.002
align_filter_options=--filter0=l=5000:aal=6000:aalr=0.5:oh=3000:ohr=0.3 --task=extend --filter1=oh=300:ohr=0.03 --min_identity 0.95
asm1_assemble_options=--max_trivial_length 10000 --min_identity 0.95 --min_coverage 4 --min_contig_length 10000
phase_method=2
phase_rd2ctg_options=-x map-ont -c -p 0.5 -w 10 -k 19 -r 1000 -I 10G -K 8G
phase_use_reads=1
phase_phase_options= --coverage lc=10 --phase_options icr=0.1:icc=8:sc=10
phase_filter_options=--threshold 1000
phase_clair3_command=singularity exec --containall -B `pwd -P`:`pwd -P` clair3_v0.1-r12.sif /opt/bin/run_clair3.sh
phase_clair3_use_reads=0
phase_clair3_rd2ctg_options=-x map-ont -w10 -k19 -c -p 0.5 -r 1000 -I 10G -K 8G
phase_clair3_phase_options=--coverage lc=20 --phase_options icr=0.1:icc=3:sc=10 --filter i=90
phase_clair3_filter_options=--threshold=2500 --rate 0.05
phase_clair3_options=--platform=ont --model_path=/opt/models/r1041_e82_400bps_sup_v420 --include_all_ctgs
asm2_assemble_options=--reducer0 "best:cmp=2,0.1,0.1|phase:sc=3" --contig_format prialt,dual
polish_map_options=-x map-ont -w10 -k19 -I 10g -K 8G -a
polish_filter_options=--filter0 oh=2000:ohr=0.2:i=96
polish_cns_options=
lemene commented 8 months ago

hi, @drk0311. It may be out of memory. Can you show the size of files in the directory 1-correct/0? Blocks can be divided into smaller ones to reduce memory usage, for example corr_block_size=1000000000. You need to delete the directory 1-correct/0 first before running pecat again.
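The suggested fix can be sketched as follows; the `cfgfile` here is a toy stand-in created for illustration, not the real project config:

```shell
# Toy stand-in for the corr_block_size line in the real cfgfile:
printf 'corr_block_size=4000000000\n' > cfgfile
# Shrink the correction block size as suggested above; remember to also
# delete the stale 1-correct/0 directory before rerunning
# `pecat.pl correct cfgfile`.
sed -i 's/^corr_block_size=.*/corr_block_size=1000000000/' cfgfile
```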

drk0311 commented 8 months ago

Thank you! The PAF files in 1-correct/0 are more than 300 GB, maybe too big, right? I'll try a smaller corr_block_size option and see what happens.

stubbsrl commented 4 months ago

Hello - I had a similar problem, also using a sample with a large genome (6Gb), and using a smaller corr_block_size helped me. However, after solving this first problem I made it to crr0_correct after 187 hours of running and 16Tb of intermediate files, but I ran out of memory again. Is there any way to restart from where I'm at? In the scripts folder it looks like I've completed 86/212 of the crr0_correct_*.sh scripts. If the program doesn't have a built-in way to restart, could I just kick off each of the remaining crr0_correct_*.sh scripts manually? I'm only interested in these correction steps from PECAT and it looks like I'm almost done. I really hope there is a way I can restart.

lemene commented 4 months ago

hi, @stubbsrl PECAT can continue to run from the breakpoint using the same command pecat.pl correct cfg. Using cleanup=1 can reduce intermediate files.

stubbsrl commented 4 months ago

Thank you @lemene! I used both cleanup=1 and adding the parameter -f 0.005 to corr_rd2rd_options and align_rd2rd_options and at the peak I had 16Tb of files. It's now down to 12Tb as the program is using those large files and deleting them.

stubbsrl commented 4 months ago

The analysis made it through 211 of the 212 blocks, but ran out of memory on one block (block #210). I copied rd2rd.txt.sub.210.paf, prepared_reads.fasta, readname.core.210, and the crr0_correct_210.sh script over to a computer with 500 GB of memory. I changed the paths in the script and kicked off crr0_correct_210.sh. It runs for 1 hour and 45 minutes and still runs out of memory. I looked at the help flag for fsa_rd_correct and, in separate attempts, tried adding the --use_cache flag, reducing --thread_size, and increasing --min_coverage. Any suggestions on how to get this final file corrected? Any other changes I can make to the crr0_correct_210.sh script? Could I split the paf somehow? My plan is, once this paf turns into a fasta, to move all of the #210 files back with the other corrected fastas and finish the analysis. Thanks in advance for any suggestions!

lemene commented 4 months ago

Hi @stubbsrl Reducing --thread_size should be helpful. Moving the files back is feasible, but you need to create the file crr0_correct_210.sh.done and write 0 in it.
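A minimal sketch of marking the block as done, assuming the scripts directory layout from this thread:

```shell
# Create the .done marker so pecat.pl treats block 210 as finished;
# per the suggestion above, the marker contains the exit status 0.
mkdir -p scripts
echo 0 > scripts/crr0_correct_210.sh.done
```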

stubbsrl commented 4 months ago

Hi @lemene thanks for the feedback. I've tried various thread sizes all the way down to --thread_size 20 and I am still getting segmentation faults/core dumps. Any other suggestions?

lemene commented 4 months ago

Hi @stubbsrl You can use /usr/bin/time -v to check whether the segmentation fault is caused by lack of memory. There is another way to reduce memory usage: divide the file --read_name_fname=xxx/1-correct/0/readname.core.210 into multiple files, then manually correct the reads separately and merge the results. PECAT only corrects the reads whose names are in readname.core.xxx, so it doesn't load irrelevant overlaps.
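The splitting step above can be sketched as follows; the readname.core.210 written here is a toy stand-in for the real read-name file, and the chunk size is arbitrary:

```shell
# Toy stand-in for the real readname.core.210 from this thread:
seq 1 10 | sed 's/^/read_/' > readname.core.210
# Split the read-name list into chunks of (here) 4 names each; each chunk
# can then be passed to fsa_rd_correct via --read_name_fname, and the
# per-chunk corrected_reads outputs concatenated afterwards.
split -l 4 -d readname.core.210 readname.core.210.part_
```

With GNU coreutils this produces readname.core.210.part_00, _01, and _02.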

jp-jong commented 4 months ago

Hi! I encountered a similar error here. I'm trying to test this out with a bacterium of just 5.5 Mb genome size but got aborted (core dumped) errors. I'd be very happy to receive your input on this, especially since I'm interested in trying this program on a plant with a ~400 Mb genome.

Here are my options for correction:

project=smarcescens
reads=smarcescens_simplex.filtered.fastq
genome_size=5500000
threads=4
cleanup=1
grid=local
prep_min_length=3000
prep_output_coverage=80
corr_iterate_number=1
corr_block_size=4000000000
corr_filter_options=--filter0=l=5000:al=2500:alr=0.5:aal=5000:oh=1000:ohr=0.1
corr_correct_options=--score=weight:lc=10 --aligner edlib --filter1 oh=1000:ohr=0.01
corr_rd2rd_options=-x ava-ont
corr_output_coverage=80

Here's the terminal log:

2024-05-05 01:32:40 [INFO] Start Correcting
2024-05-05 01:32:40 [INFO] thread size 4, totalsize 12211
2024-05-05 01:32:40 [INFO] done 0, all 12211
free(): double free detected in tcache 2
/data/user/test_assemblers/smarcescens/scripts/crr0_correct_0.sh: line 16: 1473021 Aborted (core dumped) /data/user/miniconda3/envs/pecat/share/pecat-0.0.1-1/bin/fsa_rd_correct /data/user/test_assemblers/smarcescens/1-correct/0/rd2rd.txt.sub.0.paf /data/user/test_assemblers/smarcescens/0-prepare/prepared_reads.fasta /data/user/test_assemblers/smarcescens/1-correct/0/corrected_reads_0.fasta.0 --output_directory=/data/user/test_assemblers/smarcescens/1-correct/0 --thread_size=4 --read_name_fname=/data/user/test_assemblers/smarcescens/1-correct/0/readname.core.0 --infos_fname /data/user/test_assemblers/smarcescens/1-correct/0/corrected_reads_0.fasta.0.infos --score=weight:lc=10 --aligner edlib --filter1 oh=1000:ohr=0.01
Plgd script end: 2024-05-05 01:32:41
2024-05-05 01:32:43 [Error] Failed to run correcting reads 0, crr0

Here's the /usr/bin/time -v output:

Command exited with non-zero status 1
 Command being timed: "pecat.pl correct cfgfile_1"
 User time (seconds): 506.21
 System time (seconds): 9.50
 Percent of CPU this job got: 322%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 2:40.15
 Average shared text size (kbytes): 0
 Average unshared data size (kbytes): 0
 Average stack size (kbytes): 0
 Average total size (kbytes): 0
 Maximum resident set size (kbytes): 1534148
 Average resident set size (kbytes): 0
 Major (requiring I/O) page faults: 0
 Minor (reclaiming a frame) page faults: 4461422
 Voluntary context switches: 67738
 Involuntary context switches: 5650
 Swaps: 0
 File system inputs: 3696
 File system outputs: 1768432
 Socket messages sent: 0
 Socket messages received: 0
 Signals delivered: 0
 Page size (bytes): 4096
 Exit status: 1

lemene commented 4 months ago

Hi @jp-jong From the message free(): double free detected in tcache 2, it seems that this is not a lack of memory but a bug in PECAT. However, I cannot locate it. I suggest you try version v0.0.3.

stubbsrl commented 3 months ago

I checked and I am using v0.0.3. Here is the output using /usr/bin/time -v. Also note I'm just running the shell script for the one paf where I keep getting a segmentation fault, and I made separate readname files as you suggested (you'll see I'm using readname.core1.210). The machine I'm using has 500 GB of memory.

/usr/bin/time -v bash crr0_correct_210_1.sh
Plgd script start: 2024-05-09 21:55:06
2024-05-09 21:55:06 [INFO] Start 
2024-05-09 21:55:06 [INFO] Arguments: 
  ol_fname = /mydir/pecat/rd2rd.txt.sub.210.paf
  rr_fname = /mydir/pecat/prepared_reads.fasta 
  cr_fname = /mydir/pecat/corrected_reads_0.fasta.210
  filter0 = l=2000:i=0.00:al=2000:alr=0.50:oh=10000000:ohr=1.00:aal=10000000:aalr=1.00:ilid=0
  filter1 = l=0:i=0.00:al=0:alr=1.00:oh=1000:ohr=0.01:aal=10000000:aalr=1.00:ilid=0    
  graph_fname =
  infos_fname = /mydir/pecat/corrected_reads_0.fasta.210.infos
  output_directory = /mydir/pecat 
  read_name = 
  read_name_fname = /mydir/pecat/readname.core1.210  
  aligner = edlib:bs=1000:mc=6   
  score = weight:lc=10
  candidate = c=80:n=600:f=30:p=0.95:ohwt=0.1   
  min_identity = 90 
  min_local_identity = 80  
  thread_size = 70  
  min_coverage = 4  
  variants = 
  use_cache = 0
  debug = 0
  debug_flag = 0
  skip_branch_check = 0

2024-05-09 21:55:06 [INFO] Memory: 9124
2024-05-09 22:10:49 [INFO] Load 1896657424 overlaps from file /mydir/pecat/rd2rd.txt.sub.210.paf
2024-05-09 22:16:16 [INFO] Load 18244710 reads from file: /mydir/pecat/prepared_reads.fasta
2024-05-09 22:16:17 [INFO] Memory: 194866120
2024-05-09 22:57:00 [INFO] BuildIndex: memory=224501524
2024-05-09 22:57:00 [INFO] Memory: 224501524
2024-05-09 22:57:00 [INFO] Start Correcting
2024-05-09 22:57:00 [INFO] thread size 70, totalsize 33965
2024-05-09 22:57:00 crr0_correct_210_1.sh: line 16: 1544632 Segmentation fault (core dumped) /mydir/miniconda3/envs/pecat-env/share/pecat-0.0.3-0/bin/fsa_rd_correct /mydir/pecat/rd2rd.txt.sub.210.paf /mydir/pecat/prepared_reads.fasta /mydir/pecat/corrected_reads_0.fasta.210 --output_directory=/mydir/pecat --thread_size=70 --read_name_fname=/mydir/pecat/readname.core1.210 --infos_fname /mydir/pecat/corrected_reads_0.fasta.210.infos --score=weight:lc=10 --aligner edlib:bs=1000:mc=6 --min_coverage 4 --filter1 oh=1000:ohr=0.01 --candidate n=600:f=30 --min_identity 90 --min_local_identity 80
Plgd script end: 2024-05-09 22:57:15
 Command being timed: "bash crr0_correct_210_1.sh"
 User time (seconds): 10257.79
 System time (seconds): 1073.89
 Percent of CPU this job got: 303%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 1:02:08
 Average shared text size (kbytes): 0
 Average unshared data size (kbytes): 0
 Average stack size (kbytes): 0
 Average total size (kbytes): 0
 Maximum resident set size (kbytes): 224502344
 Average resident set size (kbytes): 0
 Major (requiring I/O) page faults: 139
 Minor (reclaiming a frame) page faults: 662291363
 Voluntary context switches: 48761
 Involuntary context switches: 1138676
 Swaps: 0
 File system inputs: 1874387864
 File system outputs: 8
 Socket messages sent: 0
 Socket messages received: 0
 Signals delivered: 0
 Page size (bytes): 4096
 Exit status: 0
lemene commented 3 months ago

@stubbsrl I still think this error is caused by a lack of memory. Is this machine running other programs simultaneously? You can try using fewer threads, such as --thread_size=10, and reducing the number of reads in the file readname.core1.210, such as keeping only 100 reads. If it still fails, please add the parameter --debug, keep only one read in readname.core1.210, and run bash crr0_correct_210_1.sh > debug_log. debug_log will contain more information to help locate the bug.

jp-jong commented 3 months ago

@lemene the v0.0.3 seems to work without errors. Thanks!!

stubbsrl commented 3 months ago

I set the thread_size=10 and also set my script to occupy the entire node. I broke readname.core.210 into 6 different files and ran each one separately. This worked! Thank you for your help!

After it finished I concatenated the 6 corrected_reads_0.fasta.210 into one file and did the same for the .infos file. These concatenated files were moved back to 1-correct/0 with the other corrected reads. I also made a crr0_correct_210.sh.done file in the scripts dir and put a 0 in it.

I ran pecat.pl correct cfgfile and this is the output and error. I must have missed or deleted something, because I think it's gone back a step to splitting read names.

(pecat-env) rstubbs/projects/pecat$ pecat.pl correct cfgfile 
2024-05-29 21:53:12 [Info] Start running job prp_job, preparing reads
2024-05-29 21:53:12 [Info] Skip preparing reads for outputs are newer.
2024-05-29 21:53:12 [Info] Start running job crr_job, correcting reads, crr
2024-05-29 21:53:12 [Info] Start correcting reads, crr.
2024-05-29 21:53:12 [Info] Job::Serial::submit_core crr_job
2024-05-29 21:53:12 [Info] Start running job crr0_correct, correcting rawreads, crr0
2024-05-29 21:53:12 [Info] Start correcting rawreads, crr0.
2024-05-29 21:53:12 [Info] Job::Serial::submit_core crr0_correct
2024-05-29 21:53:12 [Info] Start running job crr0_rd2rd_job, aligning reads to reads, crr0
2024-05-29 21:53:12 [Info] Skip aligning reads to reads, crr0 for outputs are newer.
2024-05-29 21:53:12 [Info] Start running job crr0_split, spliting read names, crr0
2024-05-29 21:53:12 [Info] Start spliting read names, crr0.
Argument "" isn't numeric in addition (+) at /rstubbs/my_conda/envs/pecat-env/share/pecat-0.0.3-0/bin/Plgd/Job/Script.pm line 47, <F> line 3.
2024-05-29 21:53:12 [Info] Run script /rstubbs/projects/genome/pecat/genome/scripts/crr0_split.sh
2024-05-29 21:53:12 [Info] Sumbit command: /rstubbs/projects/genome/pecat/genome/scripts/crr0_split.sh 2>&1 | tee /rstubbs/projects/genome/pecat/genome/scripts/crr0_split.sh.log
Plgd script start: 2024-05-29 21:53:12
2024-05-29 21:53:12 [INFO] Start
2024-05-29 21:53:12 [INFO] Arguments: 
  reads = /rstubbs/projects/genome/pecat/genome/0-prepare/prepared_reads.fasta
  sub_reads = /rstubbs/projects/genome/pecat/genome/1-correct/0/readname.core.{}
  block_size = 1000000000
  base_size = 234000000000
  thread_size = 70
  method = plain
  overlaps = /rstubbs/projects/genome/pecat/genome/1-correct/0/rd2rd.txt
  sub_overlaps = /rstubbs/projects/genome/pecat/genome/1-correct/0/rd2rd.txt.sub.{}.paf

2024-05-29 21:53:12 [INFO] load read names
2024-05-29 21:55:50 [INFO] minimum length = 3000
2024-05-29 21:55:52 [INFO] read size = 18244710
2024-05-29 21:55:52 [INFO] Save read names
2024-05-29 21:55:58 [INFO] Save overlaps
2024-05-29 21:55:58 [ERROR] Failed to open file: /rstubbs/projects/genome/pecat/genome/1-correct/0/cc0.fasta.paf
/rstubbs/projects/genome/pecat/genome/scripts/crr0_split.sh: line 28: 414344 Aborted                 /rstubbs/my_conda/envs/pecat-env/share/pecat-0.0.3-0/bin/fsa_rd_tools split_name /rstubbs/projects/genome/pecat/genome/0-prepare/prepared_reads.fasta /rstubbs/projects/genome/pecat/genome/1-correct/0/readname.core.{} --block_size 1000000000 --base_size 234000000000 --overlaps /rstubbs/projects/genome/pecat/genome/1-correct/0/rd2rd.txt --sub_overlaps /rstubbs/projects/genome/pecat/genome/1-correct/0/rd2rd.txt.sub.{}.paf --thread_size 70
Plgd script end: 2024-05-29 21:55:58
2024-05-29 21:56:02 [Error] Failed to run spliting read names, crr0
lemene commented 3 months ago

@stubbsrl Have the files 1-correct/0/corrected_reads_X.fasta been deleted? Can you show me what other files are in the directory 1-correct/0/?

stubbsrl commented 3 months ago

No I still have the corrected reads fastas.

Inside the 1-correct/0 dir I have: corrected_reads_0.fasta.X, corrected_reads_0.fasta.X.infos, and rd2rd.txt. Yesterday when I tried to restart pecat, it made readname.core.X and rd2rd.txt.sub.X.paf in that dir.

lemene commented 3 months ago

crr0_split.sh is the previous step and should not be executed again; otherwise the corrected reads (corrected_reads_0.fasta.X) will be deleted. The timestamps of the files may have been incorrect, so PECAT did not accurately recognize the current status. If you don't want to discard the corrected reads, it's best to execute the scripts manually.
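One way to execute the scripts manually is to run every per-block script that has no .done marker yet. This is only a sketch with toy stand-in scripts, not PECAT's own restart logic:

```shell
# Toy stand-ins for the real per-block correction scripts:
mkdir -p scripts
printf 'echo ran block 1\n' > scripts/crr0_correct_1.sh
printf 'echo ran block 2\n' > scripts/crr0_correct_2.sh
echo 0 > scripts/crr0_correct_1.sh.done   # pretend block 1 already finished
# Run each script lacking a .done marker, then mark it done on success.
for s in scripts/crr0_correct_*.sh; do
  [ -e "$s.done" ] || { bash "$s" && echo 0 > "$s.done"; }
done
```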

stubbsrl commented 3 months ago

oh wow I'm very glad corrected reads didn't get deleted. So, what is the next script that I should execute manually? Everything in the scripts dir has a done file with it so nothing jumps out to me.

lemene commented 3 months ago

Each file crr0_correct_X.sh should be executed, generating the corresponding files corrected_reads_0.fasta.X and corrected_reads_0.fasta.X.infos. The corrected_reads_0.fasta.X are the corrected reads; they are concatenated into a single file as the output of the correction step. Each crr0_correct_X.sh requires the file rd2rd.txt.sub.X.paf, which is generated from ccX.fasta.paf by crr0_split.sh. Is rd2rd.txt.sub.X.paf deleted? It's a bit cumbersome to regenerate them. Following are the scripts in execution order with their input and output files:

crr0_rd2rd_split.sh: ../../0-prepare/prepared_reads.fasta -> ccX.fasta
crr0_rd2rd_index.sh: ../../0-prepare/prepared_reads.fasta -> reads.index
crr0_rd2rd_align_X.sh: ccX.fasta, reads.index -> ccX.fasta.paf
crr0_split.sh: ccX.fasta.paf -> rd2rd.txt.sub.X.paf, readname.core.X
crr0_correct_X.sh: rd2rd.txt.sub.X.paf, readname.core.X -> corrected_reads_0.fasta.X, corrected_reads_0.fasta.X.infos
crr0_cat.sh: corrected_reads_0.fasta.X -> ../corrected_reads.fasta

To regenerate rd2rd.txt.sub.X.paf, all ccX.fasta.paf are required, which means you need to redo the alignment between the reads.


drk0311 commented 1 week ago

Hi, I am the guy who opened this issue. The cause of core dump or segmentation fault issues often comes from the organelle genomes (chloroplast and mitochondrion). So if you are working with a plant genome of quite a large size (>3 Gbp), you'd better remove the reads derived from organelle genomes from your fastq files. Maybe your fastq files contain more than 10,000x coverage of the chloroplast genome!!

So, if the reference organelle genomes are available, map your reads to them, extract the unmapped reads or those with less than 50% of their read length mapped, and then run PECAT.
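The coverage filter described above can be sketched with awk on a PAF of reads mapped to the organelle references. The organelle.paf written here is a toy stand-in with only 4 columns; real PAF lines have 12+ columns, of which columns 1-4 are query name, length, start, and end:

```shell
# Toy PAF stand-in: read name, read length, query start, query end.
printf 'r1\t1000\t0\t600\nr2\t1000\t0\t100\n' > organelle.paf
# Print reads whose alignments cover >=50% of the read length; these are
# the organelle-derived reads to drop from the fastq before running PECAT.
awk -F'\t' '{cov[$1] += $4 - $3; len[$1] = $2}
            END {for (r in cov) if (cov[r] / len[r] >= 0.5) print r}' organelle.paf
```

With the toy records this prints r1 (600/1000 = 60% of the read aligned) and skips r2 (10%).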