lemene / PECAT

PECAT, a phased error correct and assembly tool
BSD 2-Clause "Simplified" License
36 stars 0 forks source link

Failed to run assembling overlaps, asm1 #34

Open enriquepola1996 opened 1 month ago

enriquepola1996 commented 1 month ago

Hello dear developers,

I am using PECAT to assemble a diploid genome (SMRT Cell, PacBio), however, the process is stuck at the assembly stage, the following error appears:

2024-07-31 11:03:57 [INFO] Load 1389389 reads from file: /data/car_manu/PECAT/demo/plant_manu/1-correct/corrected_reads.fasta
2024-07-31 11:03:59 [INFO] Load overlap file
2024-07-31 11:35:12 [INFO] Overlap size: 2293116649/3861383282
2024-07-31 11:35:20 [INFO] Group overlaps and remove duplicated
/data/car_manu/PECAT/demo/plant_manu/scripts/asm1_assemble.sh: line 16: 1840799 Killed                  /data/car_manu/PECAT/build/bin/fsa_ol_assemble /data/car_manu/PECAT/demo/plant_manu/2-align/overlaps.txt --thread_size=20 --output_directory=/data/car_manu/PECAT/demo/plant_manu/3-assemble --read_file=/data/car_manu/PECAT/demo/plant_manu/1-correct/corrected_reads.fasta --filter0 l=10000:al=5000
Plgd script end: 2024-07-31 11:35:47
2024-07-31 11:35:52 [Error] Failed to run assembling overlaps, asm1

My script:

project=plant_manu
reads=../../../Manuel/test.fasta.gz
genome_size=400000000
threads=20
cleanup=1
grid=local

prep_min_length=3000
prep_output_coverage=80

corr_iterate_number=1
corr_block_size=4000000000
corr_filter_options=--filter0=:al=5000:alr=0.5:aal=8000:aalr=0.5:oh=2000:ohr=0.2
corr_correct_options=--score=weight:lc=10 --aligner diff --filter1 oh=1000:ohr=0.01
corr_rd2rd_options=-x ava-pb
corr_output_coverage=80

align_block_size=4000000000
align_rd2rd_options=-X -g3000 -w30 -k19 -m100 -r500
align_filter_options=--filter0=l=3000:al=3000:alr=0.5:aalr=0.5:oh=1000:ohr=0.1 --task=extend --filter1=oh=100:ohr=0.01

asm1_assemble_options=--filter0 l=10000:al=5000

phase_rd2ctg_options=-x map-pb -c -p 0.5 -r 1000
phase_use_reads=1
phase_phase_options= --phase_options icr=0.2
phase_filter_options= --threshold=1000

asm2_assemble_options=--filter0 l=10000:al=5000 --contig_format dual,prialt

polish_use_reads=1
polish_map_options = -x asm20
polish_cns_options =

Can you help me please? Thanks.

lemene commented 1 month ago

Hi @enriquepola1996 The most likely cause is insufficient memory. You can verify this conclusion by running /var/bin/time -v asm1_assemble.sh.

enriquepola1996 commented 1 month ago

Hi @lemene , the process finish with this:

Plgd script end: 2024-08-01 11:37:05
        Command being timed: "./plant_manu/scripts/asm1_assemble.sh"
        User time (seconds): 4444.19
        System time (seconds): 254.80
        Percent of CPU this job got: 299%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 26:07.18
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 209713668
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1215
        Minor (reclaiming a frame) page faults: 52462642
        Voluntary context switches: 1493451
        Involuntary context switches: 23421
        Swaps: 0
        File system inputs: 378741728
        File system outputs: 13480
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
(pecat-env) villalobos@alnitak:/data/car_manu/PECAT/demo$ free -h
               total        used        free      shared  buff/cache   available
Mem:           251Gi        48Gi       200Gi       706Mi       1.9Gi       199Gi
Swap:          2.0Gi       2.0Gi          0B

Is my memory? Thanks so much.

enriquepola1996 commented 1 month ago

I have another question, it is regarding the prep_output_coverage parameter. In my sequencing report it indicates that a coverage of approximately 60X was achieved, so should I adjust the prep_output_coverage parameter based on this? The same for corr_output_coverage.

Thank you very much.

lemene commented 1 month ago

Hi, @enriquepola1996 corr_output_coverage and prep_output_coverage is used to reduce the size of datasets. In both error correction and assembly steps, PECAT implements all-vs-all comparisons of the reads. If the dataset coverage is too high, it can significantly impact the program’s running efficiency, with only marginal improvements in assembly quality. When corr_output_coverage and prep_output_coverage are greater than the actual coverage of the dataset, they will be ignored.

enriquepola1996 commented 1 month ago

Hello @lemene , the process finish with more memory but now I have two questions, the alternate.fasta is much smaller than the primary.fasta, is that normal? The script I ran is cfg_cattle_clr but I'm still waiting for cfg_arab_clr to finish. In both I set the genome size to 400Mb. In this first result I obtained a genome of 322Mb and an N50 of 18Mb (3-assemble) and 330Mb and an N50 of 16Mb (5-assemble) for primary assembly.

[epola@mazorka test_car_catle]$ ls
0-prepare  1-correct  2-align  3-assemble  4-phase  5-assemble  6-polish  scripts
[epola@mazorka test_car_catle]$ ls -lh 5-assemble/
total 641M
-rw-rw-r-- 1 epola epola 4.1M Aug  2 03:39 alternate.fasta
-rw-rw-r-- 1 epola epola  19K Aug  2 03:39 alternate_tiles
-rw-rw-r-- 1 epola epola 315M Aug  2 03:39 haplotype_1.fasta
-rw-rw-r-- 1 epola epola 1.7M Aug  2 03:39 haplotype_1_tiles
-rw-rw-r-- 1 epola epola 3.9M Aug  2 03:39 haplotype_2.fasta
-rw-rw-r-- 1 epola epola  20K Aug  2 03:39 haplotype_2_tiles
-rw-rw-r-- 1 epola epola 315M Aug  2 03:39 primary.fasta
-rw-rw-r-- 1 epola epola 1.7M Aug  2 03:39 primary_tiles
-rw-rw-r-- 1 epola epola 887K Aug  2 03:39 rest.fasta
-rw-rw-r-- 1 epola epola 5.4K Aug  2 03:39 rest_tiles

I am comparing these results with Canu, where I ran the following:

canu \
useGrid=false \
-p assembly \
-d canu_plant_400 \
genomeSize=400m \
-pacbio-raw test.fasta.gz \
corOutCoverage=200 \
"batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"

And I got an assembly of 322Mb and an N50 of 12Mb.

The other question is that the polishing process was not successful, what is the probable reason?

[epola@mazorka test_car_catle]$ ls -lh 6-polish/racon/
total 155M
-rw-rw-r-- 1 epola epola    0 Aug  2 07:11 alternate.fasta
-rw-rw-r-- 1 epola epola    0 Aug  2 07:11 haplotype_1.fasta
-rw-rw-r-- 1 epola epola    0 Aug  2 07:11 haplotype_2.fasta
-rw-rw-r-- 1 epola epola    0 Aug  2 07:11 primary.fasta
-rw-rw-r-- 1 epola epola 2.8M Aug  2 07:07 rd_2_alt_names
-rw-rw-r-- 1 epola epola  76M Aug  2 07:11 rd_2_hap1_names
-rw-rw-r-- 1 epola epola 1.9M Aug  2 07:11 rd_2_hap2_names
-rw-rw-r-- 1 epola epola  75M Aug  2 07:07 rd_2_pri_names

The final of the error.log file:

2024-08-02 03:15:03 [Info] Start assembling overlaps, asm2.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 40.
2024-08-02 03:15:03 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/asm2_assemble.sh
2024-08-02 03:15:03 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/asm2_assemble.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/asm2_assemble.sh.log
2024-08-02 03:42:08 [Info] End assembling overlaps, asm2.
2024-08-02 03:42:08 [Info] Start running job pol_job, polishing contigs
2024-08-02 03:42:08 [Info] Start polishing contigs.
2024-08-02 03:42:08 [Info] Job::Serial::submit_core pol_job
2024-08-02 03:42:08 [Info] Start running job pol_job_racon, polishing contigs using racon
2024-08-02 03:42:08 [Info] Start polishing contigs using racon.
2024-08-02 03:42:08 [Info] Job::Parallel::submit_core pol_job_racon
2024-08-02 03:42:08 [Info] Start running job pol_job, polishing job for pri/alt contigs
2024-08-02 03:42:08 [Info] Start polishing job for pri/alt contigs.
2024-08-02 03:42:08 [Info] Job::Serial::submit_core pol_job
2024-08-02 03:42:08 [Info] Start running job pol_map, mapping reads to contigs
2024-08-02 03:42:08 [Info] Start mapping reads to contigs.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 41.
2024-08-02 03:42:08 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_map.sh
2024-08-02 03:42:08 [Info] Start running job pol_job_hap, polishing dual format contigs
2024-08-02 03:42:08 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_map.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_map.sh.log
2024-08-02 03:42:08 [Info] Start polishing dual format contigs.
2024-08-02 03:42:08 [Info] Job::Serial::submit_core pol_job_hap
2024-08-02 03:42:08 [Info] Start running job pol_map_hap, mapping reads to contigs
2024-08-02 03:42:08 [Info] Start mapping reads to contigs.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 41.
2024-08-02 05:23:13 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_map_hap.sh
2024-08-02 05:23:13 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_map_hap.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_map_hap.sh.log
2024-08-02 05:23:13 [Info] End mapping reads to contigs.
2024-08-02 05:23:13 [Info] Start running job pol_filter, filter overlaps in which the reads are different haplotype
2024-08-02 05:23:13 [Info] Start filter overlaps in which the reads are different haplotype.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 42.
2024-08-02 07:03:39 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_filter.sh
2024-08-02 07:03:39 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_filter.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_filter.sh.log
2024-08-02 07:03:39 [Info] End mapping reads to contigs.
2024-08-02 07:03:39 [Info] Start running job pol_filter_hap, filter overlaps in which the reads are different haplotype
2024-08-02 07:03:39 [Info] Start filter overlaps in which the reads are different haplotype.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 43.
2024-08-02 07:07:34 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_filter_hap.sh
2024-08-02 07:07:34 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_filter_hap.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_filter_hap.sh.log
2024-08-02 07:07:34 [Info] End filter overlaps in which the reads are different haplotype.
2024-08-02 07:07:34 [Info] Start running job pol_polish, polishing pri/alt format contigs
2024-08-02 07:07:34 [Info] Start polishing pri/alt format contigs.
2024-08-02 07:07:34 [Info] Job::Parallel::submit_core pol_polish
2024-08-02 07:07:34 [Info] Start running job pol_polish_prj, polishing primary contigs
2024-08-02 07:07:34 [Info] Start polishing primary contigs.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 44.
2024-08-02 07:07:34 [Info] Start running job pol_polish_alt, polishing alternate contigs
2024-08-02 07:07:34 [Info] Start polishing alternate contigs.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 44.
2024-08-02 07:11:29 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_prj.sh
2024-08-02 07:11:29 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_prj.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_prj.sh.log
2024-08-02 07:11:29 [Info] End filter overlaps in which the reads are different haplotype.
2024-08-02 07:11:29 [Info] Start running job pol_polish_hap, polishing dual format contigs
2024-08-02 07:11:29 [Info] Start polishing dual format contigs.
2024-08-02 07:11:29 [Info] Job::Parallel::submit_core pol_polish_hap
2024-08-02 07:11:29 [Info] Start running job pol_polish_hap_1, polishing dual_1 contigs
2024-08-02 07:11:29 [Info] Start polishing dual_1 contigs.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 45.
2024-08-02 07:11:29 [Info] Start running job pol_polish_hap_2, polishing dual_2 contigs
2024-08-02 07:11:29 [Info] Start polishing dual_2 contigs.
Argument "" isn't numeric in addition (+) at /LUSTRE/storage/data/software/PECAT/build/bin/Plgd/Job/Script.pm line 47, <F> line 45.
2024-08-02 07:11:34 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_hap_2.sh
2024-08-02 07:11:34 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_hap_2.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_hap_2.sh.log
2024-08-02 07:11:34 [Info] End polishing primary contigs.
2024-08-02 07:11:39 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_hap_1.sh
2024-08-02 07:11:39 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_hap_1.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_hap_1.sh.log
2024-08-02 07:11:39 [Info] End polishing dual_2 contigs.
2024-08-02 07:11:44 [Info] Run script /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_alt.sh
2024-08-02 07:11:44 [Info] Sumbit command: /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_alt.sh 2>&1 | tee /LUSTRE/usuario/epola/pecat_assembly/test_car_catle/scripts/pol_polish_alt.sh.log
2024-08-02 07:11:44 [Info] End polishing dual_1 contigs.
2024-08-02 07:11:44 [Info] End polishing dual format contigs.
2024-08-02 07:11:44 [Info] End polishing dual format contigs.
2024-08-02 07:11:49 [Info] End polishing alternate contigs.
2024-08-02 07:11:49 [Info] End polishing pri/alt format contigs.
2024-08-02 07:11:49 [Info] End polishing job for pri/alt contigs.
2024-08-02 07:11:49 [Info] End polishing contigs using racon.
2024-08-02 07:11:49 [Info] End polishing contigs.

Thanks so much.

lemene commented 1 month ago

Hi @enriquepola1996

  1. The alternate.fasta file is much smaller than the primary.fasta file when the heterozygosity of the diploid species is very low. This is normal.
  2. The reason PECAT with cfg_arab_clr runs slower than with cfg_cattle_clr is due to the parameters corr_rd2rd_options and align_rd2rd_options, which are passed to minimap2. '-f` is the key parameter. See Note
  3. The failure of the Racon may be due to insufficient memory. Try deleting the 6-polish directory and then run PECAT again with larger memory.
enriquepola1996 commented 1 month ago

@lemene Thank you very much for your valuable answers, I have two last question, do you recommend running purge haplotigs on the primary assembly (5-assemble)? Do you recommend polishing with raw or corrected reads?

Thanks so much.

lemene commented 1 month ago

@enriquepola1996

  1. 5-assemble/primary.fasta does not require purging of haplotigs. Ideally, different haplotypes should be output in 5-assemble/alternate.fasta.
  2. Based on my experience,we recommend using corrected reasds for PacBio CLR data and raw reads for ONT data.
enriquepola1996 commented 1 month ago

@lemene Thanks so much for you support, you can close this issue as resolved.