lemene / PECAT

PECAT, a phased error correct and assembly tool
BSD 2-Clause "Simplified" License
36 stars 0 forks source link

Improving haplotypes and primary/alternate genomes #38

Open Karenmagh opened 3 weeks ago

Karenmagh commented 3 weeks ago

Hello dear developers,

I have run PECAT on a diploid plant genome (heterozygosity=3.3) and obtained a primary (4.7Gb) and alternate (2.6Gb) set and haplotype1 (4.7Gb) and haplotype2 (2.8Gb). Do you think there is a possibility of improving the size of the genomes by moving some parameters? I will greatly appreciate your response.

My script is this:

project=plant
reads= $READS
genome_size= 4200000000
threads=20
grid=local
cleanup=1

prep_min_length=3000
prep_output_coverage=50

corr_iterate_number=1
corr_block_size=8000000000
corr_filter_options=--filter0=:al=5000:alr=0.5:aal=8000:aalr=0.5:oh=2000:ohr=0.2
corr_correct_options=--score=weight:lc=10 --aligner diff --filter1 oh=1000:ohr=0.01 --candidate n=300:f=20
corr_rd2rd_options=-x ava-pb -f 0.005 -I 20G
corr_output_coverage=50

align_block_size=4000000000
align_rd2rd_options=-X -g3000 -w30 -k19 -m100 -r500 -I 20G -f 0.005
align_filter_options=--filter0=l=3000:al=3000:alr=0.5:aalr=0.5:oh=1000:ohr=0.1 --task=extend --filter1=oh=100:ohr=0.01

asm1_assemble_options= --max_trivial_length 10000

phase_rd2ctg_options=-x map-pb -c -p 0.5 -r 1000
phase_use_reads=1
phase_phase_options= --phase_options icr=0.2
phase_filter_options= --threshold=1000

asm2_assemble_options= --max_trivial_length 10000 --contig_format dual,prialt

polish_use_reads=1
polish_map_options = -x asm20
polish_filter_options = --filter0 oh=1000:ohr=0.1
polish_cns_options =
lemene commented 2 weeks ago

@Karenmagh Sorry for taking so long to get back to you! The dataset is high heterozygosity. PECAT failed to pair some contigs according to their haplotypes. --contig_dup_rate=0.2 may work. Its default value is 0.3. The config can be modified as follows.

asm2_assemble_options= --max_trivial_length 10000 --contig_format dual,prialt --contig_dup_rate=0.2
Karenmagh commented 1 week ago

Hello @lemene,

Thank you very much for your response, I have tried --contig_dup_rate=0.2 and my haplotypes adjusted a little more, I also tried 0.15 and now I have haplotype 1 of 4Gb and haplotype 2 of 3.2Gb, could you help me explain a little about the management appropriate of this parameter? Is there a problem if I lower it more?

I will greatly appreciate your help.

lemene commented 1 week ago

Hi @Karenmagh I'm glad this parameter is working. If heterozygosity is low, two reads from different haplotypes might map to the same position in the initial assembly (3-assemble/primary.fasta). PECAT can identify that the overlap between them is inconsistent using the SNP alleles. The pair of the reads is inconsistent. --contig_dup_rate=0.2 means if 20% of the reads that make up one contig are inconsistent with the reads of another contig, then one contig will be placed in 5-assemble/primary.fasta and the other in 5-assemble/alternate.fasta. If heterozygosity is high, reads from different haplotypes may not overlap, making it difficult for PECAT to identify inconsistent read pairs. Therefore, it is necessary to reduce this parameter. But I'm unsure of the outcome if it is set too low.