mobinasri / flagger

Evaluating genome assemblies
MIT License
58 stars 4 forks source link

duplicated.bed is empty #7

Closed zhoudreames closed 1 year ago

zhoudreames commented 2 years ago

Following your flagger pipline, I finally gain the four files, the haploid.bed(99.55%),error.bed(0.013%), collosed.bed( 0.437%) and duplicated.bed(0%). The duplicated.bed file here is completely empty. I don't know what the problem is, how to deal with it ? Thanks This my code:

#1.Mapping HiFi reads by winnowmap
$Winnowmap/meryl count k=15 output merylDB $ref
$Winnowmap/meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

$Winnowmap/winnowmap -Y -L --eqx --cs -I8g  -W repetitive_k15.txt  -ax map-pb $ref $reads_pool  > HiFi_all.sam
samtools view -hb -@20 DRCMale_HiFi_all.sam -o HiFi_all.bam
bam=HiFi_all.bam
ref=Combined.fa
prefix=dipGenome
EXPECTED_COVERAGE=28

#2.secphase:gain the corrected bam
#Phase reads for HiFi
docker run   \
-v "${PWD}":"/input" \
mobinasri/secphase:v0.1 \
secphase -q -c -t10 -d 1e-4 -e 0.1 -b20 -m20 -s40 \
-i "/input/$bam" \
-f "/input/$ref" >PHASING_OUT.log
#produce the modified bam file
docker run   \
-v "${PWD}":"/input" \
-v "${PWD}":"/output" \
mobinasri/secphase:v0.1 \
correct_bam \
-i "/input/$bam" \
-P "/input/PHASING_OUT.log" \
-o "/output/${prefix}.corrected.bam" \
--primaryOnly

samtools sort -@20 ${prefix}.corrected.bam >${prefix}.corrected.sorted.bam
samtools index -@20 ${prefix}.corrected.sorted.bam

#3.DeepVariants
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION="1.2.0"
docker run \
  -v "${PWD}":"/input" \
  -v "${PWD}/output":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type "PACBIO" \
  --ref /input/$ref \
  --reads /input/${prefix}.corrected.sorted.bam \
  --output_vcf /output/${prefix}.vcf.gz \
  --make_examples_extra_args="keep_supplementary_alignments=true,min_mapping_quality=0" \
  --call_variants_extra_args="use_openvino=true" \
  --num_shards 144 \
  --intermediate_results_dir /output/intermediate_results_dir \
  --dry_run=false
bcftools view -Ov -f PASS -m2 -M2 -v snps -e 'FORMAT/VAF< 0.3 | FORMAT/GQ< 10' output/${prefix}.vcf.gz > ${prefix}.filtered.vcf.gz

#4.Flagger: Detected misassmeblies
docker run   \
-v "${PWD}":"/input" \
-v "${PWD}":"/output" \
 mobinasri/flagger:v0.1 \
 filter_alt_reads \
-i "/input/${prefix}.corrected.sorted.bam" \
-o "/output/${prefix}.alt_filtered.bam" \
-f "/output/${prefix}.alt.bam" \
-v "/input/${prefix}.filtered.vcf.gz" \
-m 100 -r 0.4

## Find base-level coverages
#samtools depth  -aa -Q 0 ${prefix}.alt_filtered.bam > read_alignment.depth
# Convert depth to cov
docker run \
 -v "${PWD}":"/input" \
 -v "${PWD}":"/output" \
 mobinasri/flagger:v0.1 \
 depth2cov \
 -d "/input/read_alignment.depth" \
 -f "/input/$ref.fai" \
 -o "/output/read_alignment.cov"

#Coverage Distribution and Fitting The Mixture Model
docker run \
  -v "${PWD}":"/input" \
 -v "${PWD}":"/output" \
 mobinasri/flagger:v0.1 \
 cov2counts \
 -i "/input/read_alignment.cov" \
 -o "/output/read_alignment.counts"

### Run fit_model_extra.py to fit the model
docker run \
   -v "${PWD}":"/input" \
 -v "${PWD}":"/output" \
 mobinasri/flagger:v0.1 \
 python3 /home/programs/src/fit_gmm.py \
 --counts  "/input/read_alignment.counts" \
 --cov ${EXPECTED_COVERAGE} \
 --output "/output/read_alignment.table" 

## Run find_blocks_from_table
docker run \
  -v "${PWD}":"/input" \
 -v "${PWD}":"/output" \
 mobinasri/flagger:v0.1 \
 find_blocks_from_table \
 -c "/input/read_alignment.cov" \
 -t "/input/read_alignment.table" \
 -p "/output/$prefix"
zhoudreames commented 2 years ago

@mobinasri I am so sorry to disturb you, can you help me ? Thanks~

mobinasri commented 2 years ago

@zhoudreames Sorry for the late response. In general it is fine to have empty duplication bed file. But it seems you only used whole-genome coverage distribution which may lack sensitivity in some cases. The reason is explained in this link https://github.com/mobinasri/flagger/blob/main/docs/flagger/README.md#1-window-specific-models

It is recommended to perform all the correction steps. One of them is splitting the assembly into windows of length 5-10Mb and then run fit_gmm.py on each window separately. It can help the model to detect false duplication if the duplication rate is low and not visible in the whole-genome coverage distribution.

It is easier to use the WDL files (Flagger_Preprocess and Flagger.wdl) provided for this aim.