natir / br

Brutal Rewrite a long read corrector
MIT License
2 stars 0 forks source link

FYI - Evaluation of different methods with HiFi reads and harmony #3

Closed jelber2 closed 2 years ago

jelber2 commented 2 years ago

FYI - following https://github.com/PacificBiosciences/harmony/issues/1#issuecomment-1025682192 using E coli ~30x PacBio CCS reads

reference=/nfs/scistore16/itgrp/bioinf/projects/DA0030/2021_Aug_27/analysis.5/raw/pg_asm-0.4.10/Sample9/9-peregrine-2021-0.4.10-3x-circlator.fasta
harmony=~/bin/harmony-0.2.0
cores=34
k=19

# activate pbmm2-1.7.0 and samtools-1.14
conda activate pbmm2-1.7.0
module load samtools/1.14

# align ccs.bam file to reference (ccs.bam generated with pbccs-6.3.0 with default settings)
pbmm2 align --best-n 1 --preset HiFi -j ${cores} --sort --bam-index BAI \
${reference} ../9.ccs.bam original-to-reference.bam

# run harmony-0.2.0
${harmony} -j ${cores} original-to-reference.bam ${reference} original

# convert BAM to FASTA
samtools fasta -@${cores} original-to-reference.bam > original.fasta

# run br [a7b62da](https://github.com/natir/br/commit/a7b62dac1e1c84e8767c23440e57c1e5a81f00fc)
# on ccs reads (E coli, ~30x ccs coverage) for each br method with k=19
rm -f br.log
for reads in `echo -e "one\ntwo\ngraph\ngreedy\ngap_size"`
do
    echo "running br with k=${k} and method=${reads}">> br.log 2>&1
    ~/git/br/target/release/br -t ${cores} -k ${k} -i original.fasta -m ${reads} -o ${reads}.fasta >> br.log 2>&1
done &

for reads in `echo -e "one\ntwo\ngraph\ngreedy\ngap_size"`
do
  echo "$reads"
  echo "step1"
  #  align reads to reference
  pbmm2 align --best-n 1 --preset HiFi -j ${cores} --sort --bam-index BAI \
  ${reference} ${reads}.fasta ${reads}-to-reference.bam 2>/dev/null

  echo "step2"
  # get just the read names of the ccs reads that were corrected by br (note - I used default settings of pbccs-6.3.0 to generate
  # the ccs reads) then filter the pbccs output aligned bam by those read names
  samtools view -@${cores} --qname-file <(samtools view -@34 ${reads}-to-reference.bam|cut -f 1) original-to-reference.bam > ${reads}-filter.sam

  echo "step3"
  # get SAM/BAM columns 12-20 for the ccs reads
  cat ${reads}-filter.sam|sort -k1,1|cut -f 12-20 > 12-20-${reads}

  echo "step4"
  # get columns 1-11, and 19 for the br corrected reads
  samtools view -@${cores} ${reads}-to-reference.bam |sort -k1,1|cut -f 1-11 > 1-11-${reads}
  echo "step5"
  samtools view -@${cores} ${reads}-to-reference.bam |sort -k1,1|cut -f 15 > 19-${reads}

  echo "step6"
  # combine all the columns together an make a BAM file
  samtools view -@${cores} -Sb <(cat <(samtools view -@${cores} -H ${reads}-to-reference.bam) <(paste 1-11-${reads} 12-20-${reads} 19-${reads})) > ${reads}-harmony.bam

  echo "step7"
  # this makes harmony run on the br corrected reads used the auxillary tags produced by pbccs
  ${harmony} -j ${cores} ${reads}-harmony.bam ${reference} ${reads}
done &

conda deactivate
conda activate harmony
~/bin/harmony/scripts/single.R original one two graph gap_size greedy

harmony

jelber2 commented 2 years ago

Note that Effective Coverage is not effective coverage of the reference assembly but of the specific CCS/HiFi read based on its PacBio subreads.

jelber2 commented 2 years ago

with -m graph and -k 21 added harmony

natir commented 2 years ago

Thanks for your interest on br and result you show.

But what is your question ?

jelber2 commented 2 years ago

My apologies, there is no question. Thank you for developing a great tool! Just thought I would share some results.

natir commented 2 years ago

No problem I just want be sure I didn't miss anything.

What is the meaning of "Mean Gap-Compressed Identity Phred" ?

jelber2 commented 2 years ago

@natir I am not entirely sure the exact meaning other than sharing the equation here - https://github.com/PacificBiosciences/harmony/blob/7f076ef4d3ea81d52502b052455398bbace7a818/scripts/single.R#L20 I could redo the analysis with https://github.com/PacificBiosciences/harmony/blob/7f076ef4d3ea81d52502b052455398bbace7a818/scripts/single.R#L21 (mean identity phred) tomorrow.

natir commented 2 years ago

Ok it's seems a combination of all type of error.

jelber2 commented 2 years ago
for i in samples
do
  echo $i
  ~/bin/minimap2-2.24_x64-linux/minimap2 -t 34 -a --secondary=no -x map-hifi \
    /nfs/scistore16/itgrp/bioinf/projects/DA0030/2021_Aug_27/analysis.5/raw/pg_asm-0.4.10/Sample9/9-peregrine-2021-0.4.10-3x-circlator.fasta \
    ${i}.fasta 2>/dev/null|samtools stats |grep ^SN | cut -f 2- > ${i}.stats
done

rm -f test
for i in `ls *.stats`
do
  j=`echo $i|perl -pe "s/.stats//g"`
  echo "${i}" >> test
  grep -Po "error rate:\t\S+" ${j}.stats |cut -f 2|awk '{printf "%0.1f\n",-10*log($1)/log(10);}' >> test
done

br_method                         PhredScore
gap_size.stats                    40.0
graph-k21.stats                   42.6
graph.stats                       41.0
greedy.stats                      26.9
one.stats                         35.4
original.stats                    26.9
two.stats                         31.0

@natir The table above is the error rate converted to Phred from samtools stats. Code is for only showing actual analysis steps.