kishwarshafin / pepper

PEPPER-Margin-DeepVariant
MIT License
244 stars 42 forks source link

am I running this wrong? #183

Open shinlin77 opened 1 year ago

shinlin77 commented 1 year ago

I have SR and LR on the same genomes. Previously, I used the following pipelines on the

SR: bowtie2 -> picard -> GATK LR: minimap2 -> clair3 -> GATK (to round up the gvcfs)

and got reasonable concordance between the two, meaning most of the variants in one were in the other. Moreover, for variants called by both, the majority of the individual calls were the same, too.

Now, I was told PMD was a great caller for ONT data, so I used the following updated pipeline

LR updated: minimap2 -> PMD -> GLnexus (to round up the gvcfs)

however, the SR and LR updated results had almost no concordance. I feel maybe I'm not running PMD right. This is the commandline

singularity exec --bind /where_data_is/ --bind /local/scratch/ --bind /where_genome_is/ /server/apps/software/PEPPER_Margin_DeepVariant/0.8/docker/pepper_deepvariant run_pepper_margin_deepvariant call_variant -b LR_aligned.sorted.bam -f human_genome.fasta -o /output/PMD -p pepper_margin_deepvariant -t 16 --ont_r9_guppy5_sup --gvcf

Interestingly, the result has very few SNVs (they are almost all SVs). Do you have any suggestions (e.g. parameters to tweak)? Thanks.

kishwarshafin commented 1 year ago

Hello @shinlin77 ,

Can you provide a little more information about this?

1) How (not why, sorry) are you running GATK after clair3? Would it be the case that you are running GATK for both SR and LR and seeing high concordance?

2) What is the basecaller you are using for the data, can you make sure it's SUP?

3) What are the GLnexus parameters are you using? I suspect this is where you are filtering out a lot of variants.

Another way to compare is to take one sample, run PMDV and GATK and then compare before rounding up GVCFs, as this pipeline is modular, it should be easy to compare. You can also compare the VCFs using hap.py to derive a concordance set. We usually expect very high concordance if you have sufficient coverage.

shinlin77 commented 1 year ago

Here are my answers...

  1. For LR pipeline, I am using GATK not to call (clair3 is doing that). I use GATK GenotypeGVCFs just round up all the individual gvcfs generated from clair3.

  2. Sorry, I don't understand your comment on SUP. Which pipeline do you want me to look into (SR, LR, or LR updated)?

  3. singularity run -B /where_files_are/:/in -B /where_files_are/ $GLNEXUS_DOCKERFILE glnexuscli --config DeepVariant -m 256 -t 12 --dir /tmp/GLnexus.DB \ /data/PMD/individual1.g.vcf.gz \ /data/PMD_/individual2.g.vcf.gz \ ... > /output/cohort_PMD.bcf

I don't think the aggregation of the gvcfs step is making that big of a difference.

kishwarshafin commented 1 year ago

@shinlin77 , what happens if you use GATK GenotypeGVCFs with PMDV gvcfs? This behavior is inconsistent.

shinlin77 commented 1 year ago

GATK GenotypeGVCFs does not work on PMDV gvcfs. There are too many incompatibilities. For example PMDV gvcfs have

<*>

instead of

I changed those, which was easy enough, but then I ran into other incompatibilities that I couldn't figure out.
kishwarshafin commented 1 year ago

hi @shinlin77 ,

One of the easier way to reduce variability between pipeline and compare the variant calls is to take these files from one sample (not combined):

1) VCF output (not gvcf) from short read variant call (short_read.vcf.gz) 2) VCF output from long read PMDV pipeline (pmdv.vcf.gz)

And run:

docker run -it -v /input:/input \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/input/short_read.vcf.gz \
/input/pmdv.vcf.gz \
-r /input/human_reference.fasta \
-o /input/happy_output \
--pass-only \
--no-roc \
--no-json \
--engine=vcfeval \
--threads=$(nproc)

The TP number will tell you how many variants are exactly the same (GT and allele-wise) between the two VCFs. You can do the same with clair3 vcf and compare the numbers.

shinlin77 commented 1 year ago

Here are the results. I'm not sure why the TRUTH.TOTAL columns are different when using the exact same gold standard, but apparently, somebody else has asked that same question on the hap.py github portal (without a response). As you can see, the recall for PMDV is very low.

image

kishwarshafin commented 1 year ago

@shinlin77 ,

The numbers look very discordant here. I am also unsure what is happening. Is it possible for you to share the bams so I can run them to see if there's something we are missing? Only chr20 would do in this case I think as the numbers are so off.