vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

Too low precision for indel #2930

Open a00101 opened 4 years ago

a00101 commented 4 years ago

Dear developer.

I am working on measuring the performance of vg using Illumina Platinum Genome data.

Compared to hisat2, it was confirmed that indel precision in vg was too low. I used 1KG reference file for constructing graph for hisat2 and vg the same. Is there any way to improve it? Can you peruse the command below?

Use [vg: variation graph tool, version v1.25.0 "Apice"]

> vg construct -r hg38.fa -v dbsnp.vcf.gz > first.vg
> vg convert -x first.vg > first.xg
> vg prune -r first.vg > second.vg
> vg index -g third.gcsa second.vg
> vg map -x first.xg -g third.gcsa -f test_trimmed_1.fq.gz -f test_trimmed_2.fq.gz > fourth.gam
> vg pack -x first.xg -g fifth.gam -Q 5 -o seventh.pack
> vg call -x first.xg -k seventh.pack > final.vcf
<RESULTS>
<hisat2 + strelka2>
F1_snp = 0.97
F1_ind = 0.87

<vg>
F1_snp = 0.96
F1_ind = 0.58
glennhickey commented 4 years ago

That does indeed seem too low. What is the read depth? Which tool are you using to compute accuracy? What is the precision and recall? Sometimes you can squeeze a bit more accuracy out of small variant calling with the vg filter command at the bottom of here but I think there's something more serious going on.

a00101 commented 4 years ago

Thank you for your rapid answer.

1) read depth

2) Which tool

3) precision and recall

The most likely reason seems to be wrong when constructing a graph. Are there any errors in the command?

ekg commented 4 years ago

Why don't you also use vg + strelka? This would confirm if the issue is in the caller or not.

On Thu, Jul 30, 2020, 03:00 a00101 notifications@github.com wrote:

Thank you for your rapid answer.

  1. read depth

    • How to calculate read depth from GAM file?
  2. Which tool

    • hap.py
  3. precision and recall

    • INDEL Precision 0.62 Recall 0.55
    • SNP Precision 0.99 Recall 0.94

The most likely reason seems to be wrong when constructing a graph. Are there any errors in the command?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2930#issuecomment-666011939, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQELICDT7S2VOFEWOD5TR6DA35ANCNFSM4PMKUJFA .

a00101 commented 4 years ago

Thank you for your reply. I will show you the result.

ekg commented 4 years ago

You can do so with --surject-to bam in vg map.

On Thu, Jul 30, 2020 at 9:48 AM a00101 notifications@github.com wrote:

Thank you for your reply.

Where Can I find convert gam to bam?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2930#issuecomment-666199350, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEKCQ2NBKWXQK56YG3LR6EQVTANCNFSM4PMKUJFA .

ekg commented 4 years ago

Sorry, then the output will go directly to BAM. Otherwise, to convert the GAM, use vg surject.

On Thu, Jul 30, 2020 at 12:56 PM Erik Garrison erik.garrison@gmail.com wrote:

You can do so with --surject-to bam in vg map.

On Thu, Jul 30, 2020 at 9:48 AM a00101 notifications@github.com wrote:

Thank you for your reply.

Where Can I find convert gam to bam?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2930#issuecomment-666199350, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEKCQ2NBKWXQK56YG3LR6EQVTANCNFSM4PMKUJFA .

glennhickey commented 4 years ago

You can get the coverage with vg depth -g. Another way to check if the caller's at issue is running som.py instead of hap.py -- if the results are comparable then it's probably not the caller.

The way you are running is only going to call variants exactly as they are in the graph. So if there's a small difference in the indels you're calling and the indels you're comparing to, that could also be a source of the error. To account for differences between the sample and the graph, you need to run the augmentation step.