seqan / iGenVar

The official repository for the iGenVar project.
BSD 3-Clause "New" or "Revised" License

[BENCHMARK] dataset comparison igenvar only #201

Closed Irallia closed 2 years ago

Irallia commented 2 years ago

UPDATE: look at the new plots below.

This plot shows the results of iGenVar on 2 short-read and 3 long-read sets and their combinations.

[Figure: iGenVar_only-results all]

Some of the example BAM files (all HG002) are aligned to different references:

    MtSinai_PacBio:     GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    PacBio_CCS_10kb:    hs37d5.fa
    10X_Genomics:       hg19.reordered.fa
    Illumina:           GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa
    Illumina_Mate_Pair: GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa
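
One way to confirm which inputs share a reference is to compare the `@SQ` lines of their headers. The following is a minimal sketch, not part of iGenVar: it assumes the header text has already been extracted (for example with `samtools view -H`), and the function names `parse_sq_lines` and `same_reference` are hypothetical.

```python
def parse_sq_lines(header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type has the form TAG:VALUE.
        tags = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def same_reference(header_a, header_b):
    """True if both headers list the same contig names with the same lengths."""
    return parse_sq_lines(header_a) == parse_sq_lines(header_b)
```

Two headers that differ in contig naming (`chr1` vs. `1`) or in contig length would be reported as different references, which is the situation described for the GRCh38 and hs37d5/hg19 alignments above.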

Since the truth set does not contain DUPs, I created another plot where all DUPs are interpreted as INS.

[Figure: iGenVar_only-results DUP_as_INS all]
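
For illustration, relabeling DUP calls as INS before comparing against the truth set could look roughly like the sketch below. This is an assumption about one possible approach, not the script actually used for the plot; the function name `dup_as_ins` is hypothetical, and the sketch only rewrites the symbolic ALT allele and the `SVTYPE` INFO key of a VCF data line.

```python
import re


def dup_as_ins(vcf_line):
    """Relabel a DUP record as INS; header lines and other records pass through."""
    if vcf_line.startswith("#"):
        return vcf_line
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return vcf_line
    # Column 5 (ALT): symbolic alleles such as <DUP> or <DUP:TANDEM>.
    if fields[4].startswith("<DUP"):
        fields[4] = "<INS>"
    # Column 8 (INFO): SVTYPE=DUP or a subtype like SVTYPE=DUP:TANDEM.
    fields[7] = re.sub(r"(^|;)SVTYPE=DUP[^;]*", r"\1SVTYPE=INS", fields[7])
    return "\t".join(fields) + newline
```

Note that this leaves `END` and `SVLEN` untouched; whether those need adjusting depends on how the comparison tool matches insertions.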

The important question now is whether most of the sets are really that bad, or whether iGenVar simply does not find the SVs. I want to find this out by comparing with other callers.

codecov[bot] commented 2 years ago

Codecov Report

Merging #201 (41fbde1) into master (e02d86f) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 41fbde1 differs from pull request most recent head 9de92e5. Consider uploading reports for the commit 9de92e5 to get more accurate results

@@           Coverage Diff           @@
##           master     #201   +/-   ##
=======================================
  Coverage   98.35%   98.35%           
=======================================
  Files          18       18           
  Lines         850      850           
=======================================
  Hits          836      836           
  Misses         14       14           


Irallia commented 2 years ago

Updated plots (GRCh37):

[Figure: iGenVar_only-results all]

DUPs as INS:

[Figure: iGenVar_only-results DUP_as_INS all]

joshuak94 commented 2 years ago

This is interesting! It makes sense that SVs are much easier to call from PacBio CCS reads: they have long read lengths and relatively high accuracy. However, it is a bit concerning that Illumina mate-pair reads alone result in such low accuracy. I'd be curious to see how this looks for a single SV type such as deletions, since deletions can be detected somewhat robustly just via read depth.

If we can find out whether a specific kind of variant is bringing the whole curve down, we can better understand what the issue is.
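
A per-type breakdown along these lines could be sketched as follows. This is only an illustration under simplifying assumptions, not the benchmark procedure used for the plots: calls and truth entries are plain `(chrom, pos, svtype)` tuples, a call counts as correct if a truth entry of the same type lies within `max_dist` bases, and the names `per_type_metrics` and `max_dist` are hypothetical.

```python
from collections import defaultdict


def per_type_metrics(calls, truth, max_dist=500):
    """Return {svtype: (precision, recall)}; a value is None if undefined."""
    truth_pos = defaultdict(list)
    for chrom, pos, svtype in truth:
        truth_pos[(chrom, svtype)].append(pos)

    n_calls = defaultdict(int)     # calls per SV type
    true_calls = defaultdict(int)  # calls with a nearby truth entry
    matched = defaultdict(set)     # indices of truth entries that were found

    for chrom, pos, svtype in calls:
        n_calls[svtype] += 1
        candidates = truth_pos.get((chrom, svtype), [])
        hits = [i for i, t in enumerate(candidates) if abs(t - pos) <= max_dist]
        if hits:
            true_calls[svtype] += 1
            # Credit the nearest truth entry as found.
            nearest = min(hits, key=lambda i: abs(candidates[i] - pos))
            matched[(chrom, svtype)].add(nearest)

    n_truth = defaultdict(int)
    found = defaultdict(int)
    for (chrom, svtype), positions in truth_pos.items():
        n_truth[svtype] += len(positions)
        found[svtype] += len(matched[(chrom, svtype)])

    result = {}
    for svtype in set(n_calls) | set(n_truth):
        precision = true_calls[svtype] / n_calls[svtype] if n_calls[svtype] else None
        recall = found[svtype] / n_truth[svtype] if n_truth[svtype] else None
        result[svtype] = (precision, recall)
    return result
```

Comparing the resulting values for DEL, INS and DUP per input set would show whether one type dominates the drop in the combined curve.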