vgteam / toil-vg

Distributed and cloud computing framework for vg
Apache License 2.0

Pass variant names via the GBWT #741

Open glennhickey opened 5 years ago

glennhickey commented 5 years ago

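There are two ways to look at variation inside SVs:

  • Run the calling in something like --recall mode, but turn on graph augmentation. Then compare the called insertion sequences back to the input. This is what we did at the hackathon.
  • Preserve the paths of SVs from the input VCF to construction. Then run vg as a SNP caller on each path. I've got a run going on chr1 for this, by way of hacking vg construct -a to write a normal (instead of alt) path for input variants.

That second option simply does not scale to whole genome, because xg cannot index all the paths from the input VCF. So it's best just to use the GBWT. To that end, we'd need to:

  • have an option to write human-readable alt sequence names (fairly trivial), provided the input VCF contains unique IDs (often the case)
  • input the GBWT to toil-vg call with an option to call on GBWT threads in addition to or instead of normal paths
  • input the GBWT to vg call (if possible) so that the path name can get passed into the VCF INFO field (if not calling on the path directly).

Calling on the SV paths directly allows using different parameters for SNPs than SVs (probably necessary). But doing SVs and nested variation at the same time will have a chance at resolving breakpoints.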
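
To make the human-readable naming idea concrete, here is a minimal sketch of deriving alt path names from VCF IDs. It is not vg or toil-vg code: the function name, the `<ID>_<allele number>` naming scheme, and the example IDs are all assumptions for illustration. It refuses to produce names when an ID is missing or repeated, since the names would then no longer point at a single input variant.

```python
from collections import Counter

def alt_path_names(records):
    """Map (variant index, alt allele number) to a readable path name.

    `records` is a list of (variant_id, n_alts) tuples in VCF order.
    Raises ValueError if any ID is missing (".") or repeated.
    The "<ID>_<allele>" scheme is hypothetical, not vg's own convention.
    """
    counts = Counter(vid for vid, _ in records)
    bad = sorted(vid for vid, n in counts.items() if vid == "." or n > 1)
    if bad:
        raise ValueError(f"missing or duplicate variant IDs: {bad}")
    names = {}
    for i, (vid, n_alts) in enumerate(records):
        for allele in range(1, n_alts + 1):  # allele 0 is the reference
            names[(i, allele)] = f"{vid}_{allele}"
    return names

# Made-up records: one deletion with a single alt, one insertion with two alts.
names = alt_path_names([("DEL_chr1_1000", 1), ("INS_chr1_5000", 2)])
print(names[(1, 2)])  # INS_chr1_5000_2
```

Failing loudly on non-unique IDs is one possible choice; falling back to a hash-based name for just those records would be another.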

adamnovak commented 5 years ago

So you want to store the alt paths for large SVs as threads in the GBWT, and then call against them?

We'd want to keep a separate GBWT from the one we use for storing the haplotypes, I think. And the GBWT format isn't designed for efficient offset determination, and I'm not sure how fast it is at tracing out a whole thread, so we'd be kind of misusing it. But these SV threads would be short and few enough that we probably could extract them and index them for offset queries in memory when needed.
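
A rough sketch of what indexing an extracted thread for offset queries in memory could look like follows. It assumes the thread has already been pulled out of the GBWT as a list of (node ID, node length) steps; that extraction step is not shown. The class and method names are hypothetical, and node orientation is ignored for brevity.

```python
from bisect import bisect_right
from itertools import accumulate

class PathOffsetIndex:
    """Offset lookup for one short path, given as (node_id, node_length) steps."""

    def __init__(self, steps):
        self.node_ids = [node_id for node_id, _ in steps]
        lengths = [length for _, length in steps]
        # starts[i] is the path offset of the first base of step i.
        self.starts = [0] + list(accumulate(lengths))[:-1]
        self.length = sum(lengths)

    def node_at(self, offset):
        """Return (node_id, offset within node) for a 0-based path offset."""
        if not 0 <= offset < self.length:
            raise IndexError(offset)
        # Binary search for the last step starting at or before the offset.
        i = bisect_right(self.starts, offset) - 1
        return self.node_ids[i], offset - self.starts[i]

# Hypothetical SV thread: three nodes of 10, 250 and 7 bp.
index = PathOffsetIndex([(101, 10), (205, 250), (102, 7)])
print(index.node_at(0))    # (101, 0)
print(index.node_at(10))   # (205, 0)
print(index.node_at(266))  # (102, 6)
```

Each lookup is a binary search over the step start offsets, so for short SV threads the cost is dominated by extracting them in the first place.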

Shouldn't we consider an alternative, snarl-based variant call representation for this instead, though? Or is this a prerequisite for applying some of the VCF extensions proposed at the hackathon?

On 3/27/19, Glenn Hickey wrote:

There are two ways to look at variation inside SVs:

  • Run the calling in something like --recall mode, but turn on graph augmentation. Then compare the called insertion sequences back to the input. This is what we did at the hackathon.
  • Preserve the paths of SVs from the input VCF to construction. Then run vg as a SNP caller on each path. I've got a run going on chr1 for this, by way of hacking vg construct -a to write a normal (instead of alt) path for input variants.

That second option simply does not scale to whole genome, because xg cannot index all the paths from the input VCF. So it's best just to use the GBWT. To that end, we'd need to:

  • have an option to write human-readable alt sequence names (fairly trivial), provided the input VCF contains unique IDs (often the case)
  • input the GBWT to toil-vg call with an option to call on GBWT threads in addition to or instead of normal paths
  • input the GBWT to vg call (if possible) so that the path name can get passed into the VCF INFO field (if not calling on the path directly).

Calling on the SV paths directly allows using different parameters for SNPs than SVs (probably necessary). But doing SVs and nested variation at the same time will have a chance at resolving breakpoints.


glennhickey commented 5 years ago

I hadn't thought it through too clearly before typing that, but I agree that the GBWT as-is probably isn't what I'm looking for. I'm open to alternatives; perhaps we can chat soon.

I'll look at the results of just doing it through the xg first to see if it's worth pursuing. I think a general-purpose annotation index for short paths and/or subgraphs would be useful for this and other stuff too.
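
For discussion, one very simple shape such an annotation index could take is an inverted map from node IDs to the names of the short paths that visit them. This is only a sketch under assumptions: the function names, the path names, and the node IDs are made up, and a real index would presumably need to be persistent and handle subgraphs as well as paths.

```python
from collections import defaultdict

def build_node_annotations(paths):
    """Invert {path_name: [node_id, ...]} into {node_id: {path_name, ...}}."""
    index = defaultdict(set)
    for name, node_ids in paths.items():
        for node_id in node_ids:
            index[node_id].add(name)
    return index

def annotate(index, called_nodes):
    """Return the sorted names of all annotated paths touching the called nodes."""
    hits = set()
    for node_id in called_nodes:
        hits |= index.get(node_id, set())
    return sorted(hits)

# Hypothetical SV alt paths keyed by readable name.
sv_paths = {"INS_chr1_5000_1": [205, 206], "DEL_chr1_1000_1": [101, 102]}
node_index = build_node_annotations(sv_paths)
print(annotate(node_index, [206, 300]))  # ['INS_chr1_5000_1']
```

A lookup like this would be enough to attach an input SV name to a call that traverses its nodes, without calling on the path directly.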
