vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

cleanup Alignment object #1587

Open ekg opened 6 years ago

ekg commented 6 years ago

The Alignment object is extremely bloated by annotations that are only relevant to SAM compatibility. I suggest we pack these into the same kind of integer field that is used in the SAM record and remove the numerous boolean and annotation fields that we've added.

The current object requires 256 bytes in memory. This is way too much for what the object is doing.
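For illustration, packing boolean annotations into one SAM-style integer could look like the sketch below. The bit values are the ones defined for the FLAG column in the SAM specification; the `CompactAlignmentFlags` struct and its method names are hypothetical and are not part of vg's Alignment definition. Which of vg's existing boolean fields would map onto which bit is not settled here.

```cpp
#include <cassert>
#include <cstdint>

// Bit values as defined for the FLAG field in the SAM specification.
enum SamFlag : std::uint16_t {
    READ_PAIRED    = 0x1,
    PROPER_PAIR    = 0x2,
    READ_UNMAPPED  = 0x4,
    MATE_UNMAPPED  = 0x8,
    READ_REVERSE   = 0x10,
    MATE_REVERSE   = 0x20,
    FIRST_IN_PAIR  = 0x40,
    SECOND_IN_PAIR = 0x80,
    SECONDARY      = 0x100,
    QC_FAIL        = 0x200,
    DUPLICATE      = 0x400,
    SUPPLEMENTARY  = 0x800
};

// Hypothetical replacement for a group of separate bool fields:
// every boolean annotation shares one 16-bit integer.
struct CompactAlignmentFlags {
    std::uint16_t flags = 0;

    // Set or clear a single flag bit.
    void set(SamFlag f, bool value = true) {
        if (value) {
            flags = static_cast<std::uint16_t>(flags | f);
        } else {
            flags = static_cast<std::uint16_t>(flags & ~f);
        }
    }

    // Check whether a flag bit is set.
    bool test(SamFlag f) const {
        return (flags & f) != 0;
    }
};
```

With this layout, twelve boolean annotations fit in two bytes, and the value can be copied straight into a SAM record's FLAG column on output.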

adamnovak commented 6 years ago

What do we do about bools that we want that aren't in SAM? Or about SAM flags that might not be applicable here?

Do we want to remove all the non-bool annotations? Were they only added as temporary hacks? Or do they represent things that we still sometimes want to annotate alignments with? If so, how will we handle that annotation?

I am thinking about an annotation system for tagging Alignments and MultipathAlignments with features in a more sustainable way, because https://github.com/vgteam/vg/issues/1603 and some of the machine learning stuff I am looking into kind of want to be able to attach their own data to Alignments.
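One possible shape for such an annotation system is a per-record map from string keys to typed values, so that new features can attach data without adding fields to the object. The sketch below assumes C++17. All names in it (`AnnotatedRecord`, `set_annotation`, `get_annotation`) are hypothetical illustrations, not the API that vg provides.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <variant>

// An annotation value may be a flag, a number, or free text.
using AnnotationValue = std::variant<bool, double, std::string>;

// Hypothetical stand-in for an Alignment or MultipathAlignment;
// only the generic annotation storage is shown.
struct AnnotatedRecord {
    std::string name;
    std::map<std::string, AnnotationValue> annotations;
};

// Store a value under a key, replacing any previous value.
// T must be exactly bool, double, or std::string; anything else
// (for example a string literal) is rejected at compile time
// rather than being silently converted.
template <typename T>
void set_annotation(AnnotatedRecord& rec, const std::string& key, T value) {
    rec.annotations.insert_or_assign(
        key, AnnotationValue(std::in_place_type<T>, std::move(value)));
}

// Fetch a value by key. Returns an empty optional if the key is
// absent or holds a different type than the one requested.
template <typename T>
std::optional<T> get_annotation(const AnnotatedRecord& rec,
                                const std::string& key) {
    auto it = rec.annotations.find(key);
    if (it == rec.annotations.end()) {
        return std::nullopt;
    }
    if (const T* v = std::get_if<T>(&it->second)) {
        return *v;
    }
    return std::nullopt;
}
```

A caller would write, for example, `set_annotation(aln, "secondary", true)` and later `get_annotation<bool>(aln, "secondary")`. Records that carry no annotations pay only for an empty map, which is the main appeal over dedicated fields.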

jeizenga commented 6 years ago

While we're thinking about redesigning Alignments, my vote is that the fragment_next and fragment_prev fields be merged into a single field that is just a string instead of another Alignment (i.e. the way I have it in MultipathAlignments). Molecularly, the next/prev distinction is not meaningful. Since we sequence the reverse complement of the other side of the fragment, the opposite read is forward along the genome from the perspective of both reads. I know there are subtle differences between read 1 and read 2 from Illumina machines, but I feel that is such a narrow, platform-dependent feature that it's best left in the read-naming convention if people want to use it.

adamnovak commented 6 years ago

I think we have the next/prev setup so we can do multi-fragment templates. SAM/BAM has support for this; you can express that more than 2 reads are part of the same template molecule. Are we sure we will never want to support a technology that does that?


jeizenga commented 6 years ago

Are we considering this settled and closeable?

adamnovak commented 6 years ago

I think we have an idea of what to do: refactor things that use the in-Alignment annotations into things that use the new annotation system. But we haven't moved over very much stuff yet.