feature wishlist - Githubissues

ekg commented 4 years ago

Here are some things that I'd love to see in GraphAligner:

Non-GAM textual output (GAF?) to simplify downstream processing of alignments
Independent aligner library (to embed in other tools)
Seeding based on minimizers from paths embedded in the graph, possibly via a GBWT index
Optimizations for short reads and chromosome scale contigs
Building independently from conda (with or without GAM/.vg support)

I'd love to help you implement these. Please let me know how I can help.

maickrau commented 4 years ago

Non-GAM textual output (GAF?) to simplify downstream processing of alignments

This sounds totally reasonable.

Independent aligner library (to embed in other tools)

I'd like to do this as well. It will take some refactoring so I'm not sure about the timeline.

Seeding based on minimizers from paths embedded in the graph, possibly via a GBWT index

This needs some way of picking the paths. Embedding the paths in the graph would work, it just needs to get that information from reading the GFA to building the minimizer index. Once we have the paths as vectors of node IDs the current minimizer code could be used with some changes.

Optimizations for short reads and chromosome scale contigs

This needs the path based indexing first. The condition for clipping the alignment will need some work as well. A good start would be to just get an example graph and dataset and look at how exactly it breaks for short reads.

Building independently from conda (with or without GAM/.vg support)

Could you give an example use case for this?

ekg commented 4 years ago

Non-GAM textual output (GAF?) to simplify downstream processing of alignments

This sounds totally reasonable.

I saw you just released this. Nice! You're the second implementation. I'll add output for this to vg map and we can have three.

In the future, I'd like to work on a binary version of this, hopefully with more flexibility than SAM/BAM (the crisis over long CIGAR strings comes to mind) and ideally considering support for graph to graph alignment (this requires the query and targets to both be paths through graphs).

There are some remaining points of confusion for me about GAF. For instance, how do we encode structural variation, such as where we break an alignment mid-node and restart it far away? I think we might have to use the hard clip operators in the CIGAR strings to skip over parts of reference sequence nodes, listing them in order. This could produce very weird results when the nodes are extremely long. This is a little easier to deal with in the model where each node has its own alignment description, including a start position and orientation on the node. It's not too late for us to go down this route, and I think it has significant benefits for whole genome alignments. CC @lh3

Seeding based on minimizers from paths embedded in the graph, possibly via a GBWT index

This needs some way of picking the paths. Embedding the paths in the graph would work, it just needs to get that information from reading the GFA to building the minimizer index. Once we have the paths as vectors of node IDs the current minimizer code could be used with some changes.

Right. Realistically, this would have to happen over paths embedded in the graph. The GBWT could be used externally to support that. Also, @jltsiren has a tool that generates a covering path set of the graph. That could be useful when you don't have paths handy, to allow clustering to run across nodes.

Building independently from conda (with or without GAM/.vg support)

Could you give an example use case for this?

It would probably simplify packaging and future support. I've been learning about GNU Guix, which is an interesting alternative to conda that seems somewhat more future-resistant.

lh3 commented 4 years ago

For instance, how do we encode structural variation, such as where we break an alignment mid-node and restart it far away?

Like in SAM and PAF, you create two or more lines for one query sequence.

ekg commented 4 years ago

That works, but encourages information loss and ambiguity. I would prefer that SVs that are discovered directly by alignment should be described in a single line.

A whole genome aligner will make such alignments. Where the alignment is broken at SVs it might make sense to emit multiple lines, but that's not how all methods work.

There is a difference between a collection of alignments between sequences and graphs and an optimal alignment fully covering a sequence or graph query. Breaking up the alignment into fragments makes it difficult to communicate this.

rlorigro commented 4 years ago

    Building independently from conda (with or without GAM/.vg support)

Could you give an example use case for this?

Can I add that conda is sometimes incompatible with existing installations of other software? In particular, conda does not provide functional linking for the pybind11 C++ library, which is required for Shasta to build properly. So GraphAligner and Shasta cannot easily coexist on the same machine.

And for anyone wanting to contribute to this project, building with conda is not ideal.

ekg commented 4 years ago

We have a guix package model to build it. I think it is in my guix-genomics repo.

On Mon, Mar 23, 2020, 19:21 Ryan Lorig-Roach notifications@github.com wrote:

Building independently from conda (with or without GAM/.vg support)
Could you give an example use case for this?

Can I add that conda is sometimes incompatible with existing installations of other software? In particular, conda does not provide functional linking for the pybind11 C++ library, which is required for Shasta to build properly. So GraphAligner and Shasta cannot easily coexist on the same machine.

And for anyone wanting to contribute to this project, building with conda is not ideal.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/maickrau/GraphAligner/issues/9#issuecomment-602774225, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEK5GBX6DG6PYF2QWQTRI6SEFANCNFSM4JIQVNYQ .

ptrebert commented 3 years ago

@maickrau I believe...

Optimizations for [...] chromosome scale contigs

... is what we just discussed as "should be possible to align whole genomes w/o chopping them up into smaller pieces first"

hgibling commented 2 years ago

Seeding based on minimizers from paths embedded in the graph, possibly via a GBWT index

This needs some way of picking the paths. Embedding the paths in the graph would work, it just needs to get that information from reading the GFA to building the minimizer index. Once we have the paths as vectors of node IDs the current minimizer code could be used with some changes.

In its current form does GraphAligner utilize the embedded path information in gfa files to assist in read alignment?

maickrau commented 2 years ago

The current version does not use the embedded paths.

Kinggerm commented 1 year ago

@maickrau Thank you for developing this awesome aligner! I wonder if there is a way to generate multiple hits besides the best alignment. If not, is it possible to include that in GraphAligner? It will be very helpful for some studies.

Best, Jianjun

danydoerr commented 1 year ago

@Kinggerm GraphAligner already provides this feature. Secondary alignments are output by setting the parameter --multimap-score-fraction.

Kinggerm commented 1 year ago

@Kinggerm GraphAligner already provides this feature. Secondary alignments are output by setting the parameter --multimap-score-fraction.

Uh, I should have known. I'm sorry that I misunderstood it. Thank you very much for the comment.

maickrau / GraphAligner

feature wishlist #9