nanoporetech / pore-c

Pore-C support

graph-based concatemer alignment #53

Open callumparr opened 2 years ago

callumparr commented 2 years ago

I was just reading your nice nature biotech paper and it has more detail now about what the pore-c tools do. I was interested in this graph based concatemer alignment step.

I was wondering why it has to be so complicated. Once you have the sub-read alignments, filtered out low mapping scores, and kept one alignment where multiple alignments overlap the same portion of the read, why can you not build up a list of contacts from this?

For completeness, I guess you should make sure that the start and end of each alignment are demarcated in the genome by a restriction site. Although I guess gDNA could be randomly fragmented in situ and still undergo proximity ligation, albeit less efficiently than the RE-mediated complementary-overhang ligation. I cannot understand what this graphing process does or why it is needed.

Graph-based concatemer alignment
We developed an algorithm for Pore-C read alignment with the goal of identifying the most likely set of monomers comprising a given concatemer. Because each Pore-C read is a chimera of (potentially) distant monomeric sequences, the processing of these reads requires performing a multiway split alignment. The goal of our split alignment algorithm was to identify a set of minimally overlapping alignments that maximally cover the read sequence.

In this algorithm, raw sequencing data are initially aligned with bwa bwasw with parameters -b 5 -q 2 -r 1 -T 15 -z 10, allowing supplementary and secondary alignments for each read. We remove low mapping quality (mapping quality = 0) alignments. In addition, for each set of alignments with identical positions on the read, we take the single highest scoring alignment. The remaining alignments are used to build a weighted directed acyclic graph whose optimal traversal determines the set of final, filtered alignments (that is, monomers) associated with the read (that is, concatemer).
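
As a rough sketch of the pre-filtering described above (not the actual pore-c code), assuming each alignment is available as a record carrying its mapping quality, alignment score (AS tag), and start/end coordinates on the read, the step could look like this:

```python
from collections import defaultdict

def prefilter_alignments(alignments):
    """Sketch of the pre-filtering step: drop MAPQ 0 alignments, then keep
    only the highest-scoring alignment among those that share identical
    coordinates on the read. Each alignment is assumed to be a dict with
    keys 'mapq', 'score' (the AS tag), 'read_start' and 'read_end'."""
    # Remove low mapping quality alignments (mapping quality == 0).
    kept = [a for a in alignments if a["mapq"] > 0]

    # Group alignments by their interval on the read.
    by_interval = defaultdict(list)
    for a in kept:
        by_interval[(a["read_start"], a["read_end"])].append(a)

    # For each identical read interval, retain the single highest-scoring alignment.
    return [max(group, key=lambda a: a["score"]) for group in by_interval.values()]
```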

The nodes of this directed acyclic graph G = (V, E) are composed of the alignments as well as two additional nodes (START, END) representing the start and end, respectively, of the read. Each node a ∈ V is associated with a start s(a) and end e(a) position on the read. Each alignment a ∈ V is also associated with an alignment score AS(a). Directed edges (a, b) ∈ E connect an alignment pair a, b ∈ V if e(a) < e(b) or otherwise if e(a) = e(b) and s(a) < s(b). Additional directed edges connect START to each alignment node and each alignment node to END. The resulting graph structure has a topological ordering from the START to the END node.

Each edge (a, b) ∈ E is assigned a weight w((a, b)) as a function of the alignment score and start and end properties of the associated nodes. For edges (a, b), where a is the START node, w((a, b)) = s(b) − AS(b). For edges (a, b), where b is the END node, w((a, b)) = s(b) − e(a). Finally, for all remaining edges (a, b), where both a and b are alignments, w((a, b)) = ∣e(a) − s(b)∣ − AS(b). The Bellman–Ford algorithm is then applied to this weighted graph (G, w) to identify the minimal scoring path from the START to END node.
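
For concreteness, here is a minimal sketch of the graph construction and path search described above, assuming networkx and the hypothetical alignment records from the previous sketch; it is not the pore-c implementation, and s(END) is assumed to equal the read length:

```python
import networkx as nx

def optimal_monomer_path(alignments, read_length):
    """Build the weighted DAG described above and return the alignments on the
    minimum-weight START -> END path. Alignments are the hypothetical dicts
    from the previous sketch; s(END) is taken to be the read length."""
    # Sorting by (end, start) on the read yields the order in which edges are
    # directed: e(a) < e(b), or e(a) = e(b) and s(a) < s(b).
    nodes = sorted(alignments, key=lambda a: (a["read_end"], a["read_start"]))

    g = nx.DiGraph()
    for i, b in enumerate(nodes):
        # START -> b: w = s(b) - AS(b), penalizing an unaligned read prefix.
        g.add_edge("START", i, weight=b["read_start"] - b["score"])
        # b -> END: w = s(END) - e(b), penalizing an unaligned read suffix.
        g.add_edge(i, "END", weight=read_length - b["read_end"])
        # a -> b for every earlier alignment a: w = |e(a) - s(b)| - AS(b),
        # penalizing gaps or overlaps between consecutive monomers.
        for j, a in enumerate(nodes[:i]):
            g.add_edge(j, i, weight=abs(a["read_end"] - b["read_start"]) - b["score"])

    # Bellman-Ford handles the possibly negative edge weights; the graph is
    # acyclic, so there are no negative cycles.
    path = nx.bellman_ford_path(g, "START", "END", weight="weight")
    return [nodes[i] for i in path if isinstance(i, int)]
```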

In this formulation, edges between overlapping or distant alignment pairs are penalized. Similarly, edges connecting to low-scoring alignments will receive a higher weight. As a result, the minimally weighted path, as computed by the Bellman–Ford algorithm, will identify a set of minimally overlapping high-quality alignments that maximally cover the read. These are returned as the set of monomers associated with the given read/concatemer.
imielinski commented 2 years ago

Thanks for the deep read of the paper, and for starting this interesting discussion. It may indeed be possible to simplify the alignment algorithm.

The key issue is that the four-cutter protocol produces short fragments (300-400 bp) that create significant alignment ambiguity, including alignments that nest and overlap on the concatemer query. The goal is to choose a high-quality and parsimonious set of alignments that span the query; this is one approach for achieving that. There may be others.

Alignment ambiguity would be much less of an issue with longer restriction fragments (e.g., a six-cutter, which we tried) and/or higher-fidelity reads. However, the longer restriction fragments we tried reduced the contact cardinality/order, among other things, and so were suboptimal for our application. A six-cutter protocol might work better for de novo assembly, where accurate mapping, particularly of repetitive sequences, is a priority.

callumparr commented 2 years ago

Hey, thank you for the reply. It helps me understand the process behind the data processing steps.

callumparr commented 2 years ago

Sorry, I reopened this as I thought of something else.

Would multi-contact reads with a bridge adapter in between contacts aid in this? What if we could split the reads wherever we find an internal adapter and align the sub-fragments as independent reads to the reference? Disregarding the occasional ligation that occurs via blunt ligation rather than being mediated by the bridge adapter, each sub-fragment could then be aligned in a linear fashion.
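
As a rough illustration of that idea (hypothetical, not part of pore-c; the adapter sequence and minimum fragment length below are placeholders), splitting at exact internal adapter matches might look like:

```python
import re

# Placeholder bridge-adapter sequence for illustration; a real protocol would
# define its own, and exact string matching ignores sequencing errors within
# the adapter.
BRIDGE_ADAPTER = "ACGTACGTACGT"

def split_at_adapter(read_seq, adapter=BRIDGE_ADAPTER, min_len=50):
    """Split a read at every internal occurrence of the adapter and return the
    sub-fragments, dropping pieces shorter than min_len."""
    fragments = re.split(re.escape(adapter), read_seq.upper())
    return [f for f in fragments if len(f) >= min_len]

# Each returned fragment could then be aligned to the reference independently,
# sidestepping the split-alignment step for adapter-mediated junctions.
```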