medvedevgroup / TwoPaCo

A fast constructor of the compressed de Bruijn graph from many genomes
Other
39 stars 10 forks source link

Confused about + and - stranded nodes in GFA #25

Open rickbeeloo opened 3 years ago

rickbeeloo commented 3 years ago

Let's use this very simple FASTA:

>seq1
ATATGTCGCTGATCGACTGAAATAGCATCGACTAGCTATCGAT
>seq2
ATATGTCGCTGATCGACTGAATAGTGAAATAGCATCGACTAGC
>seq3
ATATGTCGCTGATCGACTTTTTTTTGAAATAGCATCGACTAGC

Then we construct the graph: ./twopaco -k 15 -f 16 test.fa -o graph and convert it to GFA: graphdump -k 15 -f gfa2 -s test.fa graph > graph.gfa:

H       VN:Z:2.0
S       36      18      ATATGTCGCTGATCGACT
F       36      seq1+   0       18$     0       18      15M
S       24      18      TTCAGTCGATCAGCGACA
F       24      seq1-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       14      26      GTCGATGCTATTTCAGTCGATCAGCG
F       14      seq1-   0       26$     6       32      15M
E       24-     14-     0       15      11      26$     15M
S       11      19      TGAAATAGCATCGACTAGC
F       11      seq1+   0       19$     17      36      15M
E       14-     11+     0       15      0       15      15M
S       19      22      ATAGCATCGACTAGCTATCGAT
F       19      seq1+   0       22$     21      43$     15M
E       11+     19+     4       19$     0       15      15M
O       seq1p   36+ 24- 14- 11+ 19+
F       36      seq2+   0       18$     0       18      15M
F       24      seq2-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       13      33      GTCGATGCTATTTCACTATTCAGTCGATCAGCG
F       13      seq2-   0       33$     6       39      15M
E       24-     13-     0       15      18      33$     15M
F       11      seq2+   0       19$     24      43$     15M
E       13-     11+     0       15      0       15      15M
O       seq2p   36+ 24- 13- 11+
F       36      seq3+   0       18$     0       18      15M
S       12      36      GTCGATGCTATTTCAAAAAAAAGTCGATCAGCGACA
F       12      seq3-   0       36$     3       39      15M
E       36+     12-     3       18$     21      36$     15M
F       11      seq3+   0       19$     24      43$     15M
E       12-     11+     0       15      0       15      15M
O       seq3p   36+ 12- 11+

When we look at the paths we have:


seq1p   36+ 24- 14- 11+ 19+
seq2p   36+ 24- 13- 11+
seq3p   36+ 12- 11+

We can only reconstruct the sequence from the GFA by taking the reverse complement of - nodes. When we look at the paths all nodes are on the same strand (i.e. all - or all +), for example, all 24 nodes are -. So why weren't these just all recorded as +?

iminkin commented 3 years ago

Indeed, for each node in the graph, we can either choose either its positive or negative strand as the "reference" representation in the GFA file. In the example you provided, it would be better to have all of them be positive for the sake of readability. However, the algorithm sometimes chooses the negative one for its internal reasons. For example, sometimes it is handy to select as the reference the strand that is lexicographically smaller or has a smaller value of a hash function: it helps to keep track of the different copies of the node in different directions. Hope this helps.

rickbeeloo commented 3 years ago

Similar to Sibelia(Z) we want to compare similarity of nodes - we want to extend on both sides of a target node and compare the sequences of the context. We know about the C++ API around the twopaco graph, however, we are not familiar with C++ hence we looked for a graphdump format that we could parse. You think the GFA is suited for this (regardless of those strands) or could we better use the .dot for example?

iminkin commented 3 years ago

Honestly, I think the best format suited for your purpose is the junction list format. SibeliaZ basically works using some form of it, and it is the simplest possible (at least from my point of view) representation of the graph.

rickbeeloo commented 3 years ago

Hi @iminkin I took a close look at this however I'm pretty confused about how we could exactly use this to find shared regions between multiple genomes? Let's take a really simple example:

>1
TGGCACTTCTTTGCTAGCT
>2
TGGCACTTCAAAGCTAGCT

When aligned:

1                  1 TGGCACTTCTTTGCTAGCT     19
                     |||||||||...|||||||
2                  1 TGGCACTTCAAAGCTAGCT     19

Then i created a graph using: twopaco -k 3 -f 16 test.fasta -o graph and got the .seq file with graphdump:

0 0 5
0 2 2
0 5 6
0 6 4
0 8 6
0 10 -1
0 11 -2
0 12 -3
0 15 3
0 16 -3
1 0 5
1 2 2
1 5 6
1 6 4
1 8 1
1 10 -6
1 11 3
1 12 -3
1 15 3
1 16 -3

When we get the junction paths:

5, 2, 6, 4, 6, -1, -2, -3, 3, -3 5, 2, 6, 4, 1, -6, 3, -3, 3, -3

We see they differ at indices 8 till 11 which is the 3-mer difference. So can we then say as long as two sequences have the same array of junctions they are the same sequence (like in the bold above)?

iminkin commented 3 years ago

Yes, the whole point of the compacted graph is to compress identical substrings of length at least k into integers. Those integers are much easier to index and compare: if you want to see if two substrings are equal just check if they consist of the same junctions.