Bubble removal and unitig construction while retaining original coordinates

vgteam / vg

tools for working with genome variation graphs

https://biostars.org/tag/vg/

Other

1.09k stars 193 forks source link

Bubble removal and unitig construction while retaining original coordinates #2979

Open rickbeeloo opened 3 years ago

rickbeeloo commented 3 years ago

I'm aware questions are preferably answered at Biostars, however, I saw older questions not being addressed there while being here. If this is not convenient just notify me and I will immediately transfer it to Biostars!

Simple example We created three simple sequences of which two are identical except an SNP, and one has a new subsequence. We constructed the GFA using SeqWish after aligning them using Minimap2: We used GFAtools (gfatools asm -b 10 -u) to remove bubbles and hereafter merge the nodes to unitigs.

Question As GFAtools is based on Miniasm, and thus "genome assembly", it does not (yet) remember the original paths. Hence, in the unitig graph, we are unable to tell the locations of the nodes/fragments in the input genomes (which we are interested in). Therefore we wonder whether/how we can pop bubbles and get the unitigs using VG? We took a look at vg simplify however this seems to require BED and VCF files instead of our GFA.

ekg commented 3 years ago

gfatools appears to be unable to record or work with the embedded paths. In this case there is a simple workaround. Map the unitigs back into the original graph. You can then annotate the positions of all the nodes in the original graph with xg's GFA output (https://github.com/vgteam/xg).

The trivial way to do this is to add the unitigs to your source FASTA when you're running seqwish. Align them to the other sequences as well. Then, when you re-induce the graph, the unitigs will be embedded and you'll be able to get the positions of the bubbles relative to them.

I think this will do what you want:

xg -g input.gfa -G >position_annotated.gfa

rickbeeloo commented 3 years ago

Thank you @ekg for the quick response!

The simplified data

source.gfa (= left in the previous figure)

H   VN:Z:1.0
S   1   ATCGACTGACACGATCGACTA
S   2   C
S   3   GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG
S   4   ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S   5   G
S   6   TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
L   1   +   2   +   0M
L   1   +   5   +   0M
L   2   +   3   +   0M
L   3   +   4   +   0M
L   5   +   3   +   0M
L   6   +   4   +   0M
P   genome1 1+,2+,3+,4+ *,*,*
P   genome2 1+,5+,3+,4+ *,*,*
P   genome3 6+,4+   *

unitig.gfa (= right in the previous figure)

S   utg0000001l ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG  LN:i:74 RC:i:3  lc:i:74
A   utg0000001l 0   +   1   0   21
A   utg0000001l 21  +   5   0   1
A   utg0000001l 22  +   3   0   52
S   utg0000002l ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC   LN:i:289    RC:i:1  lc:i:289
A   utg0000002l 0   +   4   0   289
S   utg0000003l TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC  LN:i:110    RC:i:1  lc:i:110
A   utg0000003l 0   +   6   0   110
L   utg0000001l +   utg0000002l +   0M  L1:i:74 L2:i:289
L   utg0000002l -   utg0000003l -   0M  L1:i:289    L2:i:110

Mapping attempt I was not sure where I'm supposed to use the xg command you provided earlier, but this is what I tried now for the mapping:

# GFA to vg
vg view -F source.gfa -v > source.vg

# Index
vg index -x source_index.gx -g source_index.gcsa source.vg

# Unitig graph to FASTA
gfatools gfa2fa unitig.gfa > query.fasta

# Query the FASTA
vg map -F query.fasta -M 10 -x source_index.gx -g source_index.gcsa -j

However, this is different from what I would expect. For example, let's look at the json for utg0000001l it only returns the matching nodes 1, 5, 3 for genome2 but not for genome1 (although it is mentioned in refpos: of course 1 and 3 overlap but 2 is absent from the output).

{
   "identity":1.0,
   "mapping_quality":60,
   "name":"utg0000001l",
   "path":{
      "mapping":[
         {
            "edit":[
               {
                  "from_length":21,
                  "to_length":21
               }
            ],
            "position":{
               "node_id":"1"
            },
            "rank":"1"```
         },
         {
            "edit":[
               {
                  "from_length":1,
                  "to_length":1
               }
            ],
            "position":{
               "node_id":"5"
            },
            "rank":"2"
         },
         {
            "edit":[
               {
                  "from_length":52,
                  "to_length":52
               }
            ],
            "position":{
               "node_id":"3"
            },
            "rank":"3"
         }

Questions

Am I missing something here as I thought node 2 from genome1 of source.gfa should also be included in the mapping?
Where exactly did you mean to use the xg command you provided?

Thank you in advance!

ekg commented 3 years ago

Take the sequences that you used as input and the sequences of the unitigs.

Make sure they have unique names.

Concatenate them into a single file (s.fa).

Map them all together with minimap2 -X s.fa s.fa >s.paf.

Induce the variation graph with seqwish.

seqwish -s s.fa -p s.paf -g s.gfa

Index the graph using xg to get the coordinates of each node relative to all overlapping paths.

xg -g s.gfa -G >a.gfa

Now in a.gfa you will have a coordinate descriptor tag in JSON for each node. You can use this to find the coordinates of the bubbles versus your unitigs.

On Tue, Sep 15, 2020, 14:34 rickbeeloo notifications@github.com wrote:

The simplified data

source.gfa (= left in the previous figure)

H VN:Z:1.0 S 1 ATCGACTGACACGATCGACTA S 2 C S 3 GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG S 4 ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 5 G S 6 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC L 1 + 2 + 0M L 1 + 5 + 0M L 2 + 3 + 0M L 3 + 4 + 0M L 5 + 3 + 0M L 6 + 4 + 0M P genome1 1+,2+,3+,4+ ,, P genome2 1+,5+,3+,4+ ,, P genome3 6+,4+ *

unitig.gfa (= right in the previous figure)

S utg0000001l ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG LN:i:74 RC:i:3 lc:i:74 A utg0000001l 0 + 1 0 21 A utg0000001l 21 + 5 0 1 A utg0000001l 22 + 3 0 52 S utg0000002l ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC LN:i:289 RC:i:1 lc:i:289 A utg0000002l 0 + 4 0 289 S utg0000003l TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC LN:i:110 RC:i:1 lc:i:110 A utg0000003l 0 + 6 0 110 L utg0000001l + utg0000002l + 0M L1:i:74 L2:i:289 L utg0000002l - utg0000003l - 0M L1:i:289 L2:i:110

Mapping attempt I was not sure where I'm supposed to use the xg command you provided earlier, but this is what I tried now for the mapping:

GFA to vg

vg view -F source.gfa -v > source.vg

Index

vg index -x source_index.gx -g source_index.gcsa source.vg

Unitig graph to FASTA

gfatools gfa2fa unitig.gfa > query.fasta

Query the FASTA

vg map -F query.fasta -M 10 -x source_index.gx -g source_index.gcsa -j

However, this is different from what I would expect. For example, let's look at the json for utg0000001l it only returns the matching nodes 1, 5, 3 for genome2 but not for genome1 (of course 1 and 3 overlap but 2 is absent from the output). [image: image] https://user-images.githubusercontent.com/19516376/93210373-c490d000-f75f-11ea-82c0-491b80f97889.png

{ "identity":1.0, "mapping_quality":60, "name":"utg0000001l", "path":{ "mapping":[ { "edit":[ { "from_length":21, "to_length":21 } ], "position":{ "node_id":"1" }, "rank":"1"``` }, { "edit":[ { "from_length":1, "to_length":1 } ], "position":{ "node_id":"5" }, "rank":"2" }, { "edit":[ { "from_length":52, "to_length":52 } ], "position":{ "node_id":"3" }, "rank":"3" } ] }, "refpos":[ { "name":"genome1" }, { "name":"genome2" } ], "score":84, "sequence":"ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG", "time_used":351.0 }

Questions

Am I missing something here? I thought the -M parameter would allow more than one match and thereby include genome1 too.

Where exactly did you mean to use the xg command you provided?

Thank you in advance!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2979#issuecomment-692684924, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEJZCUK4CXAIVAEI2JDSF5NNZANCNFSM4RLXDCOQ .

ekg commented 3 years ago

By the way, you can probably apply vg deconstruct at the end rather than xg. Then you will get a VCF. You would specify the unitig names as references. The xg route is just very simple.

On Tue, Sep 15, 2020, 15:30 Erik Garrison erik.garrison@gmail.com wrote:

Take the sequences that you used as input and the sequences of the unitigs.

Make sure they have unique names.

Concatenate them into a single file (s.fa).

Map them all together with minimap2 -X s.fa s.fa >s.paf.

Induce the variation graph with seqwish.

seqwish -s s.fa -p s.paf -g s.gfa

Index the graph using xg to get the coordinates of each node relative to all overlapping paths.

xg -g s.gfa -G >a.gfa

Now in a.gfa you will have a coordinate descriptor tag in JSON for each node. You can use this to find the coordinates of the bubbles versus your unitigs.

On Tue, Sep 15, 2020, 14:34 rickbeeloo notifications@github.com wrote:

The simplified data

source.gfa (= left in the previous figure)

H VN:Z:1.0 S 1 ATCGACTGACACGATCGACTA S 2 C S 3 GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG S 4 ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 5 G S 6 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC L 1 + 2 + 0M L 1 + 5 + 0M L 2 + 3 + 0M L 3 + 4 + 0M L 5 + 3 + 0M L 6 + 4 + 0M P genome1 1+,2+,3+,4+ ,, P genome2 1+,5+,3+,4+ ,, P genome3 6+,4+ *

unitig.gfa (= right in the previous figure)

S utg0000001l ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG LN:i:74 RC:i:3 lc:i:74 A utg0000001l 0 + 1 0 21 A utg0000001l 21 + 5 0 1 A utg0000001l 22 + 3 0 52 S utg0000002l ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC LN:i:289 RC:i:1 lc:i:289 A utg0000002l 0 + 4 0 289 S utg0000003l TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC LN:i:110 RC:i:1 lc:i:110 A utg0000003l 0 + 6 0 110 L utg0000001l + utg0000002l + 0M L1:i:74 L2:i:289 L utg0000002l - utg0000003l - 0M L1:i:289 L2:i:110

Mapping attempt I was not sure where I'm supposed to use the xg command you provided earlier, but this is what I tried now for the mapping:

GFA to vg

vg view -F source.gfa -v > source.vg

Index

vg index -x source_index.gx -g source_index.gcsa source.vg

Unitig graph to FASTA

gfatools gfa2fa unitig.gfa > query.fasta

Query the FASTA

vg map -F query.fasta -M 10 -x source_index.gx -g source_index.gcsa -j

However, this is different from what I would expect. For example, let's look at the json for utg0000001l it only returns the matching nodes 1, 5, 3 for genome2 but not for genome1 (of course 1 and 3 overlap but 2 is absent from the output). [image: image] https://user-images.githubusercontent.com/19516376/93210373-c490d000-f75f-11ea-82c0-491b80f97889.png

{ "identity":1.0, "mapping_quality":60, "name":"utg0000001l", "path":{ "mapping":[ { "edit":[ { "from_length":21, "to_length":21 } ], "position":{ "node_id":"1" }, "rank":"1"``` }, { "edit":[ { "from_length":1, "to_length":1 } ], "position":{ "node_id":"5" }, "rank":"2" }, { "edit":[ { "from_length":52, "to_length":52 } ], "position":{ "node_id":"3" }, "rank":"3" } ] }, "refpos":[ { "name":"genome1" }, { "name":"genome2" } ], "score":84, "sequence":"ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG", "time_used":351.0 }

Questions

Am I missing something here? I thought the -M parameter would allow more than one match and thereby include genome1 too.

Where exactly did you mean to use the xg command you provided?

Thank you in advance!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2979#issuecomment-692684924, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEJZCUK4CXAIVAEI2JDSF5NNZANCNFSM4RLXDCOQ .

rickbeeloo commented 3 years ago

Hi @ekg, thanks again! I ran the exact steps you suggested however some unexpected things appear :(

I first created the new FASTA, with unitigs and the original sequences, as you suggested: 1. cat query.fasta source.fa > s.fa Which thus looks like this:

>utg0000001l
ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG
>utg0000002l
ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>utg0000003l
TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
>genome1
ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>genome2
ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>genome3
TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC

2. minimap2 -X s.fa s.fa >s.paf (not as expected)

The s.paf content:

genome1 363 24  356 +   genome2 363 24  356 332 332 0   tp:A:S  cm:i:58 s1:i:332    dv:f:0.0011 rl:i:0
genome1 363 83  356 +   utg0000002l 289 9   282 273 273 0   tp:A:S  cm:i:46 s1:i:273    dv:f:0  rl:i:0
genome1 363 83  356 +   genome3 399 119 392 273 273 0   tp:A:S  cm:i:46 s1:i:273    dv:f:0.0014 rl:i:0
genome1 363 24  74  +   utg0000001l 74  24  74  50  50  0   tp:A:S  cm:i:9  s1:i:50 dv:f:0.0070 rl:i:0
genome2 363 83  356 +   utg0000002l 289 9   282 273 273 0   tp:A:S  cm:i:46 s1:i:273    dv:f:0  rl:i:0
genome2 363 83  356 +   genome3 399 119 392 273 273 0   tp:A:S  cm:i:46 s1:i:273    dv:f:0.0014 rl:i:0
genome2 363 4   74  +   utg0000001l 74  4   74  70  70  0   tp:A:S  cm:i:12 s1:i:70 dv:f:0  rl:i:0
genome3 399 119 392 +   utg0000002l 289 9   282 273 273 0   tp:A:S  cm:i:46 s1:i:273    dv:f:0  rl:i:0
genome3 399 2   102 +   utg0000003l 110 2   102 100 100 0   tp:A:S  cm:i:14 s1:i:100    dv:f:0  rl:i:0

We know that genome1 and genome2 are identical except for one SNP. utg0000001l should cover positions 0-75 of both genome1 and genome2. In fact, utg0000001l is identical to positions 0-75 of genome2. However, upon inspection of this output we notice that neither of the genomes is correctly covered (does not start at 0) nor are they equally aligned (different start positions). Notably, the SNP is at position 22 probably influencing this start position difference here too, however, if I understood correctly, the goal of this approach was to actually make the unitig cover the exact parts it was originally derived from. Likewise, we know that utg0000003l should start at position 0 on genome3 but it starts at position 2.

I could resolve this by setting the -w (minimizer window size) parameter to 2: minimap2 -X -w 2 s.fa s.fa >s.paf:

genome1 363 0   362 +   genome2 363 0   362 359 362 0   tp:A:S  cm:i:223    s1:i:359    dv:f:0.0032 rl:i:0
genome1 363 75  362 +   genome3 399 111 398 287 287 0   tp:A:S  cm:i:184    s1:i:287    dv:f:0.0004 rl:i:0
genome1 363 75  362 +   utg0000002l 289 1   288 287 287 0   tp:A:S  cm:i:184    s1:i:287    dv:f:0  rl:i:0
genome1 363 0   74  +   utg0000001l 74  0   74  71  74  0   tp:A:S  cm:i:29 s1:i:71 dv:f:0.0214 rl:i:0
genome2 363 75  362 +   genome3 399 111 398 287 287 0   tp:A:S  cm:i:184    s1:i:287    dv:f:0.0004 rl:i:0
genome2 363 75  362 +   utg0000002l 289 1   288 287 287 0   tp:A:S  cm:i:184    s1:i:287    dv:f:0  rl:i:0
genome2 363 0   74  +   utg0000001l 74  0   74  74  74  0   tp:A:S  cm:i:39 s1:i:74 dv:f:0  rl:i:0
genome3 399 111 398 +   utg0000002l 289 1   288 287 287 0   tp:A:S  cm:i:184    s1:i:287    dv:f:0  rl:i:0
genome3 399 0   110 +   utg0000003l 110 0   110 110 110 0   tp:A:S  cm:i:68 s1:i:110    dv:f:0  rl:i:0

However, when resolving the unitigs heavily depends on the parameters I do wonder how robust this approach is when extending beyond thousands of genomes (which is our plan) where unitigs can be very small till almost the sizes of whole genomes.

3. seqwish -s s.fa -p s.paf -g s.gfa With s.gfa looking like:

H   VN:Z:1.0
S   1   ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG
S   2   ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S   3   TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
S   4   ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S   5   ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S   6   TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
P   utg0000001l 1+  *
P   utg0000002l 2+  *
P   utg0000003l 3+  *
P   genome1 4+  *
P   genome2 5+  *
P   genome3 6+  *

Here we see that each sequence gets its own node rather than the overlap it should have?

4. xg -g s.gfa -G >a.gfa With a.gfa:

H   VN:Z:1.0
S   1   ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG  PO:J:[["utg0000001l","+",0]]
S   2   ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC   PO:J:[["utg0000002l","+",0]]
S   3   TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC  PO:J:[["utg0000003l","+",0]]
S   4   ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC PO:J:[["genome1","+",0]]
S   5   ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC PO:J:[["genome2","+",0]]
S   6   TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC PO:J:[["genome3","+",0]]

rickbeeloo commented 3 years ago

Sorry just realized you did not provide -c for the minimap2 command in your example and as you recently noticed (https://github.com/ekg/seqwish/issues/61) it didn't trigger a warning for me. I will report back after I run minimap -c!

ekg commented 3 years ago

You need to tell minimap2 to produce cigars to use it in seqwish. Add -c to the mapping command. Otherwise there is no way to know what base pairs map to others.

On Wed, Sep 16, 2020, 11:13 rickbeeloo notifications@github.com wrote:

Hi @ekg https://github.com/ekg, thanks again! I ran the exact steps you suggested however some unexpected things appear :(

I first created the new FASTA, with unitigs and the original sequences, as you suggested: 1. cat query.fasta source.fa > s.fa Which thus looks like this:

utg0000001l ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCG ACATCGATCGACTG utg0000002l ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCA TCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACT GACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCT ATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCG ACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC utg0000003l TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGC TACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC genome1 ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC genome2 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC genome3 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC

2. minimap2 -X s.fa s.fa >s.paf (not as expected)

The s.paf content:

genome1 363 24 356 + genome2 363 24 356 332 332 0 tp:A:S cm:i:58 s1:i:332 dv:f:0.0011 rl:i:0 genome1 363 83 356 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome1 363 83 356 + genome3 399 119 392 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0.0014 rl:i:0 genome1 363 24 74 + utg0000001l 74 24 74 50 50 0 tp:A:S cm:i:9 s1:i:50 dv:f:0.0070 rl:i:0 genome2 363 83 356 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome2 363 83 356 + genome3 399 119 392 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0.0014 rl:i:0 genome2 363 4 74 + utg0000001l 74 4 74 70 70 0 tp:A:S cm:i:12 s1:i:70 dv:f:0 rl:i:0 genome3 399 119 392 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome3 399 2 102 + utg0000003l 110 2 102 100 100 0 tp:A:S cm:i:14 s1:i:100 dv:f:0 rl:i:0

We know that genome1 and genome2 are identical except for one SNP. utg0000001l should cover positions 1-74 of both genome1 and genome2. In fact, utg0000001l is identical to positions 1-74 of genome2. However, upon inspection of this output we notice that neither of the genomes is correctly covered (does not start at 1) nor are they equally aligned (different start positions):

genome1 363 24 74 + utg0000001l genome2 363 4 74 + utg0000001l

Perhaps parameter settings can fix this or I can make an issue on the Github of Heng Li, however, I then wonder how reliable this approach is, especially since utg0000001l is even identical to the first part of genome2. Moreover, we plan to extend this to thousands of genomes.

I just continued regardless just to see what the GFA would look like in the end.

3. seqwish -s s.fa -p s.paf -g s.gfa With s.gfa looking like:

H VN:Z:1.0 S 1 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG S 2 ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 3 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC S 4 ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 5 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 6 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC P utg0000001l 1+ P utg0000002l 2+ P utg0000003l 3+ P genome1 4+ P genome2 5+ P genome3 6+

Here we see that each sequence gets its own node rather than the overlap it should have.

Any suggestions?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2979#issuecomment-693280172, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEMVSNQB62W7T4OGXRLSGB6TDANCNFSM4RLXDCOQ .

ekg commented 3 years ago

This is why you get a graph with a separate node for each contig.

On Wed, Sep 16, 2020, 14:38 Erik Garrison erik.garrison@gmail.com wrote:

You need to tell minimap2 to produce cigars to use it in seqwish. Add -c to the mapping command. Otherwise there is no way to know what base pairs map to others.

On Wed, Sep 16, 2020, 11:13 rickbeeloo notifications@github.com wrote:

Hi @ekg https://github.com/ekg, thanks again! I ran the exact steps you suggested however some unexpected things appear :(

I first created the new FASTA, with unitigs and the original sequences, as you suggested: 1. cat query.fasta source.fa > s.fa Which thus looks like this:

utg0000001l ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCG ACATCGATCGACTG utg0000002l ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCA TCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACT GACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCT ATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCG ACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC utg0000003l TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGC TACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC genome1 ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC genome2 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC genome3 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC

2. minimap2 -X s.fa s.fa >s.paf (not as expected)

The s.paf content:

genome1 363 24 356 + genome2 363 24 356 332 332 0 tp:A:S cm:i:58 s1:i:332 dv:f:0.0011 rl:i:0 genome1 363 83 356 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome1 363 83 356 + genome3 399 119 392 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0.0014 rl:i:0 genome1 363 24 74 + utg0000001l 74 24 74 50 50 0 tp:A:S cm:i:9 s1:i:50 dv:f:0.0070 rl:i:0 genome2 363 83 356 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome2 363 83 356 + genome3 399 119 392 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0.0014 rl:i:0 genome2 363 4 74 + utg0000001l 74 4 74 70 70 0 tp:A:S cm:i:12 s1:i:70 dv:f:0 rl:i:0 genome3 399 119 392 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome3 399 2 102 + utg0000003l 110 2 102 100 100 0 tp:A:S cm:i:14 s1:i:100 dv:f:0 rl:i:0

We know that genome1 and genome2 are identical except for one SNP. utg0000001l should cover positions 1-74 of both genome1 and genome2. In fact, utg0000001l is identical to positions 1-74 of genome2. However, upon inspection of this output we notice that neither of the genomes is correctly covered (does not start at 1) nor are they equally aligned (different start positions):

genome1 363 24 74 + utg0000001l genome2 363 4 74 + utg0000001l

Perhaps parameter settings can fix this or I can make an issue on the Github of Heng Li, however, I then wonder how reliable this approach is, especially since utg0000001l is even identical to the first part of genome2. Moreover, we plan to extend this to thousands of genomes.

I just continued regardless just to see what the GFA would look like in the end.

3. seqwish -s s.fa -p s.paf -g s.gfa With s.gfa looking like:

H VN:Z:1.0 S 1 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG S 2 ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 3 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC S 4 ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 5 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 6 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC P utg0000001l 1+ P utg0000002l 2+ P utg0000003l 3+ P genome1 4+ P genome2 5+ P genome3 6+

Here we see that each sequence gets its own node rather than the overlap it should have.

Any suggestions?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2979#issuecomment-693280172, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEMVSNQB62W7T4OGXRLSGB6TDANCNFSM4RLXDCOQ .

ekg commented 3 years ago

I'll add a warning to seqwish when this is encountered. It's an obvious gotcha. Sorry for the confusion.

On Wed, Sep 16, 2020, 14:38 Erik Garrison erik.garrison@gmail.com wrote:

You need to tell minimap2 to produce cigars to use it in seqwish. Add -c to the mapping command. Otherwise there is no way to know what base pairs map to others.

On Wed, Sep 16, 2020, 11:13 rickbeeloo notifications@github.com wrote:

Hi @ekg https://github.com/ekg, thanks again! I ran the exact steps you suggested however some unexpected things appear :(

I first created the new FASTA, with unitigs and the original sequences, as you suggested: 1. cat query.fasta source.fa > s.fa Which thus looks like this:

utg0000001l ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCG ACATCGATCGACTG utg0000002l ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCA TCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACT GACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCT ATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCG ACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC utg0000003l TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGC TACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC genome1 ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC genome2 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC genome3 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC

2. minimap2 -X s.fa s.fa >s.paf (not as expected)

The s.paf content:

genome1 363 24 356 + genome2 363 24 356 332 332 0 tp:A:S cm:i:58 s1:i:332 dv:f:0.0011 rl:i:0 genome1 363 83 356 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome1 363 83 356 + genome3 399 119 392 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0.0014 rl:i:0 genome1 363 24 74 + utg0000001l 74 24 74 50 50 0 tp:A:S cm:i:9 s1:i:50 dv:f:0.0070 rl:i:0 genome2 363 83 356 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome2 363 83 356 + genome3 399 119 392 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0.0014 rl:i:0 genome2 363 4 74 + utg0000001l 74 4 74 70 70 0 tp:A:S cm:i:12 s1:i:70 dv:f:0 rl:i:0 genome3 399 119 392 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0 genome3 399 2 102 + utg0000003l 110 2 102 100 100 0 tp:A:S cm:i:14 s1:i:100 dv:f:0 rl:i:0

We know that genome1 and genome2 are identical except for one SNP. utg0000001l should cover positions 1-74 of both genome1 and genome2. In fact, utg0000001l is identical to positions 1-74 of genome2. However, upon inspection of this output we notice that neither of the genomes is correctly covered (does not start at 1) nor are they equally aligned (different start positions):

genome1 363 24 74 + utg0000001l genome2 363 4 74 + utg0000001l

Perhaps parameter settings can fix this or I can make an issue on the Github of Heng Li, however, I then wonder how reliable this approach is, especially since utg0000001l is even identical to the first part of genome2. Moreover, we plan to extend this to thousands of genomes.

I just continued regardless just to see what the GFA would look like in the end.

3. seqwish -s s.fa -p s.paf -g s.gfa With s.gfa looking like:

H VN:Z:1.0 S 1 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG S 2 ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 3 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC S 4 ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 5 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC S 6 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC P utg0000001l 1+ P utg0000002l 2+ P utg0000003l 3+ P genome1 4+ P genome2 5+ P genome3 6+

Here we see that each sequence gets its own node rather than the overlap it should have.

Any suggestions?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2979#issuecomment-693280172, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEMVSNQB62W7T4OGXRLSGB6TDANCNFSM4RLXDCOQ .

rickbeeloo commented 3 years ago

Thanks again @ekg, however, an interesting problem remains. I was looking at how I could format the output and noticed the following:

So what we see here is that the bubble we initially popped, namely the SNP at position 22 (see the first figure I sent) re-appears here. This makes sense as the bubble popping took the genome2 sequence with G at 22 as "replacement" rather than C from genome1. This is detected by SeqWish resulting in a separate node: S 13 C PO:J:[["genome1","+",21]]. Thus this approach does not seem to work with bubble popping unless we:

Would somehow remember where bubble popping occurred
Post-process the a.gfa where we assume that small windows of "missing coverage" were previous bubbles - and thus ignore this.
Perhaps replace the nucleotides in a bubble for N
Just use the minimap output directly? (see below)

Minimap Perhaps we can just use the minmap output directly as this correctly tells us that e.g. utg0000001l covers 1-74 of genome1 and genome2. To make this more visual I plotted the minimap output where the blocks represent the found alignments for each genome (the unitigs in orange):

Each unitig indeed correctly covers the genomes. In this way, we can still extract the unitig sequences from the unitig GFA and their genomic locations from minimap. In fact, I suppose we can take the unitig GFA and store the minimap positions in the JSON - creating the "location graph". Perhaps I'm missing something in my reason, as I suppose you proposed re-inducing the graph using seqwish for a reason?

The a.gfa (in case you wonder/need this)

H   VN:Z:1.0
S   1   ATCGACTGACACGATCGACTA   PO:J:[["utg0000001l","+",0],["genome1","+",0],["genome2","+",0]]
L   1   +   2   +   0M
L   1   +   13  +   0M
S   2   G   PO:J:[["utg0000001l","+",21],["genome2","+",21]]
L   2   +   3   +   0M
S   3   GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACAT  PO:J:[["utg0000001l","+",22],["genome1","+",22],["genome2","+",22]]
L   3   +   4   +   0M
S   4   CGATCGA PO:J:[["utg0000001l","+",64],["utg0000003l","+",98],["genome1","+",64],["genome2","+",64],["genome3","+",98]]
L   4   +   5   +   0M
L   4   +   10  +   0M
S   5   C   PO:J:[["utg0000001l","+",71],["utg0000003l","+",106],["genome1","+",71],["genome2","+",71],["genome3","+",106]]
L   5   +   6   +   0M
L   5   +   11  +   0M
S   6   T   PO:J:[["utg0000001l","+",72],["genome1","+",72],["genome2","+",72]]
L   6   +   7   +   0M
S   7   G   PO:J:[["utg0000001l","+",73],["utg0000003l","+",108],["genome1","+",73],["genome2","+",73],["genome3","+",108]]
L   7   +   8   +   0M
L   7   +   12  +   0M
S   8   ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC   PO:J:[["utg0000002l","+",0],["genome1","+",74],["genome2","+",74],["genome3","+",110]]
S   9   TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTA  PO:J:[["utg0000003l","+",0],["genome3","+",0]]
L   9   +   4   +   0M
S   10  T   PO:J:[["utg0000003l","+",105],["genome3","+",105]]
L   10  +   5   +   0M
S   11  A   PO:J:[["utg0000003l","+",107],["genome3","+",107]]
L   11  +   7   +   0M
S   12  C   PO:J:[["utg0000003l","+",109],["genome3","+",109]]
L   12  +   8   +   0M
S   13  C   PO:J:[["genome1","+",21]]
L   13  +   3   +   0M
P   utg0000001l 1+,2+,3+,4+,5+,6+,7+    21M,1M,42M,7M,1M,1M,1M
P   utg0000002l 8+  289M
P   utg0000003l 9+,4+,10+,5+,11+,7+,12+ 98M,7M,1M,1M,1M,1M,1M
P   genome1 1+,13+,3+,4+,5+,6+,7+,8+    21M,1M,42M,7M,1M,1M,1M,289M
P   genome2 1+,2+,3+,4+,5+,6+,7+,8+ 21M,1M,42M,7M,1M,1M,1M,289M
P   genome3 9+,4+,10+,5+,11+,7+,12+,8+  98M,7M,1M,1M,1M,1M,1M,289M

ekg commented 3 years ago

Just out of curiosity, can you use vg deconstruct? That should give you what you're interested in (variants vs. embedded references). You'd need to pick the reference sequence(s) when you run it. Then you'll get a VCF describing the bubbles versus the reference sequences.

In the near future (sometime in the next week) we'll have a clean way to get MAF out of these graphs, using smoothxg. That should also solve your problem.

In fact, I suppose we can take the unitig GFA and store the minimap positions in the JSON - creating the "location graph". Perhaps I'm missing something in my reason, as I suppose you proposed re-inducing the graph using seqwish for a reason?

I am a little confused by this. The unitigs are embedded in the graph with the seqwish step. The xg index makes accessing and determining the positions randomly fast (not a problem for your small graph, but potentially useful on big ones). You don't need a separate kind of graph to represent the mapping. The graph directly represents it. What's missing for you is something that gives you coordinates for things that are near existing reference sequences. This could be given by vg deconstruct. We might also extend the positional annotation to give the nearest coordinates on any path for sequences that don't have any overlapping paths.

Sorry if this is trouble. It's helpful to me to work through a typical use case. I want to build a tool on top of xg that provides the kinds of interface that you're interested in here. As you see, xg nearly does this, but it's not user-friendly and doesn't quite answer all the questions you have.

p.s. maffer might be able to make MAF from this, solving your problem in yet another way.

rickbeeloo commented 3 years ago

Use case @ekg I'm a PhD student of Bas and we will have a meeting tomorrow so we can discuss everything there in more detail. In short, we want to extract shared sequences between genomes while allowing some variations (hence the bubble pop) and also retain sequences only present in a single genome. For all those sequences we want to know in what genomes they were present and at what locations. So basically:

ATCGATCGACTAGCACGACATC... genome1[start,end], genome2[start,end]...
GGGGGGGGGGGGACGACATC... genome10[start,end], genome3[start,end]...
CTAGCTACGGATCGACTACTGF... genome3[start,end]

Minimap confusion

I am a little confused by this. The unitigs are embedded in the graph with the seqwish step

Here I meant not running seqwish and gx at all. So original graph > bubble pop > unitigs. Then unitig + source fasta > minimap, this would already tell us the location of each unitig in the genomes (while allowing some variability). However, preferably we would not align everything over again as this would consume a lot of extra time for large sequence collections.

Maffer I just tried maffer on the original graph (before bubble pop etc.) however the alignment breaks on the SNP (of note I did not use odgi sort yet - if this would make a difference). Moreover, looking at the second block, we preferably want something like this to appear without the big gap for genome3, thus having the shared sequence between genome1 and genome2 (and not genome3) separately: GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG


a loops=false
s genome2 0 21 + 363 ATCGACTGACACGATCGACTA-
s genome1 0 22 + 363 ATCGACTGACACGATCGACTAC

a loops=false
s genome3 110 289 + 399 ----------------------------------------------------ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
s genome2  22 341 + 363 GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
s genome1  22 341 + 363 GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC

a loops=false
s genome2 21 1 + 363 G

a loops=false
s genome3 0 110 + 399 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC

Vg deconstruct I'm not sure what the references here should be?

If you have any other suggestions that would be useful to try before the meeting tomorrow, let me know :)