jasperlinthorst / reveal

Graph based multi genome aligner
MIT License

can't read in GFA generated by vg #13

Closed tdlong closed 7 years ago

tdlong commented 7 years ago

I generated a *.gfa file from an alignment in vg. When I run:

reveal bubbles -r ref test.hal2vg.gfa >test.hal2vg.bubbles
Traceback (most recent call last):
  File "/usr/local/bin/reveal", line 11, in load_entry_point('reveal==0.1', 'console_scripts', 'reveal')()
  File "build/bdist.linux-x86_64/egg/reveal/reveal.py", line 157, in main
  File "build/bdist.linux-x86_64/egg/reveal/bubbles.py", line 12, in bubbles_cmd
  File "build/bdist.linux-x86_64/egg/reveal/utils.py", line 155, in read_gfa
AssertionError

I made the gfa file using

vg view -pn test.hal2vg.vg > test.hal2vg.gfa

Here is the gfa file (54K in size):

http://wfitch.bio.uci.edu/~tdlong/test.hal2vg.gfa.gz

Are there different flavors of gfa??

Tony

jasperlinthorst commented 7 years ago

Hi Tony, the problem here seems to be that vg outputs edges with different orientations, probably to allow for inverting edges or so. Graphs seem to be encoded in a different way, and since I don't use that information, the parser fails when it encounters the same edge twice. Although I'm sure that there must be a way around this problem, I don't really see any short-term fixes at the moment. Sorry.

Cheers, Jasper

tdlong commented 7 years ago

Jasper:

Thanks for the quick reply.

I see, that is helpful. It tells me (I think) that ".gfa" is not really a standard format. It sort of sucks, since I think VG has a lot of wind behind it. It would be nice if there were a way for different programs to talk to one another. But non-standard "standard" formats have plagued bioinformatics for years...

Is there some way to specify exactly what your program expects the ".gfa" to look like? Then I could attempt to write a translator. I am really only interested in representing graphs as something like a VCF. The concept of an allele at some position in the genome is very useful for carrying out GWAS-type associations, so it is useful to be able to move between descriptions.

jasperlinthorst commented 7 years ago

Hi Tony, I get your points and I agree. Although it has to be noted that GFA was intended to be some sort of standard to represent assembly graphs rather than these kinds of alignment graphs. Also, I have to make certain assumptions about the graphs that I'm working with; in my case this means that the graphs are acyclic and directional (not really the bidirectional kind of encoding that VG used here).

About solutions: I think there are a couple of ways to resolve these problems, first, simply put the six sequences that correspond to the 6 paths directly into reveal and use this graph to extract the vcf-like representation that you would get with the 'reveal bubbles' subcommand. Secondly, we could see if there's a way to preprocess/convert the 'double-edge bidirectional' representation from vg to a single edge directional representation, such that it can be handled by reveal. I would like to implement a proper GFA parser that would incorporate the second solution, but, as I'm a bit short on time it would take me a while before I can start to work on this. Help is appreciated and I would be happy to incorporate any contribution from your side.
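The second option could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not reveal's parser: it assumes GFA1 `L` lines with tab-separated fields (from, from-orientation, to, to-orientation, overlap), treats `a+ -> b+` and `b- -> a-` as the same edge, keeps one all-forward copy of each, and reports any link that has no all-forward form (for example a real inversion) instead of guessing. The names `canonical_link` and `dedupe_links` are hypothetical.

```python
def canonical_link(frm, frm_orient, to, to_orient):
    """Return a link and its reverse-complement twin (the same bidirected edge)."""
    flip = {"+": "-", "-": "+"}
    forward = (frm, frm_orient, to, to_orient)
    twin = (to, flip[to_orient], frm, flip[frm_orient])
    return forward, twin


def dedupe_links(gfa_lines):
    """Collapse double-edge bidirectional links into single forward links.

    Returns (kept_lines, unresolved_links). Non-link lines pass through unchanged.
    """
    kept, unresolved, seen = [], [], set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "L":
            kept.append(line.rstrip("\n"))
            continue
        frm, frm_orient, to, to_orient = fields[1:5]
        overlap = fields[5] if len(fields) > 5 else "*"
        forward, twin = canonical_link(frm, frm_orient, to, to_orient)
        # Pick whichever of the two equivalent forms is all-forward (+/+).
        if forward[1] == "+" and forward[3] == "+":
            chosen = forward
        elif twin[1] == "+" and twin[3] == "+":
            chosen = twin
        else:
            # Mixed orientation: cannot be expressed as a forward-only edge.
            unresolved.append(forward)
            continue
        if chosen in seen:
            continue  # the same edge was already written in its other form
        seen.add(chosen)
        kept.append("\t".join(("L",) + chosen + (overlap,)))
    return kept, unresolved
```

If `unresolved` comes back non-empty, the graph is probably not representable as the acyclic, directional graph described above, and a converter like this would not be enough.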

Hope this helps.

Cheers, Jasper

tdlong commented 7 years ago

Jasper:

I write software too! I get it.

I am trying to work with your software now. I aligned two sequences and compared reveal's bubbles to a full-blown alignment, and I think reveal is getting all the variants correct. So this is very good.

results not shown

#############################

Now I am not sure how to run it with multiple sequences.

It doesn't seem to be getting the correct answer. The bubbles output includes numbered alleles (e.g., 0, 1, 2) that seem to correspond to the alleles listed in the "variant" field, but some of the genotypes also include a "-". It also seems to be missing some SNP events.

(I split up the MFAs above, but excluded allele5.fasta.)

reveal align ref.fasta allele1.fasta allele2.fasta allele3.fasta allele4.fasta
mv ref_allele1_allele2_allele3_allele4.gfa first4.gfa
reveal bubbles -r ref.fasta first4.gfa > first4.bubbles
cat first4.bubbles | cut -f 5,7-12

pos    allele1.fasta  allele2.fasta  allele3.fasta  allele4.fasta  ref.fasta
55030  -              0              1              2              2
98151  -              -              0              1              1
92570  -              -              0              1              1
75817  -              -              1              0              0
59168  -              -              1              0              0
64045  -              -              -              1              0
62952  -              -              -              0              1
62057  -              -              -              1              0
61183  -              -              -              0              1
46679  -              0              1              0              0
44212  -              1              0              1              1
32216  -              0              1              0              0
30070  -              1              0              0              0
28773  -              0              1              0              0
23829  -              1              0              0              0
23318  -              1              0              1              1
565    -              1              0              1              1

cat first4.bubbles | cut -f 5-6 | cut -c-80

pos    variant
55030  N,N,N
98151  ACAAGCAGAACTAGCGTAGCGTCAACACTGTCTTCCTGATGAGCTTTAACAGGGTAAAACTCTGACAAGTAAGT
92570  C,A
75817  A,G
59168  N,N
64045  G,T
62952  TTGCGGGTGTGCTTGATG,-
62057  C,T
61183  AAT,-
46679  ATTGACTTTTGATC,-
44212  TCTCAAACCGCAGAGTTGGGGCTGCAGTCATTTTGGTCGGTCTAGGCGACGGAGCCTAAGCGCGTCCAAGTTTA
32216  GCATCCCAGCAG,-
30070  T,A
28773  C,G
23829  AATCCTTGCTCTTAAT,-
23318  GTTATACGACTGG,-
565    ATGTAGTATGTGCATATATCGAGGGTACACTGTACCTATAAGTACACAGCAACACTTAGTTGCATTGCATAAATAA
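To see at a glance which sites carry a "-" for some samples, the genotype columns from the `cut -f 5,7-12` output can be tabulated. What follows is a minimal sketch, assuming whitespace-separated rows in the layout shown (a position followed by one genotype per sample) and treating "-" as "no call for this sample", which is an assumption on my part; `partial_sites` is a hypothetical helper, not part of reveal.

```python
def partial_sites(table_text):
    """Return {position: [samples with '-']} for rows that have at least one '-'."""
    rows = [line.split() for line in table_text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    samples = header[1:]  # first column is the position
    result = {}
    for row in body:
        pos, calls = int(row[0]), row[1:]
        missing = [s for s, c in zip(samples, calls) if c == "-"]
        if missing:
            result[pos] = missing
    return result


# Two rows copied from the genotype output above.
example = """\
pos allele1.fasta allele2.fasta allele3.fasta allele4.fasta ref.fasta
55030 - 0 1 2 2
98151 - - 0 1 1
"""
print(partial_sites(example))
```

In the output above, every site has "-" for at least one sample, so no site here is called across all five inputs.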

jasperlinthorst commented 7 years ago

Hi Tony, I think things seem to work as intended. The problem here is that reveal creates nested/complex bubbles at sites where it can't properly anchor all five alleles (there's too much variation), as a result you get the variant calls that only apply to a subset of the alleles that you put in. What you could try is to see if reducing the minimum anchor length (-m) would help you and/or increase the -e value. Alternatively you can completely suppress the construction of these complex bubbles by specifying the minimum number of samples to anchor to 5, by supplying the -n5 option.

I don't know if you already did this, but try loading the resulting gfa in something like Bandage, it might help you to see what's going on. If you specify the --gml flag, reveal produces a graph in graphml format which you should be able to load in something like cytoscape, which should give you even more insight into what's going on.

If you send me the resulting gfa I can have a look at it.

Cheers, Jasper

tdlong commented 7 years ago

Jasper:

I will try this and try to send you some multi-sequence alignments later today.

These are a set of 5 artificial sequences. So they are like a positive control. We want to estimate the false positive and false negative rates for control sequences before calling variants genome-wide. There is not really all that much variation at all in these sequences. Most of the variation consists of large insertions and deletions affecting one allele. You can see this from the alignments (I will send). So I am concerned that the “bubbles” command is working, but the initial alignments are not correct.

Historically we have aligned using progressive cactus (I think last or lastz is the backend). This produces a “hal” file. The hal file is difficult to query directly (as its format is poorly described). But if you convert the hal to an MSA, the alignments look to be correct. Similarly, a cool thing about hal is that the Santa Cruz Genome Browser can view them. When you manually look at alignments in SCGB, they look correct.

tdlong commented 7 years ago

Jasper:

I took the reference and the first 4 alleles and aligned them using a traditional aligner. The reference is from Drosophila, and then we added variants artificially (but those variants could be known TEs).

http://www.ebi.ac.uk/Tools/ Tools > Multiple Sequence Alignment > Kalign

All 5 samples align really well, but the alleles differ by several INDELs, including long ones. Below is the alignment, which I have edited for clarity. I deleted large hunks of the alignment where the alignment was perfect (except perhaps for SNPs, which I suspect the graphs do a good job of). In most cases the graph that the aligner spits out should be able to detect these INDELs as single bubbles.

We included a couple of cases where there is a variant within a bubble.
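For a positive control like this, the expected INDELs can be listed directly from the alignment rows and then compared against the bubbles output. This is only a minimal sketch, assuming each aligned row is available as a string with "-" for gaps; `gap_runs` is a hypothetical helper, and the coordinates it returns are alignment columns, not reference positions.

```python
import re


def gap_runs(aligned):
    """Return (start_column, length) for every run of '-' in an aligned row."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", aligned)]


# Fragment of one alignment block: allele3 carries a 13 bp insertion
# (GTTATACGACTGG) that appears as a gap run in the other rows.
rows = {
    "ref": "ATGGTCAG" + "-" * 13 + "CAATTG",
    "allele3": "ATGGTCAGGTTATACGACTGGCAATTG",
}
for name, seq in rows.items():
    print(name, gap_runs(seq))
```

A gap run shared by all rows but one marks an insertion in that one allele; a gap run in a single row marks a deletion in it.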

CLUSTAL multiple sequence alignment by Kalign (2.0)

ref       AGAAAATCGAAGAAGTCTAAGAAGACTACTACTGATAATGTGGTTGAATCAGCCGTCGAT
allele1   AGAAAATCGAAGAAGTCTAAGAAGACTACTACTGATAATGTGGTTGAATCAGCCGTCGAT
allele2   AGAAAATCGAAGAAGTCTAAGAAGACTACTACTGATAATGTGGTTGAATCAGCCGTCGAT
allele3   AGAAAATCGAAGAAGTCTAAGAAGACTACTACTGATAATGTGGTTGAATCAGCCGTCGAT
allele4   AGAAAATCGAAGAAGTCTAAGAAGACTACTACTGATAATGTGGTTGAATCAGCCGTCGAT

Delete several KB

ref       ATTTCCGTGCCATCTAAGAAAGAT------------------------------------
allele1   ATTTCCGTGCCATCTAAGAAAGAT------------------------------------
allele2   ATTTCCGTGCCATCTAAGAAAGAT------------------------------------
allele3   ATTTCCGTGCCATCTAAGAAAGATATGTAGTATGTGCATATATCGAGGGTACACTGTACC
allele4   ATTTCCGTGCCATCTAAGAAAGAT------------------------------------

Delete several KB

ref       -------------------------------------------------------ACGCA
allele1   -------------------------------------------------------ACGCA
allele2   -------------------------------------------------------ACGCA
allele3   CCCAACTGCAAGGAAAACACGTGTTCTCAATTGGTGGCATATATTGGTTTATTACACGCA
allele4   -------------------------------------------------------ACGCA

ref       ATCGGTCCTTCCCTCTGTAGGGAAAAAGGAAACTGTGGTCGAAAAATCTGTGATAAAGAA
allele1   ATCGGTCCTTCCCTCTGTAGGGAAAAAGGAAACTGTGGTCGAAAAATCTGTGATAAAGAA
allele2   ATCGGTCCTTCCCTCTGTAGGGAAAAAGGAAACTGTGGTCGAAAAATCTGTGATAAAGAA
allele3   ATCGGTCCTTCCCTCTGTAGGGAAAAAGGAAACTGTGGTCGAAAAATCTGTGATAAAGAA
allele4   ATCGGTCCTTCCCTCTGTAGGGAAAAAGGAAACTGTGGTCGAAAAATCTGTGATAAAGAA

Delete several KB

ref       TCCAGACGATGAAGGCATTGACGAAGTGGCGGTAAAGTCACCCACTGATCAGCCAACAAC
allele1   TCCAGACGATGAAGGCATTGACGAAGTGGCGGTAAAGTCACCCACTGATCAGCCAACAAC
allele2   TCCAGACGATGAAGGCATTGACGAAGTGGCGGTAAAGTCACCCACTGATCAGCCAACAAC
allele3   TCCAGACGATGAAGGCATTGACGAAGTGGCGGTAAAGTCACCCACTGATCAGCCAACAAC
allele4   TCCAGACGATGAAGGCATTGACGAAGTGGCGGTAAAGTCACCCACTGATCAGCCAACAAC

ref       ATGGTCAG-------------CAATTGTCCAAACCGAGACAACTGTGTTCCCAACACCAG
allele1   ATGGTCAG-------------CAATTGTCCAAACCGAGACAACTGTGTTCCCAACACCAG
allele2   ATGGTCAG-------------CAATTGTCCAAACCGAGACAACTGTGTTCCCAACACCAG
allele3   ATGGTCAGGTTATACGACTGGCAATTGTCCAAACCGAGACAACTGTGTTCCCAACACCAG
allele4   ATGGTCAG-------------CAATTGTCCAAACCGAGACAACTGTGTTCCCAACACCAG

ref       ATGTACAAAAGGACAATGAAAAGACTCCTGATTGGTCTACTCATGTTGTGAGCAAAACAA
allele1   ATGTACAAAAGGACAATGAAAAGACTCCTGATTGGTCTACTCATGTTGTGAGCAAAACAA
allele2   ATGTACAAAAGGACAATGAAAAGACTCCTGATTGGTCTACTCATGTTGTGAGCAAAACAA
allele3   ATGTACAAAAGGACAATGAAAAGACTCCTGATTGGTCTACTCATGTTGTGAGCAAAACAA
allele4   ATGTACAAAAGGACAATGAAAAGACTCCTGATTGGTCTACTCATGTTGTGAGCAAAACAA

Delete a few hundred bp

ref       TAGAAGATTATGTGATAATTGAACCCGAAGCGATACCTGAAATCAATTCCGAAATCCTTG
allele1   TAGAAGATTATGTGATAATTGAACCCGAAGCGATACCTGAAATCAATTCCGAAATCCTTG
allele2   TAGAAGATTATGTGATAATTGAACCCGAAGCGATACCTGAAATCAATTCCGA--------
allele3   TAGAAGATTATGTGATAATTGAACCCGAAGCGATACCTGAAATCAATTCCGAAATCCTTG
allele4   TAGAAGATTATGTGATAATTGAACCCGAAGCGATACCTGAAATCAATTCCGAAATCCTTG

ref       CTCTTAATCTTTATCTTGACCAAACCAATATATTGCCCAAGACTCGTATAGATGTCGACA
allele1   CTCTTAATCTTTATCTTGACCAAACCAATATATTGCCCAAGACTCGTATAGATGTCGACA
allele2   --------CTTTATCTTGACCAAACCAATATATTGCCCAAGACTCGTATAGATGTCGACA
allele3   CTCTTAATCTTTATCTTGACCAAACCAATATATTGCCCAAGACTCGTATAGATGTCGACA
allele4   CTCTTAATCTTTATCTTGACCAAACCAATATATTGCCCAAGACTCGTATAGATGTCGACA

Delete several kb

ref       CCCACCGAAAACATTATGGTAACACAAACGGTGCATCATGGCCAGGAAACGATCCAGATC
allele1   CCCACCGAAAACATTATGGTAACACAAACGGTGCATCATGGC------------------
allele2   CCCACCGAAAACATTATGGTAACACAAACGGTGCATCATGGCCAGGAAACGATCCAGATC
allele3   CCCACCGAAAACATTATGGTAACACAAACGGTGCATCATGGCCAGGAAACGATCCAGATC
allele4   CCCACCGAAAACATTATGGTAACACAAACGGTGCATCATGGCCAGGAAACGATCCAGATC

ref       GACACGACTCGCAACAAGGATGTGCCCGATGAACCCGAAGATGTCCAGATTGAGGCTCGC
allele1   ------------------------------------------------------------
allele2   GACACGACTCGCAACAAGGATGTGCCCGATGAACCCGAAGATGTCCAGATTGAGGCTCGC
allele3   GACACGACTCGCAACAAGGATGTGCCCGATGAACCCGAAGATGTCCAGATTGAGGCTCGC
allele4   GACACGACTCGCAACAAGGATGTGCCCGATGAACCCGAAGATGTCCAGATTGAGGCTCGC

ref       TACCATCAACGACCCAAGGGCGATGTGGACCGCGCCACCGAGCTGATCCTGAAAAATGTA
allele1   ------------------------------------------------------------
allele2   TACCATCAACGACCCAAGGGCGATGTGGACCGCGCCACCGAGCTGATCCTGAAAAATGTA
allele3   TACCATCAACGACCCAAGGGCGATGTGGACCGCGCCACCGAGCTGATCCTGAAAAATGTA
allele4   TACCATCAACGACCCAAGGGCGATGTGGACCGCGCCACCGAGCTGATCCTGAAAAATGTA

ref       CCTCAGGCATTCGAAACCACTTTCGTGGAGCCAGATGAGACGACCACCGAAGTGATTGTG
allele1   -----------------CACTTTCGTGGAGCCAGATGAGACGACCACCGAAGTGATTGTG
allele2   CCTCAGGCATTCGAAACCACTTTCGTGGAGCCAGATGAGACGACCACCGAAGTGATTGTG
allele3   CCTCAGGCATTCGAAAGCACTTTCGTGGAGCCAGATGAGACGACCACCGAAGTGATTGTG
allele4   CCTCAGGCATTCGAAACCACTTTCGTGGAGCCAGATGAGACGACCACCGAAGTGATTGTG

Delete several kb

ref       CTGCCAGTGCATATAACTGAGCAGAACAAGGTGCTCATTGCATCCCAGCAGAGCAAACGA
allele1   CTGCCAGTGCATATAACTGAGCAGAACAAGGTGCTCATTGCATCCCAGCAGAGCAAACGA
allele2   CTGCCAGTGCATATAACTGAGCAGAACAAGGTGCTCATTGCATCCCAGCAGAGCAAACGA
allele3   CTGCCAGTGCATATAACTGAGCAGAACAAGGTGCTCATT------------AGCAAACGA
allele4   CTGCCAGTGCATATAACTGAGCAGAACAAGGTGCTCATTGCATCCCAGCAGAGCAAACGA

ref       TCCGGAGCAGGACCCACATCCTCAGCGGTGACCATCGAAGAGGTGGGCTCACCCACGGAG
allele1   TCCGGAGCAGGACCCACATCCTCAGCGGTGACCATCGAAGAGGTGGGCTCACCCACGGAG
allele2   TCCGGAGCAGGACCCACATCCTCAGCGGTGACCATCGAAGAGGTGGGCTCACCCACGGAG
allele3   TCCGGAGCAGGACCCACATCCTCAGCGGTGACCATCGAAGAGGTGGGCTCACCCACGGAG
allele4   TCCGGAGCAGGACCCACATCCTCAGCGGTGACCATCGAAGAGGTGGGCTCACCCACGGAG

Delete several kb

ref       GAGCGAGAGCTGCAGGAGATCTACCTGACCATGACCAGCATGAAGGGAGTCATCAAGAAC
allele1   GAGCGAGAGCTGCAGGAGATCTACCTGACCATGACCAGCATGAAGGGAGTCATCAAGAAC
allele2   GAGCGAGAGCTGCAGGAGATCTACCTGACCATGACCAGCATGAAGGGAGTCATCAAGAAC
allele3   GAGCGAGAGCTGCAGGAGATCTACCTGACCATGACCAGCATGAAGGGAGTCATCAAGAAC
allele4   GAGCGAGAGCTGCAGGAGATCTACCTGACCATGACCAGCATGAAGGGAGTCATCAAGAAC

ref       GAGGAGGAGCTCTGTCTGTACATCGAACGAGTCCAAGTGCTACGCACT------------
allele1   GAGGAGGAGCTCTGTCTGTACATCGAACGAGTCCAAGTGCTACGCACTGTATCGTTTGTA
allele2   GAGGAGGAGCTCTGTCTGTACATCGAACGAGTCCAAGTGCTACGCACT------------
allele3   GAGGAGGAGCTCTGTCTGTACATCGAACGAGTCCAAGTGCTACGCACT------------
allele4   GAGGAGGAGCTCTGTCTGTACATCGAACGAGTCCAAGTGCTACGCACT------------

ref       ------------------------------------------------------------
allele1   TCCTAAATATACCTTTGGGATTCGTTTGGTACGATCTTACTGTTGGCAATCTTTGGTGTT
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   ------------------------------------------------------------

Delete a few hundred bp

ref       ------------------------------------------------------------
allele1   ATTGCTACAAGCGAATCATATCCAGGATATATGCAGTGCAGTATCCCCGGCGATGCAGGT
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   ------------------------------------------------------------

ref       -----------------------------------------CGCGTCGGATTCATTGGCA
allele1   ATGGTGATCGATTCTCGACTTGTTGTGGCACTGCAAATGTTCGCGTCGGATTCATTGGCA
allele2   -----------------------------------------CGCGTCGGATTCATTGGCA
allele3   -----------------------------------------CGCGTCGGATTCATTGGCA
allele4   -----------------------------------------CGCGTCGGATTCATTGGCA

ref       ACGAGCTTGGCAGGATCGGACTGCAAGAGCCCGCCATTGAGCCGGAAAAAGTTGGTGAGC
allele1   ACGAGCTTGGCAGGATCGGACTGCAAGAGCCCGCCATTGAGCCGGAAAAAGTTGGTGAGC
allele2   ACGAGCTTGGCAGGATCGGACTGCAAGAGCCCGCCATTGAGCCGGAAAAAGTTGGTGAGC
allele3   ACGAGCTTGGCAGGATCGGACTGCAAGAGCCCGCCATTGAGCCGGAAAAAGTTGGTGAGC
allele4   ACGAGCTTGGCAGGATCGGACTGCAAGAGCCCGCCATTGAGCCGGAAAAAGTTGGTGAGC

Delete several kb

ref       TGTCTAGTGAGGTAAGTCCAAGTGTCCCATATTCATTGCCCAACACCTTGGTTTCCTGTT
allele1   TGTCTAGTGAGGTAAGTCCAAGTGTCCCATATTCATTGCCCAACACCTTGGTTTCCTGTT
allele2   TGTCTAGTGAGGTAAGTCCAAGTGTCCCATATTCATTGCCCAACACCTTGGTTTCCTGTT
allele3   TGTCTAGTGAGGTAAGTCCAAGTGTCCCATATTCATTGCCCAACACCTTGGTTTCCTGTT
allele4   TGTCTAGTGAGGTAAGTCCAAGTGTCCCATATTCATTGCCCAACACCTTGGTTTCCTGTT

ref       CAAAAAACATGATTTGTCCAAAAAAAA---------------------------------
allele1   CAAAAAACATGATTTGTCCAAAAAAAA---------------------------------
allele2   CAAAAAACATGATTTGTCCAAAAAAAA---------------------------------
allele3   CAAAAAACATGATTTGTCCAAAAAAAATTCTCAAACCGCAGAGTTGGGGCTGCAGTCATT
allele4   CAAAAAACATGATTTGTCCAAAAAAAA---------------------------------

ref       ------------------------------------------------------------
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   TTGGTCGGTCTAGGCGACGGAGCCTAAGCGCGTCCAAGTTTACATATTATAGCCGTCCTT
allele4   ------------------------------------------------------------

Delete a few hundred bp

ref       ------------------------------------------------------------
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   GGGAAACATGAATGCTTCTGCAGCCGCGCGGGAGCATTTTCGCCTACGTCTAGCGGCCCA
allele4   ------------------------------------------------------------

ref       ---------------------------------TGGAAAATAATTAAATTTTTGATTTGT
allele1   ---------------------------------TGGAAAATAATTAAATTTTTGATTTGT
allele2   ---------------------------------TGGAAAATAATTAAATTTTTGATTTGT
allele3   TCAGTCTCACCCTATATTTCATTTTAGTTCGCCTGGAAAATAATTAAATTTTTGATTTGT
allele4   ---------------------------------TGGAAAATAATTAAATTTTTGATTTGT

ref       TGATTTGTGAGCGCACCAGAAGTAAGTGAAACCCATTGGCAGTCTAACGAAAAAATATGA
allele1   TGATTTGTGAGCGCACCAGAAGTAAGTGAAACCCATTGGCAGTCTAACGAAAAAATATGA
allele2   TGATTTGTGAGCGCACCAGAAGTAAGTGAAACCCATTGGCAGTCTAACGAAAAAATATGA
allele3   TGATTTGTGAGCGCACCAGAAGTAAGTGAAACCCATTGGCAGTCTAACGAAAAAATATGA
allele4   TGATTTGTGAGCGCACCAGAAGTAAGTGAAACCCATTGGCAGTCTAACGAAAAAATATGA

Delete a few kb

ref       GCGCAGCCTGGAGCCCATGTTGTCGTATCCCCACGGACCGCCACCCACGTAAAGCTGTCG
allele1   GCGCAGCCTGGAGCCCATGTTGTCGTATCCCCACGGACCGCCACCCACGTAAAGCTGTCG
allele2   GCGCAGCCTGGAGCCCATGTTGTCGTATCCCCACGGACCGCCACCCACGTAAAGCTGTCG
allele3   GCGCAGCCTGGAGCCCATGTTGTCGTATCCCCACGGACCGCCACCCACGTAAAGCTGTCG
allele4   GCGCAGCCTGGAGCCCATGTTGTCGTATCCCCACGGACCGCCACCCACGTAAAGCTGTCG

ref       AGTTGCAATTGGA----------TTAGTGCAATGAAACACGATCAAAAAAAAAGTCAAGA
allele1   AGTTGCAATTGGAGATCCTATAGTTAGTGCAATGAAACACGATCAAAAAAAAAGTCAAGA
allele2   AGTTGCAATTGGA----------TTAGTGCAATGAAACACGATCAAAAAAAAAGTCAAGA
allele3   AGTTGCAATTGGA----------TTAGTGCAATGAAACACGATCAAAAAAAAAGTCAAGA
allele4   AGTTGCAATTGGA----------TTAGTGCAATGAAACACGATCAAAAAAAAAGTCAAGA

ref       AACCCTATACCCTAAGAAGAACCCCCTATGAAAAACGCCCTATGAAAAAACCCAGTTTAA
allele1   AACCCTATACCCTAAGAAGAACCCCCTATGAAAAACGCCCTATGAAAAAACCCAGTTTAA
allele2   AACCCTATACCCTAAGAAGAACCCCCTATGAAAAACGCCCTATGAAAAAACCCAGTTTAA
allele3   AACCCTATACCCTAAGAAGAACCCCCTATGAAAAACGCCCTATGAAAAAACCCAGTTTAA
allele4   AACCCTATACCCTAAGAAGAACCCCCTATGAAAAACGCCCTATGAAAAAACCCAGTTTAA

Delete several kb

ref       TGGCAGTGCGGATCTGACCTTTGATCAAATGTCCGAGCTTCCGCATCTGGACGCCTGCAT
allele1   TGGCAGTGCGGATCTGACCTTTGATCAAATGTCCGAGCTTCCGCATCTGGACGCCTGCAT
allele2   TGGCAGTGCGGATCTGACCTTTGATCAAATGTCCGAGCTTCCGCATCTGGACGCCTGCAT
allele3   TGGCAGTGCGGATCTGACCTTTGATCAAATGTCCGAGCTTCCGCATCTGGACGCCTGCAT
allele4   TGGCAGTGCGGATCTGACCTTTGATCAAATGTCCGAGCTTCCGCATCTGGACGCCTGCAT

ref       ATATGGTGGGAAATTCTATCTT--------------------------------------
allele1   ATATGGTGGGAAATTCTATCTTTCAGGTGCACTGTCTAGTATGTAGCTGCTTAACTTCAG
allele2   ATATGGTGGGAAATTCTATCTT--------------------------------------
allele3   ATATGGTGGGAAATTCTATCTT--------------------------------------
allele4   ATATGGTGGGAAATTCTATCTT--------------------------------------

ref       ------------------------------------------------------------
allele1   TCCGATGTGCTCTGGGTTAACCCACTACGCGCCACAATGAATGCTAGCGTCAGCGTTCGC
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   ------------------------------------------------------------

Delete a few hundred bp

ref       ------------------------------------------------------------
allele1   TCGGTGACACAAGACCTATCAGTGAAGAATGTTCTATGGGATATGCTCGTGTGCCCGGGT
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   ------------------------------------------------------------

ref       --------------------------------------------------------GATC
allele1   CTGCATAACTCACCTAGTGGCATAGCAGAAATCTATAAACCGCTCACTCGGGGTCGGATC
allele2   --------------------------------------------------------GATC
allele3   --------------------------------------------------------GATC
allele4   --------------------------------------------------------GATC

ref       TCTTGATATTCAGTTTCTTTAACTACATTATAAAATTTTATAATTTTAGAAACTTTGCGT
allele1   TCTTGATATTCAGTTTCTTTAACTACATTATAAAATTTTATCATTTTAGAAACTTTGCGT
allele2   TCTTGATATTCAGTTTCTTTAACTACATTATAAAATTTTATAATTTTAGAAACTTTGCGT
allele3   TCTTGATATTCAGTTTCTTTAACTACATTATAAAATTTTATAATTTTAGAAACTTTGCGT
allele4   TCTTGATATTCAGTTTCTTTAACTACATTATAAAATTTTATAATTTTAGAAACTTTGCGT

Delete several kb

ref       AAAAATAGGTGCTGGCACTTACTTCGGATCCACGTG-------------------AAAAC
allele1   AAAAATAGGTGCTGGCACTTACTTCGGATCCACGTG-------------------AAAAC
allele2   AAAAATAGGTGCTGGCACTTACTTCGGATCCACGTG-------------------AAAAC
allele3   AAAAATAGGTGCTGGCACTTACTTCGGATCCACGTGGCTACGTCGCTACTCTTAAAAAAC
allele4   AAAAATAGGTGCTGGCACTTACTTCGGATCCACGTG-------------------AAAAC

ref       CCATCAGCTCTATATTTTTATCAGGAATGCAAATTCCCTTTGCGAAGCTTTTAAAATTGG
allele1   CCATCAGCTCTATATTTTTATCAGGAATGCAAATTCCCTTTGCGAAGCTTTTAAAATTGG
allele2   CCATCAGCTCTATATTTTTATCAGGAATGCAAATTCCCTTTGCGAAGCTTTTAAAATTGG
allele3   CCATCAGCTCTATATTTTTATCAGGAATGCAAATTCCCTTTGCGAAGCTTTTAAAATTGG
allele4   CCATCAGCTCTATATTTTTATCAGGAATGCAAATTCCCTTTGCGAAGCTTTTAAAATTGG

Delete a couple of kb

ref       ACGTGTATAGCCAAAAGGAAATGTTGGCATTGGGACAAATTTCGTTTTGCAGGGAAAAAC
allele1   ACGTGTATAGCCAAAAGGAAATGTTGGCATTGGGACAAATTTCGTTTTGCAGGGAAAAAC
allele2   ACGTGTATAGCCAAAAGGAAATGTTGGCATTGGGACAAATTTCG----------------
allele3   ACGTGTATAGCCAAAAGGAAATGTTGGCATTGGGACAAATTTCGTTTTGCAGGGAAAAAC
allele4   ACGTGTATAGCCAAAAGGAAATGTTGGCATTGGGACAAATTTCGTTTTGCAGGGAAAAAC

ref       AGTTCATATCATCTAAGCCAGCAACGAGCATAACACTCGAGAAGCCTAAAACACGATTAC
allele1   AGTTCATATCATCTAAGCCAGCAACGAGCATAACACTCGAGAAGCCTAAAACACGATTAC
allele2   ------------------------------------------------------------
allele3   AGTTCATATCATCTAAGCCAGCAACGAGCATAACACTCGAGAAGCCTAAAACACGATTAC
allele4   AGTTCATATCATCTAAGCCAGCAACGAGCATAACACTCGAGAAGCCTAAAACACGATTAC

ref       TTTATGACTTTATTATTAACTCAGCCTGAATTACCTATTAAGACCAAGATGTTATTCATC
allele1   TTTATGACTTTATTATTAACTCAGCCTGAATTACCTATTAAGACCAAGATGTTATTCATC
allele2   ------------------------------------------------------------
allele3   TTTATGACTTTATTATTAACTCAGCCTGAATTACCTATTAAGACCAAGATGTTATTCATC
allele4   TTTATGACTTTATTATTAACTCAGCCTGAATTACCTATTAAGACCAAGATGTTATTCATC

ref       ATTCTCTTCGATTCCGTTGAGGCCTTTACTGATTTTTTTATTTTCCGACTTTATATGCTC
allele1   ATTCTCTTCGATTCCGTTGAGGCCTTTACTGATTTTTTTATTTTCCGACTTTATATGCTC
allele2   --------------------------------TTTTTTTATTTTCCGACTTTATATGCTC
allele3   ATTCTCTTCGATTCCGTTGAGGCCTTTACTGATTTTTTTATTTTCCGACTTTATATGCTC
allele4   ATTCTCTTCGATTCCGTTGAGGCCTTTACTGATTTTTTTATTTTCCGACTTTATATGCTC

ref       TAGGTTCCATAAAATCTTTTTAATTAAGTTTGTCTCGTTCAACCGAATAAGTAACCGGAA
allele1   TAGGTTCCATAAAATCTTTTTAATTAAGTTTGTCTCGTTCAACCGAATAAGTAACCGGAA
allele2   TAGGTTCCATAAAATCTTTTTAATTAAGTTTGTCTCGTTCAACCGAATAAGTAACCGGAA
allele3   TAGGTTCCATAAAATCTTTTTAATTAAGTTTGTCTCGTTCAACCGAATAAGTAACCGGAA
allele4   TAGGTTCCATAAAATCTTTTTAATTAAGTTTGTCTCGTTCAACCGAATAAGTAACCGGAA

Delete several kb

ref       TGGGTCAGATGTTCGACGATAGCGAACTGCAGGCTCTGATCGACGACAACGATCCGGAGG
allele1   TGGGTCAGATGTTCGACGATAGCGAACTGCAGGCTCTGATCGACGACAACGATCCGGAGG
allele2   TGGGTCAGATGTTCGACGATAGCGAACTGCAGGCTCTGATCGACGACAACGATCCGGAGG
allele3   TGGGTCAGATGTTCGACGATAGCGAACTGCAGGCTCTGATCGACGACAACGATCCGGAGG
allele4   TGGGTCAGATGTTCGACGATAGCGAACTGCAGGCTCTGATCGACGACAACGATCCGGAGG

ref       ACACCGGCAAGGTTAACTTCGACGGCTTCTGCAGCATCGCTGCCCATTTCCTGGAAGAGG
allele1   ACACCGGCAAGGTTAACTTCGACG------------------------------------
allele2   ACACCGGCAAGGTTAACTTCGACGGCTTCTGCAGCATCGCTGCCCATTTCCTGGAAGAGG
allele3   ACACCGGCAAGGTTAACTTCGACGGCTTCTGCAGCATCGCTGCCCATTTCCTGGAAGAGG
allele4   ACACCGGCAAGGTTAACTTCGACGGCTTCTGCAGCATCGCTGCCCATTTCCTGGAAGAGG

ref       AGGATGCCGAGGCCATCCAGAAGGAGCTGAAAGAGGCCTTTCGTCTGTACGATCGCGAGG
allele1   ------------------------------------------------------------
allele2   AGGATGCCGAGGCCATCCAGAAGGAGCTGAAAGAGGCCTTTCGTCTGTACGATCGCGAGG
allele3   AGGATGCCGAGGCCATCCAGAAGGAGCTGAAAGAGGCCTTTCGTCTGTACGATCGCGAGG
allele4   AGGATGCCGAGGCCATCCAGAAGGAGCTGAAAGAGGCCTTTCGTCTGTACGATCGCGAGG

Delete a few hundred bp

ref       ATAAAACCAACAGAAAATCGAGCTTTTGGGTTTAATTATAATTTTTTATTATTTTTAAAT
allele1   ------------------------------------------------------------
allele2   ATAAAACCAACAGAAAATCGAGCTTTTGGGTTTAATTATAATTTTTTATTATTTTTAAAT
allele3   ATAAAACCAACAGAAAATCGAGCTTTTGGGTTTAATTATAATTTTTTATTATTTTTAAAT
allele4   ATAAAACCAACAGAAAATCGAGCTTTTGGGTTTAATTATAATTTTTTATTATTTTTAAAT

ref       AAATTAACTAACCGCTTTAATACAAAACTTTAGTTCTTGTGGTCTGCGAGCAGTAATAAT
allele1   --ATTAACTAACCGCTTTAATACAAAACTTTAGTTCTTGTGGTCTGCGAGCAGTAATAAT
allele2   AAATTAACTAACCGCTTTAATACAAAACTTTAGTTCTTGTGGTCTGCGAGCAGTAATAAT
allele3   AAATTAACTAACCGCTTTAATACAAAACTTTAGTTCTTGTGGTCTGCGAGCAGTAATAAT
allele4   AAATTAACTAACCGCTTTAATACAAAACTTTAGTTCTTGTGGTCTGCGAGCAGTAATAAT

Delete a few hundred bp

ref       TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGGACAACAT
allele1   TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGG-------
allele2   TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGG-------
allele3   TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGG-------
allele4   TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGGACAACAT

ref       CGACATACTGCAACGTAAGCATTATGCCAGATGTCGATACGTAGCCGGCAGACACTGCAG
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   CGACATACTGCAACGTAAGCATTATGCCAGATGTCGATACGTAGCCGGCAGACACTGCAG

Delete a few kb -- the challenge is in this section

ref       TCATAGAGGTAAGACTTTAGAAGTTTGTGTGTGCTTTCGGGTAGGGATATTTTAATTTTA
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   TCATAGAGGTAAGACTTTAGAAGTTTGTGTGTGCTTTCGGGTAGGGATATTTTAATTTTA

ref       AACATTAGGCCGTCGAGCCAGACTTTGT---CGAATGCTTGGGATACGTCTAAAAATACT
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   AACATTAGGCCGTCGAGCCAGACTTTGTAATCGAATGCTTGGGATACGTCTAAAAATACT

ref       GCTGTACAGTATTCGCGATATTCAAATGCAGTTCTTATTTCCGTTGTAATACGATTCACC
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   GCTGTACAGTATTCGCGATATTCAAATGCAGTTCTTATTTCCGTTGTAATACGATTCACC

Delete a few kb - another challenge

ref       TTGCGCTGTGAGACGCCATTGGCGTTCCACGTAGCTATGCGTAAGGTAGCCATTATTTAT
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   TTGCGCTGTGAGACGCCATTGGCGTTCCACGTAGCTATGCGTAAGGTAGCCATTATTTAT

ref       ------------------TTGATTGTTGGGCTACAAGCATTTGTATCAAAAGGTTTTGAT
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   TTGCGGGTGTGCTTGATGTTGATTGTTGGGCTACAAGCATTTGTATCAAAAGGTTTTGAT

ref       TACGCATCATGTCTTGAATGGTGGTCTTCATAAATGTCATAAATTCCATCATGCTCTGTT
allele1   ------------------------------------------------------------
allele2   ------------------------------------------------------------
allele3   ------------------------------------------------------------
allele4   TACGCATCATGTCTTGAATGGTGGTCTTCATAAATGTCATAAATTCCATCATGCTCTGTT

Delete several kb

ref AGCATTTTTATTATTATTGGTGTTGGGTTCCCCTTGTCTACAAAATAGAAAAATCAACCA allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 AGCATTTTTATTATTATTGGTGTTGGGTTCCCCTTGTCTACAAAATAGAAAAATCAACCA

ref TTTAAACTTTCACCCACAGGACCAATCTTATCCTTCTCGTGTCTCTATCATTGGCATCTC allele1 ----------------------------------------------------------TC allele2 ----------------------------------------------------------TC allele3 ----------------------------------------------------------TC allele4 TTTAAACTTTCACCCACAGGACCAATCTTATCCTTCTCGTGTCTCTATCATTGGCATCTC

ref AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT allele1 AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT allele2 AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT allele3 AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT allele4 AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT

Delete several kb

ref AGATTTTCCCATTTCGGTCGTGGGTGTGCCCTGGTCTCTTTGTCTTTCAAGCTGCTACAA allele1 AGATTTTCCCATTTCGGTCGTGGGTGTGCCCTGGTCTCTTTGTCTTTCAAGCTGCTACAA allele2 AGATTTTCCCATTTCGGTCGTGGGTGTGCCCTGGTCTCTTTGTCTTTCAAGCTGCTACAA allele3 AGATTTTCCCATTTCGGTCGTGGGTGTGCCCTGGTCTCTTTGTCTTTCAAGCTGCTACAA allele4 AGATTTTCCCATTTCGGTCGTGGGTGTGCCCTGGTCTCTTTGTCTTTCAAGCTGCTACAA

ref TGCCCACATCAAAA---------------GGTGTTTTTCGTATTTGTTGCGCTTTCGAGG allele1 TGCCCACATCAAAA---------------GGTGTTTTTCGTATTTGTTGCGCTTTCGAGG allele2 TGCCCACATCAAAATGGAAAGAATGGAGGGGTGTTTTTCGTATTTGTTGCGCTTTCGAGG allele3 TGCCCACATCAAAA---------------GGTGTTTTTCGTATTTGTTGCGCTTTCGAGG allele4 TGCCCACATCAAAA---------------GGTGTTTTTCGTATTTGTTGCGCTTTCGAGG

ref GATTTTATCTGTTGTCTGTTGTTTGGCATTTGCTGCGATTTGTCATGGGCGCAACCGACT allele1 GATTTTATCTGTTGTCTGTTGTTTGGCATTTGCTGCGATTTGTCATGGGCGCAACCGACT allele2 GATTTTATCTGTTGTCTGTTGTTTGGCATTTGCTGCGATTTGTCATGGGCGCAACCGACT allele3 GATTTTATCTGTTGTCTGTTGTTTGGCATTTGCTGCGATTTGTCATGGGCGCAACCGACT allele4 GATTTTATCTGTTGTCTGTTGTTTGGCATTTGCTGCGATTTGTCATGGGCGCAACCGACT

delete a few hundred bp

ref AAGTCATATAAAAGATTGTGAGTTAATCTATTAATATTTTAATAAATCCTTAAGT----G allele1 AAGTCATATAAAAGATTGTGAGTTAATCTATTAATATTTTAATAAATCCTTAAGT----G allele2 AAGTCATATAAAAGATTGTGAGTTAATCTATTAATATTTTAATAAATCCTTAAGTTTTAG allele3 AAGTCATATAAAAGATTGTGAGTTAATCTATTAATATTTTAATAAATCCTTAAGT----G allele4 AAGTCATATAAAAGATTGTGAGTTAATCTATTAATATTTTAATAAATCCTTAAGT----G

ref CTTTAAATTGCTTCCATTTATGGCTTAGATGTCTCGCAATATCCGAAGCTATTAACTATC allele1 CTTTAAATTGCTTCCATTTATGGCTTAGATGTCTCGCAATATCCGAAGCTATTAACTATC allele2 CTTTAAATTGCTTCCATTTATGGCTTAGATGTCTCGCAATATCCGAAGCTATTAACTATC allele3 CTTTAAATTGCTTCCATTTATGGCTTAGATGTCTCGCAATATCCGAAGCTATTAACTATC allele4 CTTTAAATTGCTTCCATTTATGGCTTAGATGTCTCGCAATATCCGAAGCTATTAACTATC

Delete several kb

ref CTTGAAATTCAAGTTTTTTTTTCTCACAAAGGTGTGTGCGTGTGTATCTTGATCTTTGAG allele1 CTTGAAATTCAAGTTTTTTTTTCTCACAAAGGTGTGTGCGTGTGTATCTTGATCTTTGAG allele2 CTTGAAATTCAAGTTTTTTTTTCTCACAAAGGTGTGTGCGTGTGTATCTTGATCTTTGAG allele3 CTTGAAATTCAAGTTTTTTTTTCTCACAAAGGTGTGTGCGTGTGTATCTTGATCTTTGAG allele4 CTTGAAATTCAAGTTTTTTTTTCTCACAAAGGTGTGTGCGTGTGTATCTTGATCTTTGAG

ref TTAGCATCGCAGTGCTGTAGTTGTTGTTGTTGTTGTTCGTGCTCGTATTTCTTTAGCTGC allele1 TTAGCATCGCAGTGCTGTAGTTGTTGTTGTTGTTGTTCGTGCTCGTATTTCTTTAGCTGC allele2 TTAGCATCGCAGTGCTGTAGTTGTTGTTGTTGTTGTTCGT-------------------- allele3 TTAGCATCGCAGTGCTGTAGTTGTTGTTGTTGTTGTTCGTGCTCGTATTTCTTTAGCTGC allele4 TTAGCATCGCAGTGCTGTAGTTGTTGTTGTTGTTGTTCGTGCTCGTATTTCTTTAGCTGC

ref TTATTATAACTGACATATCTAGATGGGGAATACGTACGTACGTACATATGGTGGGGAGGT allele1 TTATTATAACTGACATATCTAGATGGGGAATACGTACGTACGTACATATGGTGGGGAGGT allele2 ------------------------------------------------------------ allele3 TTATTATAACTGACATATCTAGATGGGGAATACGTACGTACGTACATATGGTGGGGAGGT allele4 TTATTATAACTGACATATCTAGATGGGGAATACGTACGTACGTACATATGGTGGGGAGGT

delete ~150 bp

ref ACCATTGCCAACTAATAGAACTCCGCCTCCTGTTCTCCCGAGCGCATCCACAAGATAGTA allele1 ACCATTGCCAACTAATAGAACTCCGCCTCCTGTTCTCCCGAGCGCATCCACAAGATAGTA allele2 ------------------------------------------------------------ allele3 ACCATTGCCAACTAATAGAACTCCGCCTCCTGTTCTCCCGAGCGCATCCACAAGATAGTA allele4 ACCATTGCCAACTAATAGAACTCCGCCTCCTGTTCTCCCGAGCGCATCCACAAGATAGTA

ref TCCCACCGCCACCGCCAAACGCATCCCCAAGATATCAGTTTTACAGGCTCTCCGCGGATG allele1 TCCCACCGCCACCGCCAAACGCATCCCCAAGATATCAGTTTTACAGGCTCTCCGCGGATG allele2 --------------------GCATCCCCAAGATATCAGTTTTACAGGCTCTCCGCGGATG allele3 TCCCACCGCCACCGCCAAACGCATCCCCAAGATATCAGTTTTACAGGCTCTCCGCGGATG allele4 TCCCACCGCCACCGCCAAACGCATCCCCAAGATATCAGTTTTACAGGCTCTCCGCGGATG

ref GTCTCATTAGGCGCACAGTAGTCGTTGCTAATTATGGCCTAGACTCTGAGGCTTCCCATT allele1 GTCTCATTAGGCGCACAGTAGTCGTTGCTAATTATGGCCTAGACTCTGAGGCTTCCCATT allele2 GTCTCATTAGGCGCACAGTAGTCGTTGCTAATTATGGCCTAGACTCTGAGGCTTCCCATT allele3 GTCTCATTAGGCGCACAGTAGTCGTTGCTAATTATGGCCTAGACTCTGAGGCTTCCCATT allele4 GTCTCATTAGGCGCACAGTAGTCGTTGCTAATTATGGCCTAGACTCTGAGGCTTCCCATT

Delete several kb

ref ACCGAAAAAACCCCATAGAAATCAGGGGAAATAAAAAAGCACAAAAGGTTGGAGCCAAAA allele1 ACCGAAAAAACCCCATAGAAATCAGGGGAAATAAAAAAGCACAAAAGGTTGGAGCCAAAA allele2 ACCGAAAAAACCCCATAGAAATCAGGGGAAATAAAAAAGCACAAAAGGTTGGAGCCAAAA allele3 ACCGAAAAAACCCCATAGAAATCAGGGGAAATAAAAAAGCACAAAAGGTTGGAGCCAAAA allele4 ACCGAAAAAACCCCATAGAAATCAGGGGAAATAAAAAAGCACAAAAGGTTGGAGCCAAAA

ref AGTAGAACAAGATCGA-------------------------------------------- allele1 AGTAGAACAAGATCGA-------------------------------------------- allele2 AGTAGAACAAGATCGA-------------------------------------------- allele3 AGTAGAACAAGATCGAACAAGCAGAACTAGCGTAGCGTCAACACTGTCTTCCTGATGAGC allele4 AGTAGAACAAGATCGA--------------------------------------------

ref ------------------------------------------------------------ allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 TTTAACAGGGTAAAACTCTGACAAGTAAGTTACAAGTGCTCGCTGTAGAGCTATACCGTG allele4 ------------------------------------------------------------

Delete a few hundred bp

ref ------------------------------------------------------------ allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 GGGACTTAGCGTCGTAAACGACACAAAAACCCGGAGACCCTTCCTCACGCGCAGATATTC allele4 ------------------------------------------------------------

ref -------------------------TCGCTGACTCTCGATTCGATTCGATTTCGTTTGAT allele1 -------------------------TCGCTGACTCTCGATTCGATTCGATTTCGTTTGAT allele2 -------------------------TCGCTGACTCTCGATTCGATTCGATTTCGTTTGAT allele3 CGACATGTTCACAGCCATCACTACGTCGCTGACTCTCGATTCGATTCGATTTCGTTTGAT allele4 -------------------------TCGCTGACTCTCGATTCGATTCGATTTCGTTTGAT

ref TAGACAACAAAGTGCCTGTTTTTTTTTTTTTTTTCTTTGGTTGGCTTACTATTGTTCGGA allele1 TAGACAACAAAGTGCCTGTTTTTTTTTTTTTTTTCTTTGGTTGGCTTACTATTGTTCGGA allele2 TAGACAACAAAGTGCCTGTTTTTTTTTTTTTTTTCTTTGGTTGGCTTACTATTGTTCGGA allele3 TAGACAACAAAGTGCCTGTTTTTTTTTTTTTTTTCTTTGGTTGGCTTACTATTGTTCGGA allele4 TAGACAACAAAGTGCCTGTTTTTTTTTTTTTTTTCTTTGGTTGGCTTACTATTGTTCGGA

Delete several kb

ref AGGCTGCTGCTGTTGCGATTCGAACTCCATTTGGATTCGGAGGCGGCTC------CTCAG allele1 AGGCTGCTGCTGTTGCGATTCGAACTCCATTTGGATTCGGAGGCGGCTCTAAAGCCTCAG allele2 AGGCTGCTGCTGTTGCGATTCGAACTCCATTTGGATTCGGAGGCGGCTC------CTCAG allele3 AGGCTGCTGCTGTTGCGATTCGAACTCCATTTGGATTCGGAGGCGGCTC------CTCAG allele4 AGGCTGCTGCTGTTGCGATTCGAACTCCATTTGGATTCGGAGGCGGCTC------CTCAG

ref GGTCGTTGCACAGCAGCAGCAGTAGTCTGCAACGTCGTTCCCCATTTGTACTTTGCAATG allele1 GGTCGTTGCACAGCAGCAGCAGTAGTCTGCAACGTCGTTCCCCATTTGTACTTTGCAATG allele2 GGTCGTTGCACAGCAGCAGCAGTAGTCTGCAACGTCGTTCCCCATTTGTACTTTGCAATG allele3 GGTCGTTGCACAGCAGCAGCAGTAGTCTGCAACGTCGTTCCCCATTTGTACTTTGCAATG allele4 GGTCGTTGCACAGCAGCAGCAGTAGTCTGCAACGTCGTTCCCCATTTGTACTTTGCAATG

Delete a few kb

ref AATTAGTGCTTTTTACCAGCATAATAATTTCAATTTGAACCATAATCATCTGATATACAC allele1 AATTAGTGCTTTTTACCAGCATAATAATTTCAATTTGAACCATAATCATCTGATATACAC allele2 AATTAGTGCTTTTTACCAGCATAATAATTTCAATTTGAACCATAATCATCTGATATACAC allele3 AATTAGTGCTTTTTACCAGCATAATAATTTCAATTTGAACCATAATCATCTGATATACAC allele4 AATTAGTGCTTTTTACCAGCATAATAATTTCAATTTGAACCATAATCATCTGATATACAC

ref TGTAAGCCAGATCTATTAAGCGCATGGAAGGGAAGCCCTCATC----------------- allele1 TGTAAGCCAGATCTATTAAGCGCATGGAAGGGAAGCCCTCATC----------------- allele2 TGTAAGCCAGATCTATTAAGCGCATGGAAGGGAAGCCCTCATC----------------- allele3 TGTAAGCCAGATCTATTAAGCGCATGGAAGGGAAGCCCTCATCTATGATACGAAACATAG allele4 TGTAAGCCAGATCTATTAAGCGCATGGAAGGGAAGCCCTCATC-----------------

ref ------------------------------------------------------------ allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 CATAGCGATAAGCCGTGGTGCCACGGATAACGTGAGCCGTATTATGTCTAGATTCAATCT allele4 ------------------------------------------------------------

Delete ~300bp

ref ------------------------------------------------------------ allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 GCATCAATGTGGTGATCCAGTCTCAACTGCGATACACAGGGTCGACCCTTTGTCCGACGG allele4 ------------------------------------------------------------

ref -TGCACCTCCAGTGACCCCGACAGTTGGGCTGCATTCGCGCCTGGTGCATCCTTATCCTC allele1 -TGCACCTCCAGTGACCCCGACAGTTGGGCTGCATTCGCGCCTGGTGCATCCTTATCCTC allele2 -TGCACCTCCAGTGACCCCGACAGTTGGGCTGCATTCGCGCCTGGTGCATCCTTATCCTC allele3 TTGCACCTCCAGTGACCCCGACAGTTGGGCTGCATTCGCGCCTGGTGCATCCTTATCCTC allele4 -TGCACCTCCAGTGACCCCGACAGTTGGGCTGCATTCGCGCCTGGTGCATCCTTATCCTC

Delete a couple of kb

ref GAGCTGCAGATGGAAATGGGGATCCGCAGGCAGGAGGATCGAAAGGATCGCAGGGGGGAA allele1 GAGCTGCAGATGGAAATGGGGATCCGCAGGCAGGAGGATCGAAAGGATCGCAGGGGG--- allele2 GAGCTGCAGATGGAAATGGGGATCCGCAGGCAGGAGGATCGAAAGGATCGCAGGGGGGAA allele3 GAGCTGCAGATGGAAATGGGGATCCGCAGGCAGGAGGATCGAAAGGATCGCAGGGGGGAA allele4 GAGCTGCAGATGGAAATGGGGATCCGCAGGCAGGAGGATCGAAAGGATCGCAGGGGGGAA

ref CGCGCCGTCTGTGGCAGCGGCAAAAGGGCACACAAAAACGGCATAAATATTATGGCCAAG allele1 ----------------GCGGCAAAAGGGCACACAAAAACGGCATAAATATTATGGCCAAG allele2 CGCGCCGTCTGTGGCAGCGGCAAAAGGGCACACAAAAACGGCATAAATATTATGGCCAAG allele3 CGCGCCGTCTGTGGCAGCGGCAAAAGGGCACACAAAAACGGCATAAATATTATGGCCAAG allele4 CGCGCCGTCTGTGGCAGCGGCAAAAGGGCACACAAAAACGGCATAAATATTATGGCCAAG

ref CTGCGAACGAGCTGCAGGAGCCAAACATAAACGGGGTCTAGAGATGGTAAACAGGGCATA allele1 CTGCGAACGAGCTGCAGGAGCCAAACATAAACGGGGTCTAGAGATGGTAAACAGGGCATA allele2 CTGCGAACGAGCTGCAGGAGCCAAACATAAACGGGGTCTAGAGATGGTAAACAGGGCATA allele3 CTGCGAACGAGCTGCAGGAGCCAAACATAAACGGGGTCTAGAGATGGTAAACAGGGCATA allele4 CTGCGAACGAGCTGCAGGAGCCAAACATAAACGGGGTCTAGAGATGGTAAACAGGGCATA

Delete several kb

jasperlinthorst commented 7 years ago

I had a quick look. It seems like reveal branches off too quickly (it's not wrong, but I agree it's not ideal) under the default settings. If I specify -e5 you get something that looks good to me: just two complex bubbles, but those seem correct to me. The minor differences in the positioning of the calls can be explained by the fact that reveal does not necessarily left-align indel calls, but I don't think this is wrong. I'll see if I can change the defaults; would that be enough for you?

Cheers, Jasper

tdlong commented 7 years ago

It is now much, much better and very close to the truth. (I get that the INDELs need not be left-aligned; that of course is arbitrary.) It now only struggles with the "challenge" section. This challenge is not crazy; these sorts of things happen in real genomes (imagine an insertion of an ALU in humans where some insertions are full length and capable of transposition and others are DOA; this is an important functional variant).

ref TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGGACAACAT allele1 TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGG------- allele2 TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGG------- allele3 TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGG------- allele4 TACAAACAAAACATATAAGGACATGTTCTATGTTGAGGCTAAAGCCTCGGGGGACAACAT

ref CGACATACTGCAACGTAAGCATTATGCCAGATGTCGATACGTAGCCGGCAGACACTGCAG allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 CGACATACTGCAACGTAAGCATTATGCCAGATGTCGATACGTAGCCGGCAGACACTGCAG

Delete a few kb -- the challenge is in this section

ref TCATAGAGGTAAGACTTTAGAAGTTTGTGTGTGCTTTCGGGTAGGGATATTTTAATTTTA allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 TCATAGAGGTAAGACTTTAGAAGTTTGTGTGTGCTTTCGGGTAGGGATATTTTAATTTTA

ref AACATTAGGCCGTCGAGCCAGACTTTGT---CGAATGCTTGGGATACGTCTAAAAATACT allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 AACATTAGGCCGTCGAGCCAGACTTTGTAATCGAATGCTTGGGATACGTCTAAAAATACT

ref GCTGTACAGTATTCGCGATATTCAAATGCAGTTCTTATTTCCGTTGTAATACGATTCACC allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 GCTGTACAGTATTCGCGATATTCAAATGCAGTTCTTATTTCCGTTGTAATACGATTCACC

Delete a few kb - another challenge

ref TTGCGCTGTGAGACGCCATTGGCGTTCCACGTAGCTATGCGTAAGGTAGCCATTATTTAT allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 TTGCGCTGTGAGACGCCATTGGCGTTCCACGTAGCTATGCGTAAGGTAGCCATTATTTAT

ref ------------------TTGATTGTTGGGCTACAAGCATTTGTATCAAAAGGTTTTGAT allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 TTGCGGGTGTGCTTGATGTTGATTGTTGGGCTACAAGCATTTGTATCAAAAGGTTTTGAT

ref TACGCATCATGTCTTGAATGGTGGTCTTCATAAATGTCATAAATTCCATCATGCTCTGTT allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 TACGCATCATGTCTTGAATGGTGGTCTTCATAAATGTCATAAATTCCATCATGCTCTGTT

Delete several kb

ref AGCATTTTTATTATTATTGGTGTTGGGTTCCCCTTGTCTACAAAATAGAAAAATCAACCA allele1 ------------------------------------------------------------ allele2 ------------------------------------------------------------ allele3 ------------------------------------------------------------ allele4 AGCATTTTTATTATTATTGGTGTTGGGTTCCCCTTGTCTACAAAATAGAAAAATCAACCA

ref TTTAAACTTTCACCCACAGGACCAATCTTATCCTTCTCGTGTCTCTATCATTGGCATCTC allele1 ----------------------------------------------------------TC allele2 ----------------------------------------------------------TC allele3 ----------------------------------------------------------TC allele4 TTTAAACTTTCACCCACAGGACCAATCTTATCCTTCTCGTGTCTCTATCATTGGCATCTC

ref AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT allele1 AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT allele2 AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT allele3 AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT allele4 AAAGAATGGGCGACTTAACTCGTTTAGTTAAAGCGTACAAAAGCTGGCACAAAAATTAAT

It calls this region as:

source sink subgraph ref pos variant allele1.fasta allele2.fasta allele3.fasta allele4.fasta ref.fasta
44 56 44,45,46,47,48,49,50,51,52,53,54,55,56 ref.fasta 59168 N,N 1 1 1 0 0
52 55 52,53,54,55 ref.fasta 64045 G,T - - - 1 0
50 52 50,51,52 ref.fasta 62952 TTGCGGGTGTGCTTGATG,- - - - 0 1
47 50 47,48,49,50 ref.fasta 62057 C,T - - - 1 0
45 47 45,46,47 ref.fasta 61183 AAT,- - - - 0 1

You can see that it actually calls the two deletions in the reference relative to allele4. But it fails to note that a much larger region is deleted in alleles 1-3 relative to ref and allele4.

Maybe it sees it ... but is just expressing what it sees in a funny way. Look at the line whose "source" is labelled 44. It calls an "N,N" variant that is present in alleles 1-3. I think this is the big deletion. It is just calling it an N,N as opposed to a "ACAACAT...TTGGCATC,-".
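
For anyone who wants to pick these calls apart programmatically, here is a minimal Python sketch. It assumes the output is whitespace-separated with exactly the eleven columns printed above; the function names (`parse_rows`, `split_by_path`) are made up for illustration and are not part of reveal.

```python
# Column names as printed in the header row of the output quoted above.
COLUMNS = ["source", "sink", "subgraph", "ref", "pos", "variant",
           "allele1.fasta", "allele2.fasta", "allele3.fasta",
           "allele4.fasta", "ref.fasta"]
SAMPLES = COLUMNS[6:]

# The five rows from this thread, copied verbatim.
ROWS = """\
44 56 44,45,46,47,48,49,50,51,52,53,54,55,56 ref.fasta 59168 N,N 1 1 1 0 0
52 55 52,53,54,55 ref.fasta 64045 G,T - - - 1 0
50 52 50,51,52 ref.fasta 62952 TTGCGGGTGTGCTTGATG,- - - - 0 1
47 50 47,48,49,50 ref.fasta 62057 C,T - - - 1 0
45 47 45,46,47 ref.fasta 61183 AAT,- - - - 0 1
"""


def parse_rows(text):
    """Turn whitespace-separated rows into dicts keyed by COLUMNS."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        rec = dict(zip(COLUMNS, fields))
        rec["pos"] = int(rec["pos"])
        rec["subgraph"] = [int(n) for n in rec["subgraph"].split(",")]
        records.append(rec)
    return records


def split_by_path(rec):
    """Group sample names by their call; '-' means no call was made."""
    groups = {}
    for sample in SAMPLES:
        groups.setdefault(rec[sample], []).append(sample)
    return groups


records = parse_rows(ROWS)
# Rows reported as N,N are the ones being discussed here.
nn_rows = [r for r in records if r["variant"] == "N,N"]
for r in nn_rows:
    print(r["source"], r["sink"], split_by_path(r))
```

On the rows above, the single N,N row (source 44, sink 56) splits the samples into alleles 1-3 on one side and allele4 plus ref on the other, which matches the big deletion in the alignment.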

jasperlinthorst commented 7 years ago

To me, the graph seems to be correct (at least for the part you sent me here). The problem is how to define the deleted part: since we don't know whether the deletion is with respect to ref or allele4, I can only distinguish what I call a 'complex' bubble, through which allele1, allele2 and allele3 follow one path and allele4 and ref the other. For this reason I output the variant as N,N.

Or in other words, the part you call "ACAACAT...TTGGCATC" differs between allele4 and ref.

Within the complex bubble I can then again detect simple bubbles for which I can make a proper genotype call for the subset of allele4 and ref. I think it's a matter of defining the variant calls, but the representation does not seem to be incorrect or suboptimal, see graph.

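[Image: screen shot 2017-06-23 at 13 22 02, https://user-images.githubusercontent.com/130278/27480492-6f448ab8-5818-11e7-82c5-c7d9bae0dd65.png]
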
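The nesting of simple bubbles inside the complex bubble can be read off the "subgraph" column of the calls quoted earlier in this thread. The Python sketch below is only an illustration of that idea, not how reveal detects bubbles internally: it treats one bubble as nested in another when its node set is a proper subset of the other's. The name `enclosing` is hypothetical.

```python
# (source, sink) -> node ids, copied from the "subgraph" column in this thread.
BUBBLES = {
    (44, 56): [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56],
    (52, 55): [52, 53, 54, 55],
    (50, 52): [50, 51, 52],
    (47, 50): [47, 48, 49, 50],
    (45, 47): [45, 46, 47],
}


def enclosing(bubbles):
    """Map each bubble to the bubbles whose node set strictly contains its own."""
    result = {}
    for key, nodes in bubbles.items():
        inner = set(nodes)
        result[key] = [other for other, other_nodes in bubbles.items()
                       if other != key and inner < set(other_nodes)]
    return result


for bubble, parents in enclosing(BUBBLES).items():
    print(bubble, "is nested in", parents)
```

With these five rows, each of the four simple bubbles sits inside (44, 56) and none of them contains another, which is consistent with genotype calls being made only for allele4 and ref within the complex bubble.
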
tdlong commented 7 years ago

Jasper:

Thanks for all your help, this is an amazing piece of software.

I agree. The graph is correct (and, from the point of view of a biologist interested in this region, fairly unambiguous as to what is going on here). So the problem is how to represent this in a “flat file” variant-call type framework. The genotypes are correct, so the 0/1 calls could be used in some sort of genetic analysis, like a GWAS. I now see your point: the way to write this down in VCF format would be tri-allelic throughout the entire region, which is clearly less desirable.

Although the N,N is correct (since the event cannot be unambiguously described), it is not illuminating. The polymorphism is really a “-” versus a complex bubble. I wonder if these events could be scored as an “ID00000001,-”, more like a footnote. ID00000001 could then point into a 2nd file of subgraphs (or code to pull out a subgraph), one for each ID. That way, if some ID was interesting, it would be easy to pull out the subgraph and bubbles for it.
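
A rough Python sketch of what I mean, purely as an illustration: the "ID" prefix with eight digits follows my example above, the function name `label_complex_bubbles` is made up, and writing the second allele as "-" is just my proposed convention, not something reveal does.

```python
def label_complex_bubbles(rows, prefix="ID", width=8):
    """Replace each N,N variant with '<ID>,-' and record ID -> subgraph nodes.

    rows is a list of (variant, nodes) pairs taken from the bubbles output.
    Returns the relabelled rows plus the footnote table for a second file.
    """
    labelled = []
    footnotes = {}
    for variant, nodes in rows:
        if variant == "N,N":
            bubble_id = f"{prefix}{len(footnotes) + 1:0{width}d}"
            footnotes[bubble_id] = list(nodes)
            variant = f"{bubble_id},-"
        labelled.append((variant, nodes))
    return labelled, footnotes


# Variant and subgraph columns of the calls quoted earlier in this thread.
rows = [
    ("N,N", list(range(44, 57))),
    ("G,T", [52, 53, 54, 55]),
    ("TTGCGGGTGTGCTTGATG,-", [50, 51, 52]),
    ("C,T", [47, 48, 49, 50]),
    ("AAT,-", [45, 46, 47]),
]
labelled, footnotes = label_complex_bubbles(rows)
for variant, _ in labelled:
    print(variant)
print(footnotes)
```

The footnote table is the part that would go into the second file: one entry per ID, listing the nodes of the subgraph to pull out.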

Perhaps there is not a simple solution.

Tony

jasperlinthorst commented 7 years ago

Thanks. I agree that there might be better ways to encode these kinds of complex bubble structures. I also think that, in the end, the graph itself is the optimal way to represent them. That is why I also implemented the subcommand 'subgraph', which you can use to extract the subgraph that is formed by the complex bubble (or any other subgraph, for that matter).

You can for instance run:

reveal subgraph 44 45 46 47 48 49 50 51 52 53 54 55 56

This will generate a file called "~tmp.gfa" that contains only the complex bubble for further manual inspection.
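
If you want to generate that command from the "subgraph" column of the bubbles output, something like the following Python sketch would do. It only builds the argument list and prints it; it mirrors the command exactly as written above and adds no other arguments. The helper name `subgraph_command` is hypothetical.

```python
def subgraph_command(subgraph_field):
    """Build the 'reveal subgraph' argument list from a comma-separated node list."""
    nodes = [n.strip() for n in subgraph_field.split(",") if n.strip()]
    for n in nodes:
        if not n.isdigit():
            raise ValueError(f"unexpected node id: {n!r}")
    return ["reveal", "subgraph"] + nodes


# Subgraph column of the complex bubble (source 44, sink 56) from this thread.
cmd = subgraph_command("44,45,46,47,48,49,50,51,52,53,54,55,56")
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run` if you want to execute it rather than print it.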

Jasper

tdlong commented 7 years ago

Awesome.

Now I really like my idea of a second file of IDs, so that this could be done genome-wide.

Remember how we have “.fasta” and “.mfasta”?

Maybe we should have “.gfa” and “.mgfa”:

graphname graph graphname graph

jasperlinthorst commented 7 years ago

In case you're still interested in something similar to this: I added a flag -e to 'reveal bubbles' that writes all complex bubbles in the graph to a separate gfa file (or add -s to write one file per bubble).

Jasper