medvedevgroup / SibeliaZ

A fast whole-genome aligner based on de Bruijn graphs
http://medvedevgroup.com/

Overlapping MAF blocks #6

Closed: fbemm closed this issue 5 years ago

fbemm commented 5 years ago

I have the following MAF blocks:

s SP_St1-2_v1.CTG1 972562 116 - 974421     CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St9-1_v1.CTG4 2007581 116 + 3546038   CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St21-1_v1.CTG2 2011797 116 + 3562354  CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St14-3_v1.CTG53 117896 116 + 301556   CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St3-3_v1.CTG1 2007174 116 - 3540787   CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St1-2_v1.CTG272 3796 116 - 329270     CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St22-2_v1.CTG59 1546482 116 - 2763569 CGTCCTAGTGGTCCATCCATCCTCTGCTAAGGTATACGTCCTTACTAGAGGACAAGTCTATTACGTATGCATAAGCAGTCTCTTCGTACCCATGCATCTTAAACGAGAATCAGGCA
s SP_St9-1_v1.CTG4 1538259 104 - 3546038   ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St3-3_v1.CTG1 1533415 104 + 3540787   ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St22-2_v1.CTG59 1216889 104 + 2763569 ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St1-2_v1.CTG272 325276 104 + 329270   ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St21-1_v1.CTG2 1550359 104 - 3562354  ----ACCGCCACTATCTAGAGCGCTTTTAGATCCCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
s SP_St1-2_v1.CTG1 1658 107 + 974421       GCTTACCGCCACTATCTAGAGCGCTTTTAGAT-CCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG

Looking at SP_St1-2_v1.CTG1, it seems that the two subsequences from the blocks overlap:

CCCTCTTAGATAATTAAGGAAGTGAACAGCAAGCTATTAGAGATTAGCGAGGTTGCCTGATTCTCGTTTAAGATG
                                                      GCCTGATTCTCGTTTAAGATGCATGGGTACGAAGAGACTGCTTATGCATACGTAATAGACTTGTCCTCTAGTAAGGACGTATACCTTAGCAGAGGATGGATGGACCACTAGGACG

The subsequence from block 1 is shown as its reverse complement.

Is this intended behaviour? It is also what causes #1.
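
The overlap can also be checked from the coordinates alone. The following is a minimal sketch, not SibeliaZ code: it assumes the standard MAF convention that, for a "-" strand row, the start is counted from the end of the source sequence, so the forward-strand start is srcSize - start - size. The names MafInterval, forward_interval and overlap_length are hypothetical, and the sequence column is truncated in the example rows because only the first six fields are used.

```python
from typing import NamedTuple


class MafInterval(NamedTuple):
    src: str
    start: int  # 0-based, forward strand
    end: int    # exclusive, forward strand


def forward_interval(s_line: str) -> MafInterval:
    """Convert one MAF 's' row to a half-open forward-strand interval."""
    _, src, start, size, strand, src_size = s_line.split()[:6]
    start, size, src_size = int(start), int(size), int(src_size)
    if strand == "-":
        # Assumed MAF convention: minus-strand starts are relative to
        # the reverse-complemented source, so flip to forward coordinates.
        start = src_size - start - size
    return MafInterval(src, start, start + size)


def overlap_length(a: MafInterval, b: MafInterval) -> int:
    """Number of shared bases, or 0 if the intervals are disjoint."""
    if a.src != b.src:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


# The two SP_St1-2_v1.CTG1 rows from the blocks above (sequence truncated).
block1 = forward_interval("s SP_St1-2_v1.CTG1 972562 116 - 974421 CGTCC")
block2 = forward_interval("s SP_St1-2_v1.CTG1 1658 107 + 974421 GCTTA")
print(block1, block2, overlap_length(block1, block2))
```

For these two rows the sketch gives the forward-strand intervals [1743, 1859) and [1658, 1765), which share 22 bases.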

iminkin commented 5 years ago

You can use the "no_overlap" branch, where I fixed the issue: https://github.com/medvedevgroup/SibeliaZ/tree/no_overlap

I also pushed a slightly refactored version of the converter: https://github.com/medvedevgroup/SibeliaZ/blob/no_overlap/SibeliaZ-LCB/maf_to_gfa1.py

I will merge this fix into master this week, along with some other improvements. I finally have time to work on the code :)

fbemm commented 5 years ago

Good thing I checked the dev branches :facepalm: