medvedevgroup / SibeliaZ

A fast whole-genome aligner based on de Bruijn graphs
http://medvedevgroup.com/
Other
141 stars 19 forks source link

Feature Request - Unaligned Sequences #1

Closed fbemm closed 5 years ago

fbemm commented 5 years ago

Dear Medvedev-Group,

SibeliaZ works fantastic also on large, repetitive genomes. I would like to feature request one thing though. Would it be possible to add unaligned sequences to the output MAF in the end? Some tools require a MAF file to be able to fully recover the input genomes used for the alignment

Thanks, Felix

iminkin commented 5 years ago

Hi @fbemm,

Thanks for you interest in SibeliaZ. I don't know if it is technically OK to keep unaligned sequences in MAF, but I will take a look. Can you give an example of a tool that requires that so I can do some tests?

fbemm commented 5 years ago

Hi @ilyaminkin,

one tool would be vgtools. The MAF produced by SibeliaZ could by directly parsed into a variation graph. At best the genomes/paths within the MAF file are complete so that no further post-processing is necessary (it might actually be necessary to remove or flatten duplication blocks but that is another thing).

iminkin commented 5 years ago

@fbemm

I received a similar question recently: https://twitter.com/erikgarrison/status/1095953644489461760

It could be possible that you will be better with GFA output. I will take a closer look at vgtools soon.

fbemm commented 5 years ago

GFA would be even better!

iminkin commented 5 years ago

Recently I wrote a preliminary version of the MAF -> GFA converter, I will probably release it next week.

fbemm commented 5 years ago

The last NCBI Hackathon came also up with one. Or is that actually yours? Anyway, happy to test it.

iminkin commented 5 years ago

Yes, I wrote it during the Hackathon :) I have to polish the code at bit, but the result seemed to be importable into vg.

fbemm commented 5 years ago

I just tested the Hackathon version. In case of a sequence in the fasta file, that is not aligned at all, so does not appear in blocks, there is a key error. Might be worth to consider for the final version. Happy to test it any time.

/maf_to_gfa1.py pangenome.maf pangenome.fasta > pangeno
me.gfa                                                                                                                                                     
Traceback (most recent call last):
  File "./maf_to_gfa1.py", line 135, in <module>
    sequence[seq_id].append((i, len(blocks) - 1, blocks[-1][0]))
KeyError: 'St39-1_v1.CTG522'
fbemm commented 5 years ago

The behaviour can be prevented by simply removing unaligned sequences from that sequence file passed to maf_to_gfa1.py. The resulting GFA file however looks incomplete. There are a lot of lines which simply contain "FAIL" and only a single path (P) although there should be at least as much as aligned sequences in the MAF file.

fbemm commented 5 years ago

The decomposed blocks look fine (out.maf).

##maf version=1

a
s       Species_Strain1.CTG1    19155   82      -       195898  ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s       Species_Strain2.CTG2    36920   82      -       297574  ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s       Species_Strain3.CTG3    3569    82      -       19134   ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s       Species_Strain4.CTG3    25018   82      -       30606   ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s       Species_Strain5.CTG3    25929   82      -       31604   ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s       Species_Strain6.CTG1    13302   82      +       41454   ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s       Species_Strain7.CTG2    290237  82      -       3342806 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s       Species_Strain8.CTG1    77240   82      -       1786785 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
fbemm commented 5 years ago

I was wrong paths are all there. vg throws some errors due to missing edges though.

edge 296431 = 296431R to 81283 = 81283R is not present. The GFA file is malformed!

fbemm commented 5 years ago

Path Subset -> 296429+,296431+,81283-,511432+

This is related to the fails:

FAIL L 296431 + 81283 - *

fbemm commented 5 years ago

I narrowed down the issue and it is actually not related to the GFA parser but overlapping MAF blocks.

iminkin commented 5 years ago

@fbemm

Were you able to accomplish what you wanted to using the MAF -> GFA converter?

BTW, the current dev branch is https://github.com/medvedevgroup/SibeliaZ/tree/v1.1.0, it will be merged into master soon. A notice: I switched the order of the arguments in the converter script because it now supports multiple input FASTA files.

fbemm commented 5 years ago

Yes it works great. Spotted the new argument order. One thing, although I half requested it, troubles vgtools. Paths with a single node, sequences fully unaligned, are causing a segfault in vgtools respectively the GFA importer. GFA specs talks of paths having nodeS. One could think of writing them to a separate fasta file and supply that to vgtools during GFA import. So it would look like this in the end:

MFA -> GFA + Fasta -> VG

Thanks for pushing this out!!! F

On 18 Apr 2019, Ilia Minkin notifications@github.com wrote:

@fbemm

Were you able to accomplish what you wanted to using the MAF -> GFA converter?

BTW, the current dev branch is https://github.com/medvedevgroup/SibeliaZ/tree/v1.1.0, it will be merged into master soon. A notice: I switched the order of the arguments in the converter script because it now supports multiple input FASTA files.

-- Felix Bemm Sindelfinger Str. 47 72070 Tübingen Tel. +49 173 6655117

iminkin commented 5 years ago

@fbemm I am glad it works. One concern I have is that most practical usages of this kind of graph probably relies on the GFA segments being projected to the original FASTA files somehow. In other words, segments should probably have their FASTA coordinates being stored somewhere. I don't know a vg-compatible way of doing so.

iminkin commented 5 years ago

Merged v1.1.0 into the master.