Closed fbemm closed 5 years ago
Hi @fbemm,
Thanks for you interest in SibeliaZ. I don't know if it is technically OK to keep unaligned sequences in MAF, but I will take a look. Can you give an example of a tool that requires that so I can do some tests?
Hi @ilyaminkin,
one tool would be vgtools. The MAF produced by SibeliaZ could by directly parsed into a variation graph. At best the genomes/paths within the MAF file are complete so that no further post-processing is necessary (it might actually be necessary to remove or flatten duplication blocks but that is another thing).
@fbemm
I received a similar question recently: https://twitter.com/erikgarrison/status/1095953644489461760
It could be possible that you will be better with GFA output. I will take a closer look at vgtools soon.
GFA would be even better!
Recently I wrote a preliminary version of the MAF -> GFA converter, I will probably release it next week.
The last NCBI Hackathon came also up with one. Or is that actually yours? Anyway, happy to test it.
Yes, I wrote it during the Hackathon :) I have to polish the code at bit, but the result seemed to be importable into vg.
I just tested the Hackathon version. In case of a sequence in the fasta file, that is not aligned at all, so does not appear in blocks, there is a key error. Might be worth to consider for the final version. Happy to test it any time.
/maf_to_gfa1.py pangenome.maf pangenome.fasta > pangeno
me.gfa
Traceback (most recent call last):
File "./maf_to_gfa1.py", line 135, in <module>
sequence[seq_id].append((i, len(blocks) - 1, blocks[-1][0]))
KeyError: 'St39-1_v1.CTG522'
The behaviour can be prevented by simply removing unaligned sequences from that sequence file passed to maf_to_gfa1.py. The resulting GFA file however looks incomplete. There are a lot of lines which simply contain "FAIL" and only a single path (P) although there should be at least as much as aligned sequences in the MAF file.
The decomposed blocks look fine (out.maf).
##maf version=1
a
s Species_Strain1.CTG1 19155 82 - 195898 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s Species_Strain2.CTG2 36920 82 - 297574 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s Species_Strain3.CTG3 3569 82 - 19134 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s Species_Strain4.CTG3 25018 82 - 30606 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s Species_Strain5.CTG3 25929 82 - 31604 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s Species_Strain6.CTG1 13302 82 + 41454 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s Species_Strain7.CTG2 290237 82 - 3342806 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
s Species_Strain8.CTG1 77240 82 - 1786785 ATCCGAAAGCTGTTTGGCTGGTTTCCGCTGTTGCCTTTTATTGTCTAGCTCTAGCTGTAGCGGCTCCACTTCGTGCCCATGC
I was wrong paths are all there. vg throws some errors due to missing edges though.
edge 296431 = 296431R to 81283 = 81283R is not present. The GFA file is malformed!
Path Subset -> 296429+,296431+,81283-,511432+
This is related to the fails:
FAIL L 296431 + 81283 - *
I narrowed down the issue and it is actually not related to the GFA parser but overlapping MAF blocks.
@fbemm
Were you able to accomplish what you wanted to using the MAF -> GFA converter?
BTW, the current dev branch is https://github.com/medvedevgroup/SibeliaZ/tree/v1.1.0, it will be merged into master soon. A notice: I switched the order of the arguments in the converter script because it now supports multiple input FASTA files.
Yes it works great. Spotted the new argument order. One thing, although I half requested it, troubles vgtools. Paths with a single node, sequences fully unaligned, are causing a segfault in vgtools respectively the GFA importer. GFA specs talks of paths having nodeS. One could think of writing them to a separate fasta file and supply that to vgtools during GFA import. So it would look like this in the end:
MFA -> GFA + Fasta -> VG
Thanks for pushing this out!!! F
On 18 Apr 2019, Ilia Minkin notifications@github.com wrote:
@fbemm
Were you able to accomplish what you wanted to using the MAF -> GFA converter?
BTW, the current dev branch is https://github.com/medvedevgroup/SibeliaZ/tree/v1.1.0, it will be merged into master soon. A notice: I switched the order of the arguments in the converter script because it now supports multiple input FASTA files.
-- Felix Bemm Sindelfinger Str. 47 72070 Tübingen Tel. +49 173 6655117
@fbemm I am glad it works. One concern I have is that most practical usages of this kind of graph probably relies on the GFA segments being projected to the original FASTA files somehow. In other words, segments should probably have their FASTA coordinates being stored somewhere. I don't know a vg-compatible way of doing so.
Merged v1.1.0 into the master.
Dear Medvedev-Group,
SibeliaZ works fantastic also on large, repetitive genomes. I would like to feature request one thing though. Would it be possible to add unaligned sequences to the output MAF in the end? Some tools require a MAF file to be able to fully recover the input genomes used for the alignment
Thanks, Felix