jasperlinthorst / reveal

Graph based multi genome aligner
MIT License

Difference between the original sequence and the alignment content #9

Closed LeilyR closed 7 years ago

LeilyR commented 7 years ago

Hi! I used the tool to align a couple of E. coli genomes against each other. The result didn't match the original sequences I used as input, and I am wondering if you know how that happened. I have attached both the GFA file created by REVEAL and the FASTA file that I used. It seems S 4 contains C, while the base at the same position in the FASTA file is something different. Is that how it is supposed to be? Thanks a lot! Leily

nc000913.3.fasta.gz

reveal.gfa.gz

jasperlinthorst commented 7 years ago

Hi Leily, I had a look at the graph you sent, and to me it looks OK. What I did was the following:

reveal extract reveal.gfa nc000913.3.fasta --width 70 > tmp.fasta

This extracts the sequence along the path through the graph that corresponds to nc000913.3.fasta.

Now tmp.fasta seems to be exactly the same as the nc000913.3.fasta file. Also, when I align tmp.fasta with nc000913.3.fasta using:

reveal align nc000913.3.fasta tmp.fasta

It finds that they are exactly the same.
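If you want to double-check this without REVEAL, a plain comparison that ignores line wrapping should give the same answer. The following is only a minimal sketch: the helper names `read_fasta` and `same_sequences` are made up for illustration, and it assumes simple FASTA input where every record starts with a `>` header line.

```python
def read_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA lines.

    Wrapped sequence lines are joined, so the line width does not matter.
    """
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def same_sequences(lines_a, lines_b):
    """True if both inputs hold the same sequences in the same order.

    Headers and line widths are ignored; only the bases are compared.
    """
    seqs_a = list(read_fasta(lines_a).values())
    seqs_b = list(read_fasta(lines_b).values())
    return seqs_a == seqs_b


# Hypothetical usage with the files from this thread:
# with open("nc000913.3.fasta") as a, open("tmp.fasta") as b:
#     print(same_sequences(a, b))
```

The comparison is case-sensitive on purpose, so a soft-masked (lowercase) copy would be reported as different.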

About "S 4", this refers to segment (or node) 4 in the graph.

S 4 C * ORI:Z:0;2 OFFSETS:Z:1806;1806 RC:i:2

From this line you can derive that both nc000913.3.fasta and nc010473.fasta (the two origins listed in ORI) have a C at position 1806 (zero-based, as given in OFFSETS).
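To make that reading of the segment line concrete, here is a rough Python sketch. It assumes, based only on the single line quoted above, that `ORI` and `OFFSETS` are `;`-separated lists paired by position and that offsets are zero-based. The function names are hypothetical, and which input file belongs to which `ORI` index is not shown in this thread.

```python
def parse_segment(line):
    """Parse a GFA 'S' line into (segment id, sequence, {tag: value})."""
    fields = line.split()
    if not fields or fields[0] != "S":
        raise ValueError("not a segment line")
    seg_id, seq = fields[1], fields[2]
    tags = {}
    for field in fields[3:]:
        parts = field.split(":", 2)
        if len(parts) != 3:
            continue  # e.g. the '*' placeholder column
        tag, typ, value = parts
        tags[tag] = int(value) if typ == "i" else value
    return seg_id, seq, tags


def segment_offsets(line):
    """Return (sequence, {origin index: zero-based offset}) for an 'S' line.

    Assumes ORI and OFFSETS are ';'-separated and paired by position.
    """
    _, seq, tags = parse_segment(line)
    origins = [int(x) for x in tags["ORI"].split(";") if x]
    offsets = [int(x) for x in tags["OFFSETS"].split(";") if x]
    return seq, dict(zip(origins, offsets))


def matches_at(genome, seq, offset):
    """True if `seq` occurs in `genome` starting at zero-based `offset`."""
    return genome[offset:offset + len(seq)] == seq


line = "S 4 C * ORI:Z:0;2 OFFSETS:Z:1806;1806 RC:i:2"
seq, offsets = segment_offsets(line)
print(seq, offsets)  # C {0: 1806, 2: 1806}
```

Note that a zero-based offset of 1806 is the 1807th base when counting from one, which is an easy place to be off by one when cutting the sequence by hand.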

I generally use Bandage (https://github.com/rrwick/Bandage) to visualise the alignment graph. I hope this helps. Let me know if I misunderstood something...

Cheers, Jasper

LeilyR commented 7 years ago

Hi! Thanks a lot for the reply. I will check it again. I might have made a mistake in cutting the sequence. Also thanks a lot for the link! Best, Leily