jasperlinthorst / reveal

Graph based multi genome aligner
MIT License

Is there a difference between aligning sequences directly and aligning previously generated graphs? #5

Closed sailgu closed 6 years ago

sailgu commented 8 years ago

Hi, I have six genomes, each about 1 Gb, labeled a to f. To save time and memory, can I align a, b, c and d, e, f separately, and then align the two resulting graphs? Would the result be any different?

jasperlinthorst commented 8 years ago

Hi juyouhui, the aim is of course to minimise this, but the answer is that there could be a difference (and, given the size of your genomes, most likely there is). The reason is that every alignment step loses some information: nodes become smaller, and the selection of MUMs occasionally changes slightly. It all depends on the amount and type of variation between your genomes, but my guess is that for the majority of your graph the 'simple' bubbles will be exactly the same whichever way you construct your multi-genome graph.

I'm currently working on some features that are aimed at fixing this for exactly the type of case you are describing, as I'm convinced that for these typical cases we'll be able to produce exactly the same graph in both ways. Let me know if I could use your dataset as a testcase, for these features, as that would be really helpful to me.

To check how much information you are losing in the graph alignment with respect to the 'true' multi-sequence alignment, you can rename one of the sequence files contained in the graph and simply align it back to the graph. You'd expect to see 100% identity in that alignment, and the resulting graph to contain exactly the same number of nodes, but you'll probably see that on some occasions a slightly different path is chosen...

So, for instance:

reveal align a_b_c.gfa a_renamed.fa
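The node-count comparison described above could be scripted roughly as follows. This is only a sketch, not part of reveal: it assumes the graphs are plain GFA text in which each node is a segment record (a line whose first tab-separated field is `S`), and the helper names `count_segments` and `compare_node_counts` are made up for illustration.

```python
def count_segments(gfa_lines):
    """Count segment (node) records: lines whose first tab-separated field is 'S'."""
    return sum(1 for line in gfa_lines if line.split("\t", 1)[0].strip() == "S")


def compare_node_counts(before_lines, after_lines):
    """Return (nodes before, nodes after, whether the counts match)."""
    before = count_segments(before_lines)
    after = count_segments(after_lines)
    return before, after, before == after


# Tiny in-memory example standing in for two GFA files.
original = ["H\tVN:Z:1.0\n", "S\t1\tACGT\n", "S\t2\tGG\n", "L\t1\t+\t2\t+\t0M\n"]
realigned = original + ["S\t3\tT\n"]  # one extra node after aligning back

print(compare_node_counts(original, realigned))  # (2, 3, False)
```

With real data you would pass open file handles instead of lists, for example `count_segments(open("a_b_c.gfa"))`, and do the same for whatever graph file the realignment produces. Equal node counts are only a quick sanity check; they do not show that the two graphs are identical.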

Anyway, make sure you update to the latest version of the code as a lot has been improved on this topic very recently.

Cheers, Jasper

sailgu commented 8 years ago

I'm afraid my boss won't allow me to upload the data, but I'd be very glad to help by running the test and sending you the two results. I've now run into a problem, though. Is this because there isn't enough memory?

Reveal.error: Realloc for T failed.

jasperlinthorst commented 8 years ago

Yes, that means you're running out of memory. I think it's best to first generate a pairwise alignment of two of them and check the result. Maybe make some plots using "reveal plot a b -i". Are your genomes fragmented into multiple contigs or chromosomes, or are they 'complete 1 Gb single contigs'?

sailgu commented 8 years ago

They are fragmented into multiple scaffolds. How much memory does reveal need?

jasperlinthorst commented 8 years ago

13/15 bytes per base for the index, plus the graph.
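As a rough back-of-the-envelope check, the figure above can be turned into a quick estimate. This sketch rests on two assumptions that are not confirmed in this thread: that "13/15" means somewhere between 13 and 15 bytes per base, and that the index covers the combined length of all sequences aligned in one run. The memory for the graph itself is not included, and the function names are hypothetical.

```python
def index_memory_gib(total_bases, bytes_per_base):
    """Index memory in GiB for a given number of bases and per-base cost."""
    return total_bases * bytes_per_base / 2**30


def estimate_index_range(genome_sizes, low=13, high=15):
    """Lower and upper index estimate (GiB), assuming 13 to 15 bytes per base."""
    total = sum(genome_sizes)
    return index_memory_gib(total, low), index_memory_gib(total, high)


# Three genomes of roughly 1 Gb each, as in the a, b, c case.
low, high = estimate_index_range([1_000_000_000] * 3)
print(f"index only: {low:.1f} to {high:.1f} GiB")
```

Under these assumptions, aligning three 1 Gb genomes at once would need on the order of 36 to 42 GiB for the index alone, which may explain the failed realloc on a smaller machine.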

Since there's no functionality yet to cluster scaffolds between multiple samples, you'll first have to figure out a global mapping between your scaffolds, then the size of your problem most likely reduces a lot as well. Reveal at the moment just aligns two or more single contigs. There are plans to implement these things, but time is short unfortunately...


jasperlinthorst commented 6 years ago

The 'finish' subcommand now addresses the part about draft genomes. For the order of aligning graphs and genomes, there's no proper answer, but the 'realign' subcommand should help in this direction as well. I'm closing this issue.