jasperlinthorst / reveal

Graph based multi genome aligner
MIT License

Iterative refinement for vg #30

Closed: fbemm closed this issue 5 years ago

fbemm commented 5 years ago

Hey Jasper,

I am trying to refine a multi-genome graph. Does it make sense to run probcons and rem iteratively while decreasing the -m for rem? Also, to generate a vg-compatible graph, I would need the original paths (of the input fasta). Since they are not carried over during rem from the transformed GFA (*-prefixed there), I would need to skip the transform steps, correct? Or is there a way to project the rem paths back to the original fasta paths? ... I somehow feel that I asked this before :)

Hope you are fine! Felix

fbemm commented 5 years ago

Mh, I just realized that unzip and refine are actually removing the *-prefixed paths. Is there a chance to prevent that?
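To see which paths survive a given step, it can help to list the path names in the GFA before and after. The sketch below is a minimal illustration, not part of reveal: it assumes paths are written as standard GFA1 `P` lines and that the untransformed paths are the ones whose name starts with `*`, as described in this thread. The function name `split_paths` and the example records are hypothetical.

```python
def split_paths(gfa_lines, prefix="*"):
    """Return (prefixed, plain) lists of path names found on GFA1 P-lines.

    Assumption: paths are stored as tab-separated P-lines, with the path
    name in the second field.
    """
    prefixed, plain = [], []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip everything that is not a path record (H, S, L, ... lines).
        if fields[0] != "P" or len(fields) < 3:
            continue
        name = fields[1]
        (prefixed if name.startswith(prefix) else plain).append(name)
    return prefixed, plain


# Hypothetical toy graph: two segments, one link, two paths.
example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tTT",
    "L\t1\t+\t2\t+\t0M",
    "P\t*sampleA.fasta\t1+,2+\t0M",
    "P\tsampleA.fasta\t1+,2+\t0M",
]
prefixed, plain = split_paths(example)
print("prefixed:", prefixed)
print("plain:", plain)
```

Running the same function on the GFA before and after a step such as unzip or refine, and comparing the two `prefixed` lists, would show whether the `*`-prefixed paths were kept.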

jasperlinthorst commented 5 years ago

Hi Felix, First of all, great that you figured all these things out without proper documentation... :)

For the rem part the *-prefixed paths, that correspond to the untransformed input sequence, should be carried over to the gfa that is produced by the rem subcommand. Quite a lot changed here, so check out the dev-branch, maybe this is the reason it's not working for you. If so, let me know, then I'll do a new release.

For unzip and refine indeed you are right that these paths are at the moment not maintained, simply for the reason of simplifying the implementation. Conceptually this is all well possible, but as my current group does not see this as a priority I can't give this any attention at the moment.

To answer your question about iterative runs of refine, I don't really see a lot of use in doing that. I think it's more useful to stick with a single run in which you fix the confidence (at e.g. 90) of the probabilistic multiple sequence alignment, such that unconfidently aligned columns are not merged in the graph representation. This way, an allele like a VNTR (with many equally likely multiple sequence alignments) simply ends up as a multi-allelic bubble/locus, instead of a multitude of small indel events.

Cheers, Jasper

On Tue, 26 Feb 2019 at 16:40, Felix Bemm notifications@github.com wrote:

Mh, I just realize that unzip and refine are actually removing the prefix versions. Is there a chance to prevent that?

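The idea of fixing a confidence threshold, so that low-confidence columns stay unmerged and a VNTR ends up as one multi-allelic bubble, can be illustrated with a toy example. This is only a sketch under stated assumptions and not how reveal implements refine: it assumes a gapped multiple sequence alignment given as equal-length strings plus a hypothetical per-column confidence list on a 0 to 100 scale, and the name `segment_alignment` is made up for illustration.

```python
def segment_alignment(rows, confidence, min_conf=90):
    """Split alignment columns into confident blocks and bubbles.

    rows       : gapped sequences of equal length ('-' marks a gap)
    confidence : one confidence value per alignment column (assumed input)
    Returns a list of (kind, alleles) tuples, where kind is "aligned" for
    a run of columns at or above min_conf and "bubble" for a run below it,
    and alleles holds each sequence's gap-free substring for that run.
    """
    assert all(len(r) == len(confidence) for r in rows)
    segments = []
    start = 0
    for i in range(1, len(confidence) + 1):
        # Close the current run at the end of the alignment, or when the
        # next column falls on the other side of the threshold.
        if i == len(confidence) or (
            (confidence[i] >= min_conf) != (confidence[start] >= min_conf)
        ):
            kind = "aligned" if confidence[start] >= min_conf else "bubble"
            alleles = [r[start:i].replace("-", "") for r in rows]
            segments.append((kind, alleles))
            start = i
    return segments


# Toy VNTR: a CA repeat with 3, 2 and 2 copies, gapped in different places.
rows = [
    "ACGCACACATT",
    "ACG--CACATT",
    "ACGCACA--TT",
]
confidence = [99, 99, 99, 40, 40, 40, 40, 40, 40, 98, 98]
segments = segment_alignment(rows, confidence)
for kind, alleles in segments:
    print(kind, alleles)
```

In this toy case the ambiguous repeat columns collapse into a single bubble with the alleles `CACACA` and `CACA`, rather than being reported as separate small indels whose placement depends on an arbitrary gap position.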

fbemm commented 5 years ago

Hi Jasper,

> For the rem part the *-prefixed paths, that correspond to the untransformed input sequence, should be carried over to the gfa that is produced by the rem subcommand. Quite a lot changed here, so check out the dev-branch, maybe this is the reason it's not working for you. If so, let me know, then I'll do a new release.

It does, I simply overlooked it. Also, I am always working on the dev branch ;)

> For unzip and refine indeed you are right that these paths are at the moment not maintained, simply for the reason of simplifying the implementation. Conceptually this is all well possible, but as my current group does not see this as a priority I can't give this any attention at the moment.

Completely reasonable! Especially since most people would project variants to a single reference for downstream analysis anyway. I started to work on a "reveal variants" to VCF converter. This is not planned on your side yet, right?

> To answer your question about iterative runs of refine, I don't really see a lot of use in doing that. I think it's more useful to stick with a single run in which you fix the confidence (at e.g. 90) of the probabilistic multiple sequence alignment, such that unconfidently aligned columns are not merged in the graph representation. This way, an allele like a VNTR (with many equally likely multiple sequence alignments) simply ends up as a multi-allelic bubble/locus, instead of a multitude of small indel events.

I also agree here. It does make sense to leave them as bubbles. If one wants to use the graph further in vg, it is still possible to call SNPs or the like on it.

Cheers, Felix
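For readers wondering what the target of such a converter could look like, here is a rough sketch of writing one multi-allelic bubble as a VCF data line. It does not parse the output of "reveal variants", whose format is not described in this thread; the input tuple (chromosome, 1-based position, reference allele, alternative alleles) and the function name `bubble_to_vcf_line` are assumptions made purely for illustration. Alleles are assumed to already include the anchor base that VCF requires for indels.

```python
def bubble_to_vcf_line(chrom, pos, ref, alts):
    """Format one multi-allelic site as a VCF data line without genotypes."""
    # Keep each non-reference allele once, preserving first-seen order.
    alts = [a for a in dict.fromkeys(alts) if a != ref]
    if not alts:
        raise ValueError("site has no alternative allele")
    # Columns: CHROM POS ID REF ALT QUAL FILTER INFO ('.' means missing).
    return "\t".join([chrom, str(pos), ".", ref, ",".join(alts), ".", ".", "."])


header = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
# Hypothetical bubble: reference carries three CA copies, others two or one.
line = bubble_to_vcf_line("chr1", 3, "GCACACA", ["GCACA", "GCACA", "GCA"])
print("\n".join(header + [line]))
```

Keeping the whole bubble as a single record with comma-separated ALT alleles matches the point made above about leaving VNTR-like loci as one multi-allelic site.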